3.3 Estimator quality, confidence sets and bootstrapping

A comparison of two estimators is always a matter of comparing their respective distributions. A good estimator is an estimator with a distribution (under $P_\theta$) that is closely centred around $\theta$ for all $\theta \in \Theta$. It is worth noticing that an estimator should work well for all values of the parameter. Fixing a single value, $\theta_0 \in \Theta$ say, we can define the estimator $\hat\theta$ by $\hat\theta(x) = \theta_0$ for all $x$. This estimator works extremely well when $\theta = \theta_0$, but it works terribly if $\theta$ is far from $\theta_0$. Problems like this make it troublesome in general to decide what properties an optimal estimator should have. Some attempts to define good and bad properties of an estimator in terms of its distribution are described in the next section.

The distribution of an estimator is not only used for comparing estimators but also for judging how uncertain an estimate is. This is usually done via confidence intervals. The construction of confidence intervals relies in principle on explicit knowledge about the distribution of the estimator, something we often don't have, but approximating methods can be used instead. One of these, bootstrapping, is a generally applicable method that in practice relies only on the ability to do some simulations.

Mean Squared Error

The expectation of a random variable $X$ under the probability measure $P_\theta$ is denoted $E_\theta X$.

Definition: If $\hat\theta$ is an estimator of a real valued parameter, $\Theta \subseteq \mathbb{R}$, then the mean squared error of the estimator is

$$\mathrm{MSE}_\theta(\hat\theta) = E_\theta(\hat\theta - \theta)^2.$$

Note that the mean squared error is a function of $\theta$. For a given $\theta$ the number $\mathrm{MSE}_\theta(\hat\theta)$ is a measure of how far the estimator on average is from $\theta$. This is a reasonable measure of how good (or bad) the estimator is: the smaller $\mathrm{MSE}_\theta(\hat\theta)$, the better. As shown in (2.11) the mean squared error can be decomposed into two terms:

$$\mathrm{MSE}_\theta(\hat\theta) = V_\theta\hat\theta + (E_\theta\hat\theta - \theta)^2.$$
The first term is by definition the variance of the estimator, and the second term is called the squared bias of the estimator. An estimator is called unbiased if the second term vanishes for all $\theta$, i.e. an estimator is unbiased if

$$E_\theta\hat\theta = \theta \quad \text{for all } \theta \in \Theta.$$

Unbiasedness seems to be a desirable property, an unbiased estimator is on average right, and a lot of effort has been invested in developing bias-reducing methods. We saw for instance that the MLE of the variance for a normal distribution was corrected

to give an unbiased estimator of the variance. Changing the estimator to remove the squared bias term from $\mathrm{MSE}_\theta(\hat\theta)$ may, however, increase the variance of the estimator to such an extent that the mean squared error becomes larger. If the objective is to minimise the mean squared error of the estimator, one should try to find a suitable tradeoff between the squared bias term and the variance term. This bias-variance tradeoff is a matter of balancing the two types of errors: the systematic error (the bias) and the random error (measured by the variance).

Example: If $X_1,\ldots,X_n$ under $P_p$ are iid Bernoulli variables, $p \in [0,1]$, with the parameter $p$ being the probability that $X_i$ equals 1, then

$$\hat p = \frac{1}{n}\sum_{i=1}^n X_i$$

is an estimator of $p$ (in fact the MLE). The distribution of $n\hat p = \sum_{i=1}^n X_i$ under $P_p$ is a binomial distribution with success parameter $p$ and size parameter $n$, but we don't need this information to find directly that

$$E_p\hat p = \frac{1}{n}\sum_{i=1}^n E_p X_i = \frac{1}{n}\, np = p,$$

so the estimator is unbiased. The mean squared error of the estimator therefore equals the variance:

$$\mathrm{MSE}_p(\hat p) = V_p\hat p = \frac{1}{n^2}\sum_{i=1}^n V_p X_i = \frac{1}{n^2}\, np(1-p) = \frac{p(1-p)}{n}.$$

Example (Evolutionary distance): We continue the earlier example and consider the MLE

$$\hat t = \frac{1}{4\alpha}\log\frac{3n}{3n_1 - n_2}$$

for the evolutionary distance under the Jukes-Cantor model. Note that the estimator is only defined when $3n_1 > n_2$. The distribution, and hence the mean and variance, of this estimator is therefore not really defined, but if we condition on the event that $3n_1 > n_2$ we can find the conditional distribution and hence the conditional mean and variance of the estimator. Rewriting the formula for the estimator gives

$$\hat t = \frac{1}{4\alpha}\bigl(\log 3n - \log(4n_1 - n)\bigr),$$

where we have used that $n_1 + n_2 = n$. Recall that $n_1$ is the number of pairs $(x_i, y_i)$ with $x_i = y_i$. Introducing the Bernoulli random variable $Z_i = 1(X_i = Y_i)$, which is one if $X_i = Y_i$ and zero otherwise, we see that $n_1$ is the realisation of $N_1 = \sum_{i=1}^n Z_i$.
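The conditional mean of $\hat t$ given $3n_1 > n_2$ (equivalently $4N_1 > n$) can be evaluated numerically by summing over the binomial distribution of $N_1$. A minimal sketch, assuming the values $\alpha = 10^{-4}$ and $n = 100$ used in Figures 3.6 and 3.7; the function name is illustrative:

```python
import math

def jc_conditional_mean(t, alpha=1e-4, n=100):
    """Numerically evaluate E_t(t-hat | 4*N1 > n) for the Jukes-Cantor MLE.

    N1 ~ Binomial(n, p(t)) with p(t) = 1/4 + 3/4 * exp(-4*alpha*t);
    the MLE t-hat = (log(3n) - log(4*n1 - n)) / (4*alpha) exists only
    when 4*n1 > n, so we condition on that event.
    """
    p = 0.25 + 0.75 * math.exp(-4 * alpha * t)
    num, prob = 0.0, 0.0
    for n1 in range(n // 4 + 1, n + 1):  # all n1 with 4*n1 > n
        w = math.comb(n, n1) * p**n1 * (1 - p)**(n - n1)
        num += math.log(4 * n1 - n) * w
        prob += w
    return (math.log(3 * n) - num / prob) / (4 * alpha)

print(jc_conditional_mean(100.0))   # close to the true value t = 100
print(jc_conditional_mean(9000.0))  # markedly biased for large t
```

The same sum with $\log(4n_1 - n)^2$-type terms gives the conditional variance; this is exactly the "computational" use of the formulas referred to below.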

Figure 3.6: The graphs show the squared bias (left) and the variance (right) of the MLE as a function of the time parameter $t$ in a Jukes-Cantor model with $\alpha = 10^{-4}$ when having observed 100 aligned letters. The bias as well as the variance are in principle computed conditionally on $3n_1 > n_2$, but for $t$ in the range considered here this event essentially has probability 1. The estimator is clearly biased, and the bias and variance both increase as $t$ increases. The mean squared error of the estimator is the sum of the squared bias and the variance. The bias is, for $t$ in this range, negligible compared to the variance. See also Figure 3.7.

The distribution of $N_1$ under $P_t$ is a binomial distribution with size parameter $n$ and success probability

$$p(t) = \frac{1}{4} + \frac{3}{4}\exp(-4\alpha t).$$

Regarded as a random variable, the distribution of

$$\hat t = \frac{1}{4\alpha}\bigl(\log 3n - \log(4N_1 - n)\bigr)$$

conditionally on $4N_1 > n$ is given as a transformation of a binomial distribution conditionally on $4N_1 > n$. For instance we find that the conditional mean is

$$E_t(\hat t \mid 4N_1 > n) = \frac{1}{4\alpha}\bigl(\log 3n - E_t(\log(4N_1 - n) \mid 4N_1 > n)\bigr),$$

where

$$E_t(\log(4N_1 - n) \mid 4N_1 > n) = \frac{\sum_{n_1 : 4n_1 > n} \log(4n_1 - n)\binom{n}{n_1} p(t)^{n_1}(1 - p(t))^{n - n_1}}{P_t(4N_1 > n)}.$$

The variance can be represented likewise. It is difficult to use this formula and the corresponding variance formula to get any further theoretically, but they can be used computationally. Figure 3.6 shows how the (conditional) bias and variance depend upon the time

Figure 3.7: The graphs show the squared bias (left) and the variance (right) of the MLE as a function of the time parameter $t$ in a Jukes-Cantor model with $\alpha = 10^{-4}$ when having observed 100 aligned letters. Compare with Figure 3.6. It is noticeable that the variance suddenly drops around $t = 9000$ while the squared bias steadily increases. This is explained by the condition $3n_1 > n_2$ for the existence of the MLE. Note also that the squared bias is not negligible compared to the variance.

parameter $t$ in the range 1 to 1000 for $\alpha = 10^{-4}$ and $n = 100$. The MLE is clearly biased, but we observe that in this range the bias is negligible compared to the variance. Figure 3.7 shows another picture for $t$ in the range from 8000 and up. Here the variance reaches a maximum at around 9000 and declines thereafter, whereas the squared bias increases steadily. The squared bias is no longer negligible. The explanation is that when $t$ gets large the event $3n_1 > n_2$ occurs with probability approaching 0.5 (and the probability is in particular less than 1, also from a practical point of view). The estimator will, when $3n_1 > n_2$, take values less than (but close to) its maximal possible value, which is attained at the smallest $n_1$ with $4n_1 > n$ and for $n = 100$ equals $2500\log 75 \approx 10794$. The result is that as $t$ grows the squared bias increases whereas the variance of the estimator stays bounded.

Confidence intervals

If $(P_\theta)_{\theta \in \Theta}$ is a parameterised family of probability measures on $E$, and if we have an observation $x \in E$, then an estimator $\hat\theta : E \to \Theta$ produces an estimate $\hat\vartheta = \hat\theta(x) \in \Theta$. If the observation came to be as a realisation of an experiment that was governed by one

Figure 3.8: These figures show the density for the distribution of the estimator $\hat\mu$ regarded as a two-dimensional function, see Example 3.3.4, with $n = 10$ (left) and $n = 50$ (right). The darker a colour, the higher a value of the density. For a given estimate $\hat\mu(x) = y$ we can read off which $\mu$ (those where the point $(y, \mu)$ is coloured) could produce such an estimate. Note that the dark band is most narrow when $n$ is largest.

probability measure $P_\theta$ in our parameterised family (thus the true parameter is $\theta$), then it is rather unlikely that we have $\hat\theta(x) = \theta$, but we certainly hope that the estimate and the true value are not too far apart. We have seen how the distribution of the estimator, when regarded as a random variable, tells us how estimates will deviate from the true $\theta$. The information flow in the sentence above is that if $\theta$ is the true parameter, then the distribution of $\hat\theta$ under $P_\theta$ tells how far from $\theta$ we can expect to find realisations of $\hat\theta$. We want to turn things upside down and for a given estimate tell how far it is from the true $\theta$. What we are going to do is to combine knowledge about the distribution of $\hat\theta$ for all $\theta \in \Theta$ with an observation $x \in E$ and convert this into knowledge about which parameters we believe could have produced $x$. We first illustrate the main line of thought with two examples.

Example: Let $X_1,\ldots,X_n$ be iid $N(\mu, 1)$ distributed with $\mu \in \mathbb{R}$. Thus our parameter space is $\mathbb{R}$ and the unknown parameter is the mean $\mu$ in the normal distribution. Our sample space is $\mathbb{R}^n$ and the observation is an $n$-dimensional vector $x = (x_1,\ldots,x_n)$. We will consider the estimator

$$\hat\mu = \frac{1}{n}\sum_{k=1}^n X_k,$$

which is the empirical mean (and in this case the MLE as well). The distribution of $\hat\mu$ is a $N(\mu, 1/n)$ distribution.
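The claim that $\hat\mu \sim N(\mu, 1/n)$ is easy to check by simulation. A small sketch, not part of the notes' derivation; the choices $\mu = 2$, $n = 50$ are illustrative:

```python
import random
import statistics

# Simulate the sampling distribution of the empirical mean for
# iid N(mu, 1) data; its mean should be close to mu and its
# variance close to 1/n.
random.seed(1)
mu, n, reps = 2.0, 50, 20000
estimates = [statistics.fmean(random.gauss(mu, 1.0) for _ in range(n))
             for _ in range(reps)]
print(statistics.fmean(estimates))     # roughly mu = 2.0
print(statistics.variance(estimates))  # roughly 1/n = 0.02
```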

Figure 3.9: These figures show the point probabilities for the distribution of the estimator $\hat p$ regarded as a two-dimensional function, see Example 3.3.4, with $n = 10$ (left) and $n = 50$ (right). The darker a colour, the higher a value of the point probabilities. For a given estimate $\hat p(x) = y$ we can read off which $p$ (those where the point $(p, y)$ is coloured) could produce such an estimate. Note the cigar shape of the high values.

Consider the density for the distribution of $\hat\mu$ as a function of two variables,

$$(y, \mu) \mapsto \sqrt{\frac{n}{2\pi}}\exp\Bigl(-\frac{n(y - \mu)^2}{2}\Bigr), \quad (y, \mu) \in \mathbb{R} \times \mathbb{R}.$$

For a given $\mu$ this is simply the density for the distribution of $\hat\mu$ as a function of $y$, but when we change $\mu$ the density changes as well. A combination of $(y, \mu)$ where the density takes a high value has the interpretation that $y$ is a likely estimate if $\mu$ is the true parameter. On Figure 3.8 we see two examples ($n = 10$ and $n = 50$) that illustrate how the density as a function of $(y, \mu)$ behaves. For a given estimate $y = \hat\mu(x)$ we can from the figure read off which values of $\mu$ make the estimate likely and which do not.

Example: Let $X_1,\ldots,X_n$ be iid Bernoulli distributed with success probability $p \in [0,1]$. Our parameter space is $[0,1]$ and the unknown parameter is the success probability $p$. Our sample space is $\{0,1\}^n$ and the observation is an $n$-dimensional vector $x = (x_1,\ldots,x_n)$ of 0-1 variables. We will consider the estimator

$$\hat p = \frac{1}{n}\sum_{k=1}^n X_k,$$

which is the relative frequency of 1's (and the MLE as well). The distribution of

$n\hat p = \sum_{k=1}^n X_k$ is a binomial distribution with parameters $(n, p)$, which implicitly² gives the distribution of $\hat p$. We consider here the point probabilities for the distribution of $\hat p$ as a function of two variables,

$$(y, p) \mapsto \binom{n}{ny} p^{ny}(1-p)^{n-ny}, \quad (y, p) \in \{0, 1/n, \ldots, 1\} \times [0,1].$$

For a given $p$ these are the point probabilities for observing $\hat p(x) = y$ as a function of $y$, but when we change $p$ the point probabilities change as well. A combination of $(y, p)$ with a large point probability has the interpretation that $y$ is a likely estimate if $p$ is the true parameter. On Figure 3.9 we see two examples ($n = 10$ and $n = 50$) that illustrate how the point probabilities change as a function of $(y, p)$. For a given estimate $y = \hat p(x)$ we can from the figure read off which values of $p$ make the estimate likely and which do not. Note, in comparison with the normal distribution as considered in the previous example and on Figure 3.8, that the shape of the large values changes from a simple band around the diagonal to a cigar shape in this binomial example. This is because the variance of the estimator $\hat p$ changes with $p$: it is largest for $p = 0.5$ and smallest when $p$ approaches 0 or 1.

What we lack in the examples above is to quantify precisely how uncertain a given estimate is. We have illustrated how the distribution of the estimator as a function of $\theta$ can be turned around to give information about which values of $\theta$ could have produced a given estimate. In both examples the estimate was judged to be likely for values of the parameter close to the estimate and unlikely for values far from the estimate (high values are close to the diagonal). We would like to report an interval, say, around the estimate such that we are pretty confident that the true parameter is within the interval, but how large should a reasonable interval be?
That essentially depends upon how certain or confident we want to be that the true parameter is in the interval. The following definition captures this in a general formulation.

Definition: A confidence set for the parameter $\theta$ given the observation $x \in E$ is a subset $I(x) \subseteq \Theta$. If we for each $x \in E$ have given a confidence set $I(x)$, we say that the family $(I(x))_{x \in E}$ are level $1-\alpha$ confidence sets for the unknown parameter if for all $\theta \in \Theta$

$$P_\theta(\theta \in I(X)) \geq 1 - \alpha. \quad (3.9)$$

If $\Theta \subseteq \mathbb{R}$ and $I(x)$ is an interval we call $I(x)$ a confidence interval.

Note that $P_\theta(\theta \in I(X))$ is a probability statement about whether the random confidence set $I(X)$ will contain the parameter prior to conducting the experiment, and not about whether the parameter belongs to the confidence set $I(x)$ after having observed the realisation $x$ of $X$. This is a very subtle point about the interpretation of confidence sets. It is the observation, and therefore the confidence set, that is a realisation of the random experiment, and not the unknown

² The distribution of $\hat p$ is a distribution on $\{0, 1/n, 2/n, \ldots, 1\}$, a set that changes with $n$, and the convention is to report the distribution in terms of $n\hat p$, which is a distribution on $\mathbb{Z}$.

parameter. For a given realisation $x$ we simply can't tell whether $I(x)$ contains $\theta$ or not, since $\theta$ is unknown. But if we, prior to making the experiment, decide upon a family of level $1-\alpha$ confidence sets that we will choose among depending on the observation, then we know that the probability that the confidence set we end up with really contains $\theta$ is at least $1-\alpha$, no matter what $\theta$ is. If $\alpha$ is chosen small, $\alpha = 0.05$ say, then we are pretty confident that $\theta$ is actually in $I(x)$, and if $\alpha = 0.01$ we are even more so.

How are we then in practice going to construct confidence sets? In this set of notes we will always base the confidence set upon a given function $H : \Theta \times \Theta \to \mathbb{R}$ and define, for $A \subseteq \mathbb{R}$, the set $I(x)$ by

$$I(x) = \{\theta \in \Theta \mid H(\hat\theta(x), \theta) \in A\}.$$

One then has to decide which set $A$ to choose to attain the desired level for such confidence sets. We find that

$$P_\theta(\theta \in I(X)) = P_\theta(H(\hat\theta, \theta) \in A),$$

so choosing $A$ to make the sets have level $1-\alpha$ is a matter of finding the distribution of the real valued random variable $H(\hat\theta, \theta)$ under $P_\theta$. This distribution is a transformation of the distribution of $\hat\theta$ using the function $\hat\theta \mapsto H(\hat\theta, \theta)$.

Example: Assume that $\Theta \subseteq \mathbb{R}$ and denote by $\sigma(\theta) = \sqrt{V_\theta\hat\theta}$ the standard deviation of the estimator $\hat\theta$ under $P_\theta$. Define $H : \Theta \times \Theta \to \mathbb{R}$ by

$$H(\hat\vartheta, \theta) = \frac{\hat\vartheta - \theta}{\sigma(\hat\vartheta)}.$$

With $A = [-z, z]$ and $x \in E$ it follows (with $\hat\vartheta = \hat\theta(x)$) that

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \in A\} = \{\theta \in \Theta \mid -z\sigma(\hat\vartheta) \leq \hat\vartheta - \theta \leq z\sigma(\hat\vartheta)\} = [\hat\vartheta - z\sigma(\hat\vartheta),\ \hat\vartheta + z\sigma(\hat\vartheta)].$$

Note that these confidence sets are always intervals. A typical choice of $z$ is $z = 1.96$ (sometimes the less precise choice $z = 2$ is used). This relies on an important theoretical result valid for many estimators, namely that the distribution of $H(\hat\theta, \theta)$ under $P_\theta$ is approximately a $N(0,1)$-distribution for all $\theta$ when $n$ is large enough. The probability that a normally distributed random variable with mean 0 and variance 1 falls in the interval $[-z, z]$ is

$$\frac{1}{\sqrt{2\pi}}\int_{-z}^{z} \exp\Bigl(-\frac{x^2}{2}\Bigr)\, dx,$$

and it is easy to compute these integrals numerically for different values of $z$.
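This integral equals $\mathrm{erf}(z/\sqrt{2})$, so a quick numerical check needs nothing beyond the standard library (a sketch; the function name is illustrative):

```python
import math

def std_normal_mass(z):
    """P(-z <= Z <= z) for Z ~ N(0, 1), via the error function."""
    return math.erf(z / math.sqrt(2))

print(round(std_normal_mass(1.96), 4))  # 0.95
```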
They equal 0.95 for $z = 1.96$, i.e. with $z = 1.96$ the intervals $I(x)$ defined above are approximately level 0.95 confidence intervals.
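The coverage interpretation of these intervals can also be checked by simulation. A minimal sketch for iid $N(\mu, 1)$ data, where $\sigma(\theta) = 1/\sqrt{n}$ is the exact standard deviation of $\hat\mu$ and does not depend on $\theta$; the values of $\mu$, $n$ and the repetition count are illustrative:

```python
import random
import statistics

# Monte Carlo check of the actual coverage of
# [mu_hat - 1.96/sqrt(n), mu_hat + 1.96/sqrt(n)] for iid N(mu, 1) data.
random.seed(0)
mu, n, reps, z = 0.7, 25, 10000, 1.96
half_width = z / n**0.5
hits = 0
for _ in range(reps):
    mu_hat = statistics.fmean(random.gauss(mu, 1.0) for _ in range(n))
    if mu_hat - half_width <= mu <= mu_hat + half_width:
        hits += 1
print(hits / reps)  # close to the nominal 0.95
```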

In the example above we introduced a construction of confidence sets that only approximately have the desired level $1-\alpha$. This leads to the definitions of the actual and the nominal coverage probability. The actual coverage probability is the function

$$\theta \mapsto P_\theta(\theta \in I(X)),$$

and the sets $I(x)$, $x \in E$, are level $1-\alpha$ confidence sets if the actual coverage probability is larger than $1-\alpha$ for all $\theta$. If we aim at producing level $1-\alpha$ confidence sets we call $1-\alpha$ the nominal coverage probability. The intervals produced as in the example above are therefore said to have nominal coverage probability 0.95, but the actual coverage probability is unknown. Hopefully, if the approximation used is good, it is not far from the nominal coverage probability.

Example: Still with $\Theta \subseteq \mathbb{R}$ and $\sigma(\theta) = \sqrt{V_\theta\hat\theta}$, we may choose the function $H : \Theta \times \Theta \to \mathbb{R}$ as

$$H(\hat\vartheta, \theta) = \frac{\hat\vartheta - \theta}{\sigma(\theta)}.$$

The subtle difference compared to the previous example is that we divide by the standard deviation of the estimator under $P_\theta$ instead of under $P_{\hat\vartheta}$. With this $H$ and $A = [-z, z]$ we find that

$$I(x) = \{\theta \in \Theta \mid -z \leq H(\hat\vartheta, \theta) \leq z\} = \{\theta \in \Theta \mid -\sigma(\theta)z \leq \hat\vartheta - \theta \leq \sigma(\theta)z\},$$

but this is as far as we get. If we don't know more about the standard deviation $\sigma(\theta)$ as a function of $\theta$, we cannot find a more explicit expression for the confidence sets. Moreover, even if we have an analytic formula for the standard deviation, it is not likely to be easy to solve the inequalities, and there is no particular reason that we should get a nice set, e.g. an interval, out of it. The choice $z = 1.96$ will, however, still produce (approximate) 0.95 confidence sets, since also in this setup $H(\hat\theta, \theta)$ has approximately a $N(0,1)$-distribution. If $\sigma(\theta)$ is not too rapidly varying as a function of $\theta$, it plays only a minor role whether the procedure in this example or the procedure in the previous example is used.
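When the inequalities cannot be solved analytically, the set $\{\theta \mid -z\sigma(\theta) \leq \hat\vartheta - \theta \leq z\sigma(\theta)\}$ can simply be traced out numerically on a grid. A sketch, using the binomial standard deviation $\sigma(p) = \sqrt{p(1-p)/n}$ as an illustrative choice (the numbers are made up for the example):

```python
# Trace out I(x) = {theta : |theta_hat - theta| <= z * sigma(theta)}
# on a grid, for the binomial case sigma(p) = sqrt(p(1-p)/n).
n, z = 50, 1.96
p_hat = 0.1  # e.g. 5 successes out of 50

def sigma(p):
    return (p * (1 - p) / n) ** 0.5

grid = [k / 10000 for k in range(10001)]
inside = [p for p in grid if abs(p_hat - p) <= z * sigma(p)]
print(min(inside), max(inside))  # endpoints of the (here interval-shaped) set
```

In this particular case the set happens to be an interval (it is the score-type interval for a binomial proportion), but the grid approach makes no such assumption.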
However, if $\sigma(\theta)$ is rapidly varying, the actual coverage probability of the confidence intervals produced as in the first example may be substantially worse than the actual coverage probability produced by the procedure in this example.

Example: In the two examples above we could consider squaring the $H$-function; corresponding to the first of them we obtain

$$H(\hat\vartheta, \theta) = \frac{(\hat\vartheta - \theta)^2}{\sigma^2(\hat\vartheta)}. \quad (3.10)$$

Proceeding as in those examples, except choosing $A = [0, z]$, the confidence set becomes

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \leq z\} = [\hat\vartheta - \sigma(\hat\vartheta)\sqrt{z},\ \hat\vartheta + \sigma(\hat\vartheta)\sqrt{z}].$$

Hence we obtain the same interval as before. The point is that a similar approach works for multidimensional parameter sets, see the Math Box below.

Math Box (Multidimensional confidence sets): The approach above can be generalised to multidimensional parameters. If $\Theta \subseteq \mathbb{R}^p$, the parameter $\theta$ is a $p$-dimensional (column) vector, and if $\Sigma(\theta)$ denotes the covariance matrix of $\hat\theta$ under $P_\theta$, the natural generalisation of (3.10) is given by

$$H(\hat\vartheta, \theta) = (\hat\vartheta - \theta)^T \Sigma(\hat\vartheta)^{-1} (\hat\vartheta - \theta). \quad (3.11)$$

The notation $(\hat\vartheta - \theta)^T$ denotes the transposed (row) vector of the column vector $\hat\vartheta - \theta$. The corresponding confidence set, with $A = [0, z]$, becomes

$$I(x) = \{\theta \in \Theta \mid H(\hat\vartheta, \theta) \leq z\} = \{\theta \in \Theta \mid (\hat\vartheta - \theta)^T \Sigma(\hat\vartheta)^{-1} (\hat\vartheta - \theta) \leq z\}.$$

Such a set is known as an ellipsoid in $\mathbb{R}^p$, and if $p = 2$ the set is an ellipse. The generalisation corresponding to the second construction is given by

$$H(\hat\vartheta, \theta) = (\hat\vartheta - \theta)^T \Sigma(\theta)^{-1} (\hat\vartheta - \theta). \quad (3.12)$$

Just as before, there is no nice explicit form of the corresponding confidence set.

The most serious problem in constructing confidence sets is to find the distribution of $H(\hat\theta, \theta)$ under $P_\theta$. In the two examples above we overcame this problem by taking the distribution of $H(\hat\theta, \theta)$ to be approximately a $N(0,1)$-distribution. The real obstacle left is then to find the standard deviation $\sigma(\theta)$ of the estimator $\hat\theta$ under $P_\theta$. Additional problems with solving certain inequalities also occurred in the second example. In practice, as implemented in a number of statistical software packages, the variance is also approximated by an asymptotic formula, which is obtainable for standard estimators like the MLE, and the approach from the first example is used. Standard programs rarely want to bother with solving inequalities as in the second example, and reporting non-intervals is certainly not attractive. The typical confidence intervals reported by statistical software are therefore based on a number of approximations that are valid only if we have sufficiently many replications of our experiment.
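Checking whether a given $\theta$ belongs to the ellipsoidal set defined by (3.11) is a direct computation. A minimal sketch for $p = 2$; the covariance matrix and the choice of $z$ as a $\chi^2(2)$ quantile are illustrative assumptions, not taken from the notes:

```python
# Membership test for the ellipsoidal confidence set
#   {theta : (theta_hat - theta)^T Sigma^{-1} (theta_hat - theta) <= z}
# in p = 2 dimensions, with an illustrative covariance matrix.

def in_ellipse(theta, theta_hat, cov, z):
    d0 = theta_hat[0] - theta[0]
    d1 = theta_hat[1] - theta[1]
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    # Quadratic form d^T cov^{-1} d via the explicit 2x2 inverse.
    q = (cov[1][1] * d0 * d0 - 2 * cov[0][1] * d0 * d1
         + cov[0][0] * d1 * d1) / det
    return q <= z

theta_hat = (1.0, 2.0)
cov = [[0.04, 0.01], [0.01, 0.02]]  # assumed Sigma(theta_hat)
z = 5.99                            # ~0.95 quantile of a chi-squared(2)
print(in_ellipse((1.0, 2.0), theta_hat, cov, z))  # True: the centre is inside
print(in_ellipse((3.0, 2.0), theta_hat, cov, z))  # False: far from the estimate
```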
It is important to be aware that this can lead to confidence intervals with an actual coverage probability lower than the nominal level.

Example: As a continuation of the normal distribution example, let $X_1,\ldots,X_n$ be iid $N(\mu, \sigma^2)$ distributed. Consider the estimator

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$$

of the mean, and regard the variance as fixed. We know that the mean and variance of $\hat\mu$ are $\mu$ and $\sigma^2/n$ respectively. The distribution of $\hat\mu$ is $N(\mu, \sigma^2/n)$, and therefore the distribution of

$$H(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma}$$

is $N(0,1)$, so the construction of confidence intervals as in the first example is not an approximation in this particular case. This $H$ function is the mathematical incarnation of Figure 3.8.

Example: Let $X_1,\ldots,X_n$ be iid $N(\mu, \sigma_0^2(\mu))$ distributed with $\sigma_0(\mu)$ some function depending upon $\mu$. That is, the standard deviation changes as a function of the mean. Consider the same estimator

$$\hat\mu = \frac{1}{n}\sum_{i=1}^n X_i$$

as above. Then $\hat\mu$ is $N(\mu, \sigma_0^2(\mu)/n)$-distributed, but

$$H(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma_0(\hat\mu)}$$

does not in general have a normal distribution. In the notation used previously, the standard deviation of the estimator is

$$\sigma(\mu) = \sqrt{V_\mu\hat\mu} = \frac{\sigma_0(\mu)}{\sqrt{n}},$$

and

$$H_2(\hat\mu, \mu) = \frac{\sqrt{n}(\hat\mu - \mu)}{\sigma_0(\mu)}$$

is $N(0,1)$-distributed. Confidence sets based on $H_2$ are therefore not approximate, but the sets are not necessarily intervals. A real horror example of the deficits of using $H$ is obtained by letting

$$\sigma_0(\mu) = \begin{cases} 10 & \text{if } \mu \leq 0 \\ 1 & \text{if } \mu > 0 \end{cases}$$

Take $\mu = 0$ and fix $n = 10$. With the observation $x = (x_1,\ldots,x_{10})$, assume that $\hat\mu = \frac{1}{10}\sum_i x_i > 0$ (something that happens with probability 0.5 under $P_0$); a 0.95 confidence interval based upon $H$ is

$$I(x) = [\hat\mu - 1.96\,\sigma_0(\hat\mu)/\sqrt{10},\ \hat\mu + 1.96\,\sigma_0(\hat\mu)/\sqrt{10}] = [\hat\mu - 0.620,\ \hat\mu + 0.620].$$

We observe that $I(x)$ doesn't contain the value $\mu = 0$ if $\hat\mu > 0.620$, and under $P_0$ this happens with probability approximately 0.42. The conclusion is that the actual coverage probability is lower than 0.58

for $\mu = 0$. Of course $\mu = 0$ is the worst possible choice of $\mu$, but also for other values of $\mu$ near 0 the actual coverage probability falls short of the nominal level. Using $H_2$ instead we are guaranteed to get the right coverage probability, but the confidence sets are in some cases not intervals. We find that

$$I(x) = \{\mu \mid -1.96 \leq H_2(\hat\mu, \mu) \leq 1.96\} = \{\mu \mid -0.620\,\sigma_0(\mu) \leq \hat\mu - \mu \leq 0.620\,\sigma_0(\mu)\} = \bigl([\hat\mu - 0.620,\ \hat\mu + 0.620] \cap (0, \infty)\bigr) \cup \bigl([\hat\mu - 6.20,\ \hat\mu + 6.20] \cap (-\infty, 0]\bigr).$$

The problem in the previous example is that the distribution of $H(\hat\mu, \mu)$ changes very rapidly when $\mu$ changes from being negative to being positive. As a rule of thumb, the less the distribution of $H(\hat\theta, \theta)$ changes with $\theta$, the more we can trust that the confidence intervals have an actual coverage probability close to the nominal level $1-\alpha$.

One should note that if $\Theta \subseteq \mathbb{R}^p$ is a multidimensional parameter space, all the methods discussed in this section can be applied to each of the coordinates of the parameter. That is, an estimator is a $p$-dimensional map $\hat\theta = (\hat\theta_1,\ldots,\hat\theta_p)$ and each of the coordinates $\hat\theta_i$ is a real valued estimator. Considering each coordinate separately can then give marginal information about the uncertainty of a concrete estimate by producing e.g. marginal confidence intervals for each coordinate. One should be careful, though, about making any kind of multidimensional interpretation from such one-dimensional confidence intervals.

Bootstrapping

The idea in bootstrapping for constructing confidence sets is to find an approximation of the distribution of $H(\hat\theta, \theta)$, usually by doing some simulations, that depends upon the observed dataset $x \in E$. What we try is to approximate the distribution of $H(\hat\theta, \theta)$ under $P_\theta$ for all $\theta \in \Theta$ by a single distribution, namely the distribution of $H(\hat\theta, \hat\vartheta)$ under a cleverly chosen probability measure $P_x$, which may depend upon $x$.
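The idea just described can be sketched concretely. Here $P_x$ is chosen parametrically as $N(\hat\vartheta, 1)$ with $H(\hat\vartheta, \theta) = \hat\vartheta - \theta$; this is one possible choice among several, made purely for illustration, and all concrete numbers are assumptions:

```python
import random
import statistics

# Parametric bootstrap sketch: approximate the distribution of
# H(theta_hat, theta) = theta_hat - theta by the distribution of
# theta_hat* - theta_hat, where theta_hat* is re-estimated from data
# simulated under P_x = N(theta_hat, 1).
random.seed(42)
x = [random.gauss(0.5, 1.0) for _ in range(30)]  # stands in for the observed sample
theta_hat = statistics.fmean(x)

boot = []
for _ in range(2000):
    x_star = [random.gauss(theta_hat, 1.0) for _ in range(len(x))]
    boot.append(statistics.fmean(x_star) - theta_hat)

boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
# Invert theta_hat - theta ~ bootstrap distribution to get a 95% interval.
print(theta_hat - hi, theta_hat - lo)
```

Only the simulated distribution of $\hat\theta^* - \hat\vartheta$ is used; no knowledge of the distribution of $\hat\theta$ under every $P_\theta$ is required, which is the whole point of the method.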
Since the distribution is allowed to depend upon the concrete observation $x$, the construction of the confidence set that provides information about the uncertainty of the estimate $\hat\vartheta = \hat\theta(x)$ depends upon the observation itself. Hence what we are going to suggest is to pull information about the uncertainty of an estimate out of the very same data that were used to make the estimate, and for this reason the method is known as bootstrapping. Supposedly one of the stories in The Surprising Adventures of Baron Munchausen by Rudolf Erich Raspe contains a passage where the Baron pulls himself out of a deep lake by his own bootstraps. Such a story can, however, not be found in the original writings by Raspe, but the stories of Baron Munchausen were borrowed and expanded by other writers, and one can find versions where the Baron indeed did something like that. To bootstrap is nowadays, with reference to the Baron Munchausen story, used to describe certain seemingly paradoxical constructions or actions. To boot a computer is an abbreviation of running a so-called bootstrap procedure that gets the computer up and running from scratch.

The problem of finding the distribution of $H(\hat\theta, \theta)$ for all $\theta$ is replaced by the problem of finding a single distribution under a probability measure that depends upon $x$. Different


More information

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables?

Machine Learning CSE546 Sham Kakade University of Washington. Oct 4, What about continuous variables? Linear Regression Machine Learning CSE546 Sham Kakade University of Washington Oct 4, 2016 1 What about continuous variables? Billionaire says: If I am measuring a continuous variable, what can you do

More information

Better Bootstrap Confidence Intervals

Better Bootstrap Confidence Intervals by Bradley Efron University of Washington, Department of Statistics April 12, 2012 An example Suppose we wish to make inference on some parameter θ T (F ) (e.g. θ = E F X ), based on data We might suppose

More information

V. Properties of estimators {Parts C, D & E in this file}

V. Properties of estimators {Parts C, D & E in this file} A. Definitions & Desiderata. model. estimator V. Properties of estimators {Parts C, D & E in this file}. sampling errors and sampling distribution 4. unbiasedness 5. low sampling variance 6. low mean squared

More information

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1

Terminology Suppose we have N observations {x(n)} N 1. Estimators as Random Variables. {x(n)} N 1 Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maximum likelihood Consistency Confidence intervals Properties of the mean estimator Properties of the

More information

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn! Questions?! C. Porciani! Estimation & forecasting! 2! Cosmological parameters! A branch of modern cosmological research focuses

More information

Review of Discrete Probability (contd.)

Review of Discrete Probability (contd.) Stat 504, Lecture 2 1 Review of Discrete Probability (contd.) Overview of probability and inference Probability Data generating process Observed data Inference The basic problem we study in probability:

More information

7.1 Basic Properties of Confidence Intervals

7.1 Basic Properties of Confidence Intervals 7.1 Basic Properties of Confidence Intervals What s Missing in a Point Just a single estimate What we need: how reliable it is Estimate? No idea how reliable this estimate is some measure of the variability

More information

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff

IEOR 165 Lecture 7 1 Bias-Variance Tradeoff IEOR 165 Lecture 7 Bias-Variance Tradeoff 1 Bias-Variance Tradeoff Consider the case of parametric regression with β R, and suppose we would like to analyze the error of the estimate ˆβ in comparison to

More information

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21 CS 70 Discrete Mathematics and Probability Theory Fall 205 Lecture 2 Inference In this note we revisit the problem of inference: Given some data or observations from the world, what can we infer about

More information

CS 361: Probability & Statistics

CS 361: Probability & Statistics March 14, 2018 CS 361: Probability & Statistics Inference The prior From Bayes rule, we know that we can express our function of interest as Likelihood Prior Posterior The right hand side contains the

More information

Chapter 12: An introduction to Time Series Analysis. Chapter 12: An introduction to Time Series Analysis

Chapter 12: An introduction to Time Series Analysis. Chapter 12: An introduction to Time Series Analysis Chapter 12: An introduction to Time Series Analysis Introduction In this chapter, we will discuss forecasting with single-series (univariate) Box-Jenkins models. The common name of the models is Auto-Regressive

More information

The Surprising Conditional Adventures of the Bootstrap

The Surprising Conditional Adventures of the Bootstrap The Surprising Conditional Adventures of the Bootstrap G. Alastair Young Department of Mathematics Imperial College London Inaugural Lecture, 13 March 2006 Acknowledgements Early influences: Eric Renshaw,

More information

PMR Learning as Inference

PMR Learning as Inference Outline PMR Learning as Inference Probabilistic Modelling and Reasoning Amos Storkey Modelling 2 The Exponential Family 3 Bayesian Sets School of Informatics, University of Edinburgh Amos Storkey PMR Learning

More information

Harvard University. Rigorous Research in Engineering Education

Harvard University. Rigorous Research in Engineering Education Statistical Inference Kari Lock Harvard University Department of Statistics Rigorous Research in Engineering Education 12/3/09 Statistical Inference You have a sample and want to use the data collected

More information

Lecture 2: Statistical Decision Theory (Part I)

Lecture 2: Statistical Decision Theory (Part I) Lecture 2: Statistical Decision Theory (Part I) Hao Helen Zhang Hao Helen Zhang Lecture 2: Statistical Decision Theory (Part I) 1 / 35 Outline of This Note Part I: Statistics Decision Theory (from Statistical

More information

Statistical inference

Statistical inference Statistical inference Contents 1. Main definitions 2. Estimation 3. Testing L. Trapani MSc Induction - Statistical inference 1 1 Introduction: definition and preliminary theory In this chapter, we shall

More information

HT Introduction. P(X i = x i ) = e λ λ x i

HT Introduction. P(X i = x i ) = e λ λ x i MODS STATISTICS Introduction. HT 2012 Simon Myers, Department of Statistics (and The Wellcome Trust Centre for Human Genetics) myers@stats.ox.ac.uk We will be concerned with the mathematical framework

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013

Machine Learning CSE546 Carlos Guestrin University of Washington. September 30, 2013 Bayesian Methods Machine Learning CSE546 Carlos Guestrin University of Washington September 30, 2013 1 What about prior n Billionaire says: Wait, I know that the thumbtack is close to 50-50. What can you

More information

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing

Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing Notes 3: Statistical Inference: Sampling, Sampling Distributions Confidence Intervals, and Hypothesis Testing 1. Purpose of statistical inference Statistical inference provides a means of generalizing

More information

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review

STATS 200: Introduction to Statistical Inference. Lecture 29: Course review STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout

More information

Slope Fields: Graphing Solutions Without the Solutions

Slope Fields: Graphing Solutions Without the Solutions 8 Slope Fields: Graphing Solutions Without the Solutions Up to now, our efforts have been directed mainly towards finding formulas or equations describing solutions to given differential equations. Then,

More information

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That

Statistics. Lecture 2 August 7, 2000 Frank Porter Caltech. The Fundamentals; Point Estimation. Maximum Likelihood, Least Squares and All That Statistics Lecture 2 August 7, 2000 Frank Porter Caltech The plan for these lectures: The Fundamentals; Point Estimation Maximum Likelihood, Least Squares and All That What is a Confidence Interval? Interval

More information

University of Regina. Lecture Notes. Michael Kozdron

University of Regina. Lecture Notes. Michael Kozdron University of Regina Statistics 252 Mathematical Statistics Lecture Notes Winter 2005 Michael Kozdron kozdron@math.uregina.ca www.math.uregina.ca/ kozdron Contents 1 The Basic Idea of Statistics: Estimating

More information

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer. Department of Computer Science Virginia Tech Blacksburg, Virginia Copyright c 2015 by Clifford A. Shaffer Computer Science Title page Computer Science Clifford A. Shaffer Fall 2015 Clifford A. Shaffer

More information

DIFFERENTIAL EQUATIONS

DIFFERENTIAL EQUATIONS DIFFERENTIAL EQUATIONS Basic Concepts Paul Dawkins Table of Contents Preface... Basic Concepts... 1 Introduction... 1 Definitions... Direction Fields... 8 Final Thoughts...19 007 Paul Dawkins i http://tutorial.math.lamar.edu/terms.aspx

More information

Chapter 3. Estimation of p. 3.1 Point and Interval Estimates of p

Chapter 3. Estimation of p. 3.1 Point and Interval Estimates of p Chapter 3 Estimation of p 3.1 Point and Interval Estimates of p Suppose that we have Bernoulli Trials (BT). So far, in every example I have told you the (numerical) value of p. In science, usually the

More information

Lecture 6: Finite Fields

Lecture 6: Finite Fields CCS Discrete Math I Professor: Padraic Bartlett Lecture 6: Finite Fields Week 6 UCSB 2014 It ain t what they call you, it s what you answer to. W. C. Fields 1 Fields In the next two weeks, we re going

More information

Advanced Signal Processing Introduction to Estimation Theory

Advanced Signal Processing Introduction to Estimation Theory Advanced Signal Processing Introduction to Estimation Theory Danilo Mandic, room 813, ext: 46271 Department of Electrical and Electronic Engineering Imperial College London, UK d.mandic@imperial.ac.uk,

More information

Interval estimation. October 3, Basic ideas CLT and CI CI for a population mean CI for a population proportion CI for a Normal mean

Interval estimation. October 3, Basic ideas CLT and CI CI for a population mean CI for a population proportion CI for a Normal mean Interval estimation October 3, 2018 STAT 151 Class 7 Slide 1 Pandemic data Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0

More information

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions

ACMS Statistics for Life Sciences. Chapter 13: Sampling Distributions ACMS 20340 Statistics for Life Sciences Chapter 13: Sampling Distributions Sampling We use information from a sample to infer something about a population. When using random samples and randomized experiments,

More information

6.867 Machine Learning

6.867 Machine Learning 6.867 Machine Learning Problem set 1 Solutions Thursday, September 19 What and how to turn in? Turn in short written answers to the questions explicitly stated, and when requested to explain or prove.

More information

Estimators as Random Variables

Estimators as Random Variables Estimation Theory Overview Properties Bias, Variance, and Mean Square Error Cramér-Rao lower bound Maimum likelihood Consistency Confidence intervals Properties of the mean estimator Introduction Up until

More information

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro

Methods and Criteria for Model Selection. CS57300 Data Mining Fall Instructor: Bruno Ribeiro Methods and Criteria for Model Selection CS57300 Data Mining Fall 2016 Instructor: Bruno Ribeiro Goal } Introduce classifier evaluation criteria } Introduce Bias x Variance duality } Model Assessment }

More information

Chapter 8 - Statistical intervals for a single sample

Chapter 8 - Statistical intervals for a single sample Chapter 8 - Statistical intervals for a single sample 8-1 Introduction In statistics, no quantity estimated from data is known for certain. All estimated quantities have probability distributions of their

More information

Topic 5 Notes Jeremy Orloff. 5 Homogeneous, linear, constant coefficient differential equations

Topic 5 Notes Jeremy Orloff. 5 Homogeneous, linear, constant coefficient differential equations Topic 5 Notes Jeremy Orloff 5 Homogeneous, linear, constant coefficient differential equations 5.1 Goals 1. Be able to solve homogeneous constant coefficient linear differential equations using the method

More information

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator

Estimation Theory. as Θ = (Θ 1,Θ 2,...,Θ m ) T. An estimator Estimation Theory Estimation theory deals with finding numerical values of interesting parameters from given set of data. We start with formulating a family of models that could describe how the data were

More information

1 Using standard errors when comparing estimated values

1 Using standard errors when comparing estimated values MLPR Assignment Part : General comments Below are comments on some recurring issues I came across when marking the second part of the assignment, which I thought it would help to explain in more detail

More information

SIO 221B, Rudnick adapted from Davis 1. 1 x lim. N x 2 n = 1 N. { x} 1 N. N x = 1 N. N x = 1 ( N N x ) x = 0 (3) = 1 x N 2

SIO 221B, Rudnick adapted from Davis 1. 1 x lim. N x 2 n = 1 N. { x} 1 N. N x = 1 N. N x = 1 ( N N x ) x = 0 (3) = 1 x N 2 SIO B, Rudnick adapted from Davis VII. Sampling errors We do not have access to the true statistics, so we must compute sample statistics. By this we mean that the number of realizations we average over

More information

4.2 Estimation on the boundary of the parameter space

4.2 Estimation on the boundary of the parameter space Chapter 4 Non-standard inference As we mentioned in Chapter the the log-likelihood ratio statistic is useful in the context of statistical testing because typically it is pivotal (does not depend on any

More information

Monte Carlo Studies. The response in a Monte Carlo study is a random variable.

Monte Carlo Studies. The response in a Monte Carlo study is a random variable. Monte Carlo Studies The response in a Monte Carlo study is a random variable. The response in a Monte Carlo study has a variance that comes from the variance of the stochastic elements in the data-generating

More information

Algebra Exam. Solutions and Grading Guide

Algebra Exam. Solutions and Grading Guide Algebra Exam Solutions and Grading Guide You should use this grading guide to carefully grade your own exam, trying to be as objective as possible about what score the TAs would give your responses. Full

More information

Part 4: Multi-parameter and normal models

Part 4: Multi-parameter and normal models Part 4: Multi-parameter and normal models 1 The normal model Perhaps the most useful (or utilized) probability model for data analysis is the normal distribution There are several reasons for this, e.g.,

More information

Choosing among models

Choosing among models Eco 515 Fall 2014 Chris Sims Choosing among models September 18, 2014 c 2014 by Christopher A. Sims. This document is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported

More information

Lecture 10: Generalized likelihood ratio test

Lecture 10: Generalized likelihood ratio test Stat 200: Introduction to Statistical Inference Autumn 2018/19 Lecture 10: Generalized likelihood ratio test Lecturer: Art B. Owen October 25 Disclaimer: These notes have not been subjected to the usual

More information

An analogy from Calculus: limits

An analogy from Calculus: limits COMP 250 Fall 2018 35 - big O Nov. 30, 2018 We have seen several algorithms in the course, and we have loosely characterized their runtimes in terms of the size n of the input. We say that the algorithm

More information

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others.

Unbiased Estimation. Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. Unbiased Estimation Binomial problem shows general phenomenon. An estimator can be good for some values of θ and bad for others. To compare ˆθ and θ, two estimators of θ: Say ˆθ is better than θ if it

More information

Confidence Intervals

Confidence Intervals Quantitative Foundations Project 3 Instructor: Linwei Wang Confidence Intervals Contents 1 Introduction 3 1.1 Warning....................................... 3 1.2 Goals of Statistics..................................

More information

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed

Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 18.466 Notes, March 4, 2013, R. Dudley Maximum likelihood estimation: actual or supposed 1. MLEs in exponential families Let f(x,θ) for x X and θ Θ be a likelihood function, that is, for present purposes,

More information

Statistical Data Analysis

Statistical Data Analysis DS-GA 0 Lecture notes 8 Fall 016 1 Descriptive statistics Statistical Data Analysis In this section we consider the problem of analyzing a set of data. We describe several techniques for visualizing the

More information

Theory of Maximum Likelihood Estimation. Konstantin Kashin

Theory of Maximum Likelihood Estimation. Konstantin Kashin Gov 2001 Section 5: Theory of Maximum Likelihood Estimation Konstantin Kashin February 28, 2013 Outline Introduction Likelihood Examples of MLE Variance of MLE Asymptotic Properties What is Statistical

More information

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain

f(x θ)dx with respect to θ. Assuming certain smoothness conditions concern differentiating under the integral the integral sign, we first obtain 0.1. INTRODUCTION 1 0.1 Introduction R. A. Fisher, a pioneer in the development of mathematical statistics, introduced a measure of the amount of information contained in an observaton from f(x θ). Fisher

More information

MS&E 226: Small Data

MS&E 226: Small Data MS&E 226: Small Data Lecture 6: Bias and variance (v5) Ramesh Johari ramesh.johari@stanford.edu 1 / 49 Our plan today We saw in last lecture that model scoring methods seem to be trading off two different

More information

CSE 103 Homework 8: Solutions November 30, var(x) = np(1 p) = P r( X ) 0.95 P r( X ) 0.

CSE 103 Homework 8: Solutions November 30, var(x) = np(1 p) = P r( X ) 0.95 P r( X ) 0. () () a. X is a binomial distribution with n = 000, p = /6 b. The expected value, variance, and standard deviation of X is: E(X) = np = 000 = 000 6 var(x) = np( p) = 000 5 6 666 stdev(x) = np( p) = 000

More information

Algorithm Independent Topics Lecture 6

Algorithm Independent Topics Lecture 6 Algorithm Independent Topics Lecture 6 Jason Corso SUNY at Buffalo Feb. 23 2009 J. Corso (SUNY at Buffalo) Algorithm Independent Topics Lecture 6 Feb. 23 2009 1 / 45 Introduction Now that we ve built an

More information

Elements of statistics (MATH0487-1)

Elements of statistics (MATH0487-1) Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium November 12, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -

More information

Estimation MLE-Pandemic data MLE-Financial crisis data Evaluating estimators. Estimation. September 24, STAT 151 Class 6 Slide 1

Estimation MLE-Pandemic data MLE-Financial crisis data Evaluating estimators. Estimation. September 24, STAT 151 Class 6 Slide 1 Estimation September 24, 2018 STAT 151 Class 6 Slide 1 Pandemic data Treatment outcome, X, from n = 100 patients in a pandemic: 1 = recovered and 0 = not recovered 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 1 1

More information

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math.

Regression, part II. I. What does it all mean? A) Notice that so far all we ve done is math. Regression, part II I. What does it all mean? A) Notice that so far all we ve done is math. 1) One can calculate the Least Squares Regression Line for anything, regardless of any assumptions. 2) But, if

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Machine Learning, Fall 2012 Homework 2

Machine Learning, Fall 2012 Homework 2 0-60 Machine Learning, Fall 202 Homework 2 Instructors: Tom Mitchell, Ziv Bar-Joseph TA in charge: Selen Uguroglu email: sugurogl@cs.cmu.edu SOLUTIONS Naive Bayes, 20 points Problem. Basic concepts, 0

More information

Simulations. . p.1/25

Simulations. . p.1/25 Simulations Computer simulations of realizations of random variables has become indispensable as supplement to theoretical investigations and practical applications.. p.1/25 Simulations Computer simulations

More information

Topic 12 Overview of Estimation

Topic 12 Overview of Estimation Topic 12 Overview of Estimation Classical Statistics 1 / 9 Outline Introduction Parameter Estimation Classical Statistics Densities and Likelihoods 2 / 9 Introduction In the simplest possible terms, the

More information

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM

Business Economics BUSINESS ECONOMICS. PAPER No. : 8, FUNDAMENTALS OF ECONOMETRICS MODULE No. : 3, GAUSS MARKOV THEOREM Subject Business Economics Paper No and Title Module No and Title Module Tag 8, Fundamentals of Econometrics 3, The gauss Markov theorem BSE_P8_M3 1 TABLE OF CONTENTS 1. INTRODUCTION 2. ASSUMPTIONS OF

More information

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing

STAT 135 Lab 5 Bootstrapping and Hypothesis Testing STAT 135 Lab 5 Bootstrapping and Hypothesis Testing Rebecca Barter March 2, 2015 The Bootstrap Bootstrap Suppose that we are interested in estimating a parameter θ from some population with members x 1,...,

More information

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017

Simple Techniques for Improving SGD. CS6787 Lecture 2 Fall 2017 Simple Techniques for Improving SGD CS6787 Lecture 2 Fall 2017 Step Sizes and Convergence Where we left off Stochastic gradient descent x t+1 = x t rf(x t ; yĩt ) Much faster per iteration than gradient

More information

Least squares under convex constraint

Least squares under convex constraint Stanford University Questions Let Z be an n-dimensional standard Gaussian random vector. Let µ be a point in R n and let Y = Z + µ. We are interested in estimating µ from the data vector Y, under the assumption

More information

1 What does the random effect η mean?

1 What does the random effect η mean? Some thoughts on Hanks et al, Environmetrics, 2015, pp. 243-254. Jim Hodges Division of Biostatistics, University of Minnesota, Minneapolis, Minnesota USA 55414 email: hodge003@umn.edu October 13, 2015

More information

Primer on statistics:

Primer on statistics: Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing ryan.reece@gmail.com http://rreece.github.io/ Insight Data Science - AI Fellows Workshop Feb 16, 018 Outline 1. Maximum likelihood

More information

Linear Independence Reading: Lay 1.7

Linear Independence Reading: Lay 1.7 Linear Independence Reading: Lay 17 September 11, 213 In this section, we discuss the concept of linear dependence and independence I am going to introduce the definitions and then work some examples and

More information

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3

Hypothesis Testing. 1 Definitions of test statistics. CB: chapter 8; section 10.3 Hypothesis Testing CB: chapter 8; section 0.3 Hypothesis: statement about an unknown population parameter Examples: The average age of males in Sweden is 7. (statement about population mean) The lowest

More information

IEOR E4703: Monte-Carlo Simulation

IEOR E4703: Monte-Carlo Simulation IEOR E4703: Monte-Carlo Simulation Output Analysis for Monte-Carlo Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com Output Analysis

More information

Math Review Sheet, Fall 2008

Math Review Sheet, Fall 2008 1 Descriptive Statistics Math 3070-5 Review Sheet, Fall 2008 First we need to know about the relationship among Population Samples Objects The distribution of the population can be given in one of the

More information

Practice Problems Section Problems

Practice Problems Section Problems Practice Problems Section 4-4-3 4-4 4-5 4-6 4-7 4-8 4-10 Supplemental Problems 4-1 to 4-9 4-13, 14, 15, 17, 19, 0 4-3, 34, 36, 38 4-47, 49, 5, 54, 55 4-59, 60, 63 4-66, 68, 69, 70, 74 4-79, 81, 84 4-85,

More information

Expectation is linear. So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then,

Expectation is linear. So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then, Expectation is linear So far we saw that E(X + Y ) = E(X) + E(Y ). Let α R. Then, E(αX) = ω = ω (αx)(ω) Pr(ω) αx(ω) Pr(ω) = α ω X(ω) Pr(ω) = αe(x). Corollary. For α, β R, E(αX + βy ) = αe(x) + βe(y ).

More information

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION

COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION COS513: FOUNDATIONS OF PROBABILISTIC MODELS LECTURE 9: LINEAR REGRESSION SEAN GERRISH AND CHONG WANG 1. WAYS OF ORGANIZING MODELS In probabilistic modeling, there are several ways of organizing models:

More information