Inferring from data. Theory of estimators


1 Inferring from data: theory of estimators

2 Estimators. An estimator is any function e(x) of the data used to provide an estimate (a "measurement") of an unknown parameter. Because estimators are functions of the data, which are random variables, estimators are themselves random variables and therefore have their own probability distributions. The performance of an estimator is evaluated through the properties of its distribution.
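To make the "estimators are random variables" point concrete, here is a minimal sketch (not from the slides; Gaussian data with assumed μ = 10, σ = 2, N = 25): repeating the same experiment many times and histogramming the sample mean reveals the estimator's own distribution.

```python
# Minimal sketch: the sample mean, as an estimator, is itself a random
# variable -- repeat the "experiment" many times and look at its spread.
import numpy as np

rng = np.random.default_rng(seed=1)
true_mean, sigma, N = 10.0, 2.0, 25   # assumed values

# 10^4 repetitions of the same experiment: N Gaussian observations each
estimates = np.array([rng.normal(true_mean, sigma, N).mean()
                      for _ in range(10_000)])

print(f"mean of estimates: {estimates.mean():.3f}")  # ~ 10 (unbiased)
print(f"std  of estimates: {estimates.std():.3f}")   # ~ sigma/sqrt(N) = 0.4
```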

3 Classic properties of estimators.
- Consistency (in probability). Desirable that the estimator e(x) of m converges in probability to m: $\forall \epsilon > 0,\ \lim_{N\to\infty} p(|m - e(x)| > \epsilon) = 0$.
- Precision. Desirable that the variance of the estimator, $V(e(x)) = \langle (e(x) - \langle e(x) \rangle)^2 \rangle$, is minimal.
- Bias. Desirable that the estimator is unbiased (b(m) = 0), where $b(m) = \langle e(x) \rangle - m$.
- Distribution. Desirable that the distribution p(e(x); m) of the estimator is simple (possibly Gaussian).

4 What this is all about. [Figure: sketches of estimator distributions around the true value of the parameter we are trying to measure, illustrating consistency, a low-variance unbiased estimator, a high-variance unbiased estimator, and a biased estimator.]

5 Comments: bias. Many estimators suffer from biases, which in general depend on the parameter m being estimated. For an estimator e(x) of m, the bias b(m) is defined from $E[e(x)] = \langle e(x) \rangle = m + b(m)$. Typically biases are small with respect to the variance. Issues arise, however, when combining biased estimates: the variance shrinks but the bias remains, and so weighs more. If the distribution p(x|m) is known, the bias can be calculated explicitly. If the bias is independent of m (b(m) = b), then use another estimator u(x) = e(x) - b, which is unbiased and has the same precision (variance) as e(x). If the bias depends on m, one needs an unbiased estimator B(x) of b to redefine u(x) = e(x) - B(x). The new estimator has greater variance than e(x), but the loss in precision is often smaller than the bias.

6 Example: bias correction with a known distribution. I have N points $x_i$ distributed as a Gaussian and use the following ML estimator of its variance: $\hat\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$. This estimator has a bias $b = -\sigma^2/N$ and a variance $\mathrm{Var}(\hat\sigma^2) = 2\sigma^4 (N-1)/N^2$. So I can rework an alternative estimator which has zero bias and a variance only slightly larger than that of the previous estimator: $s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(x_i - \bar{x})^2$, with $\mathrm{Var}(s^2) = 2\sigma^4/(N-1)$, which is larger only by a term of order $1/N^2$.
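A quick Monte Carlo check of the two estimators above (a sketch; σ² = 4, N = 10, and the number of toys are arbitrary choices):

```python
# Sketch: Monte Carlo check of the bias of the ML variance estimator
# (divide by N) against the corrected one (divide by N-1).
import numpy as np

rng = np.random.default_rng(seed=2)
sigma2_true, N, n_toys = 4.0, 10, 100_000   # assumed values

ml, corrected = [], []
for _ in range(n_toys):
    x = rng.normal(0.0, np.sqrt(sigma2_true), N)
    s = np.sum((x - x.mean()) ** 2)
    ml.append(s / N)               # biased: E = sigma^2 (N-1)/N
    corrected.append(s / (N - 1))  # unbiased

print(f"ML estimator mean:        {np.mean(ml):.3f}  "
      f"(expect {sigma2_true * (N - 1) / N:.3f})")
print(f"corrected estimator mean: {np.mean(corrected):.3f}  "
      f"(expect {sigma2_true:.3f})")
```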

7 Example: biases with unknown distributions. In most practical cases p(x|m) is not well known, or the bias is hard to calculate explicitly. Biases are then studied by repeating the measurement on simulated samples and comparing the results with the input true values, or by applying the estimator to control samples for which the results are known. If deviations of the order of the variance occur, correcting the results of the measurement by subtracting the bias is dangerous: one needs confidence that the simulated experiments reproduce all features of the data (but then the source of the bias could probably also be identified and removed). When possible, work harder and suppress the bias. [Figures: estimated mass vs true mass, and bias vs true mass, from the 2007 CDF measurement of the lepton+jets top-quark mass.]

8 Information. A useful point of view for dealing with estimation and data reduction is the theory of Fisher's information (of some data on some parameter). Information should:
- increase linearly with the number of observations (doubling the observations doubles the information);
- be conditional on what we are interested in: data irrelevant for the quantity to be estimated should provide no information;
- connect with precision: the greater the information, the better the precision.
Any quantity with these properties is desirable in data reduction. One could think of pursuing methodologies that maximize reduction while minimizing information loss. [Photo: Ronald A. Fisher]

9 Fisher information. (If it exists) the Fisher information of an observation x on the parameter m, related by the likelihood $p(x|m) = L_x(m)$, is
$$I_x(m) = E\left[\left(\frac{\partial \log L_x(m)}{\partial m}\right)^{2}\right]$$
If the parameter of interest is a vector of parameters, this generalizes to
$$[I_x(m)]_{ij} = E\left[\frac{\partial \log L_x}{\partial m_i}\,\frac{\partial \log L_x}{\partial m_j}\right]$$
If (i) the possible values of x do not depend on m and (ii) the likelihood is twice differentiable and derivatives in m commute with integrals in x, then
$$[I_x(m)]_{ij} = -E\left[\frac{\partial^2 \log L_x}{\partial m_i\,\partial m_j}\right]$$
Information is additive: the information of N independent measurements is $N I_x$.
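As a numerical illustration of the definition (a sketch with assumed values; for N Gaussian observations with known σ, the information on the mean is exactly N/σ²):

```python
# Sketch: Monte Carlo estimate of the Fisher information of N Gaussian
# observations on the mean mu, via the expected squared score.
import numpy as np

rng = np.random.default_rng(seed=3)
mu, sigma, N, n_toys = 1.0, 2.0, 20, 50_000   # assumed values

# score: d/dmu log L = sum_i (x_i - mu) / sigma^2 ;  I = E[score^2]
scores = []
for _ in range(n_toys):
    x = rng.normal(mu, sigma, N)
    scores.append(np.sum(x - mu) / sigma**2)

print(f"MC Fisher information: {np.mean(np.square(scores)):.3f}")
print(f"exact N/sigma^2:       {N / sigma**2:.3f}")
```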

10 Comments: variance. Small variance is good: it implies high precision. Can it be arbitrarily small at a given number of observations N? No. The variance of an estimator m̂ is limited by the Cramér-Rao inequality
$$\mathrm{Var}(\hat m) = E[(\hat m - E[\hat m])^2] \;\ge\; \frac{(1 + db/dm)^2}{I_{\hat m}(m)} \;\ge\; \frac{(1 + db/dm)^2}{I_x(m)}$$
where $b = E[\hat m] - m$ is the bias of the estimator and $I_x(m)$ is the Fisher information of the observation x on the parameter m. Because for N observations the Fisher information is proportional to N, for an increasing number of measurements the variance of the estimator (that is, the precision of the measurement) does not decrease faster than 1/N. [Photos: Harald Cramér, C. R. Rao]
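A sketch of the bound in action, under assumed values: for a Gaussian mean the bound is $1/I_x = \sigma^2/N$; the sample mean attains it, while the sample median (also consistent and unbiased here) has variance about π/2 times larger.

```python
# Sketch: compare two estimators of a Gaussian mean with the Cramer-Rao
# bound sigma^2/N (unbiased case). The mean saturates the bound; the
# median does not.
import numpy as np

rng = np.random.default_rng(seed=4)
mu, sigma, N, n_toys = 0.0, 1.0, 100, 20_000   # assumed values

means = np.empty(n_toys)
medians = np.empty(n_toys)
for i in range(n_toys):
    x = rng.normal(mu, sigma, N)
    means[i] = x.mean()
    medians[i] = np.median(x)

print(f"CR bound sigma^2/N: {sigma**2 / N:.5f}")
print(f"Var(sample mean):   {means.var():.5f}")    # ~ bound
print(f"Var(sample median): {medians.var():.5f}")  # ~ (pi/2) * bound
```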

11 Comments: minimum variance bound. The minimum variance bound is a useful property for deciding which estimator to use in a given measurement. It also provides a practical and convenient way of estimating the best statistical resolution achievable on a quantity, before actually carrying out the measurement: it suffices to have a decent simulation to generate the likelihood of the possible observations and apply the CR inequality, assuming equality.

12 Efficiency and sufficiency. When both inequalities are equalities, the estimator reaches the minimum variance and is called efficient. A condition for this to happen is that, once the value m̂ of the estimator is known, complete knowledge of all the data x provides no further information on the parameter m. If that happens, m̂ is a sufficient statistic for m. Trivial cases: x itself, or any invertible f(x), is a sufficient statistic. More interesting are the cases where the dimensionality of m̂ is smaller than the dimensionality of x: there we have data reduction without information loss.

13 Aside: Darmois theorem. Given a likelihood L(m) = p(x|m), one cannot always find a sufficient statistic with a finite number of dimensions s independent of the number of observations N. For this to happen, the likelihood of a single measurement needs to belong to the exponential family
$$L(m) = p(x|m) = \exp\left[\sum_{i=1}^{s} \alpha_i(x)\, a_i(m) + \beta(x) + c(m)\right]$$
A rather restrictive condition: in most cases, reducing the data observations x into a lower-dimensional estimator leads to a loss of information. Nevertheless, data reduction is still often convenient if the information loss is moderate. [Photo: Georges Darmois]

14 Estimator robustness. The robustness of an estimator expresses the stability of its properties (mainly variance and bias) against variations of the shape of the likelihood p(x|m). Rather important in practice, because in most cases p(x|m) is unknown, or known only approximately. A good approach toward robustness (sketched in code below) is to:
- pick an estimator and a sufficiently broad class of p(x|m);
- evaluate the maximum variance of the estimator over the space of all the p(x|m) as the figure of merit of the estimator's performance;
- repeat for various estimators and choose the one showing the minimum value of the maximum variance.
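A minimal sketch of this minimax recipe, with an assumed family of contaminated Gaussians (the contamination fractions and outlier width are arbitrary choices): the mean wins at zero contamination, but the median has the smaller worst-case variance over the family.

```python
# Sketch of the minimax recipe: evaluate each estimator's variance over a
# family of contaminated Gaussians and compare the worst cases.
import numpy as np

rng = np.random.default_rng(seed=5)
N, n_toys = 50, 5_000   # assumed values

def variance_of(estimator, eps):
    """Toy-MC variance of the estimator under the contaminated model
    (1-eps) N(0,1) + eps N(0,10^2)."""
    est = np.empty(n_toys)
    for i in range(n_toys):
        outlier = rng.random(N) < eps
        x = np.where(outlier, rng.normal(0, 10, N), rng.normal(0, 1, N))
        est[i] = estimator(x)
    return est.var()

family = [0.0, 0.01, 0.05, 0.10]   # contamination fractions considered
for name, est in [("mean", np.mean), ("median", np.median)]:
    worst = max(variance_of(est, eps) for eps in family)
    print(f"max variance of {name:6s} over the family: {worst:.4f}")
```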

15 Inferring from data: maximum likelihood method

16 ML estimator properties. Call m̂ the estimate of the unknown parameter m obtained by finding the value of m that maximizes the likelihood $p(x|m) = L_x(m)$. Under weak hypotheses m̂ is consistent. In addition, if $L_x(m)$ is twice differentiable and the set of possible values for x does not depend on the value of m, then m̂ is:
- asymptotically efficient: for the number of observations N → ∞, the variance of the estimator $E[(\hat m - E[\hat m])^2]$ is the minimum possible;
- asymptotically normal: for N → ∞, the difference m̂ − m is distributed as a Gaussian with variance proportional to 1/N.
There are many estimators, but the ML estimator is the one you will use most often. Not necessarily the right one: it is just simple enough and has useful properties.

17 (Another) ML example: Poisson. Want to study a Poisson process, so assume the pdf
$$p(j|\mu) = \frac{\mu^j}{j!}\, e^{-\mu} = L(\mu)$$
Rather than maximising L, one can minimise −ln L, solving $\frac{d}{d\mu}\ln L(\mu)\big|_{\hat\mu} = 0$:
$$\frac{d}{d\mu}\left(\mu - j\ln\mu + \ln j!\right) = 1 - \frac{j}{\mu}$$
Given the observation j, the ML estimator of the mean rate of success μ is $\hat\mu = j$.
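The same result can be recovered numerically by minimising −ln L (a sketch; scipy is assumed available):

```python
# Sketch: recover the analytic result mu_hat = j by minimising -ln L.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

j = 5  # the single observed count

def nll(mu):
    # -ln L(mu) = mu - j ln(mu) + ln(j!)   (ln j! = gammaln(j+1))
    return mu - j * np.log(mu) + gammaln(j + 1)

res = minimize_scalar(nll, bounds=(1e-6, 50), method="bounded")
print(f"mu_hat = {res.x:.4f}  (analytic answer: {j})")
```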

18-22 Poisson illustrated. Assume one measurement of j = 5 from a Poisson distribution with no background. The data are fixed; the parameter μ varies:
$$p(j|\mu) = \frac{\mu^j}{j!}\, e^{-\mu} = L(\mu), \qquad L(\mu \mid j = 5) = \frac{\mu^5}{5!}\, e^{-\mu}$$
[Figures, built up over slides 18-22: the Poisson distribution p(j|μ) for μ = 0.5, 5, and 20 compared with the observation j = 5, and the resulting likelihood L(μ|j = 5) as a function of μ, which peaks at μ = 5.]

23 ML estimator variance. We have seen a few examples of ML estimates: given observations x₀, and assuming L(m) = p(x₀|m) known, the value m̂ that maximizes L offers an estimate of the true value of m with some attractive properties. OK, m̂ is the central value of our measurement: what is the uncertainty? It depends on the estimator's variance, whose determination is also part of the inference. Three options:
- Analytical calculation of $E[(\hat m - E[\hat m])^2]$. Requires knowledge of the analytical form of p(x|m), and the integrals should not be intractable. Rarely used, except for simple textbook examples (Poisson, exponential, Gaussian).
- Approximation by the minimum variance bound. Most commonly used; a good compromise in simple realistic applications.
- Brute force. The ultimate solution: accurate but work intensive.

24 Approximating the variance. The minimum variance bound offers an approximate estimate of the variance from the curvature (2nd derivative) of the log-likelihood at its maximum:
$$\hat V(\hat m) \approx \left(-E\left[\frac{\partial^2 \ln L}{\partial m^2}\right]\right)^{-1} \approx \left(-\frac{\partial^2 \ln L}{\partial m^2}\bigg|_{m=\hat m}\right)^{-1}$$
This is what most common minimisation packages (like MINUIT) will give you as the uncertainty of the ML estimate. There is no guarantee that for finite N the ML estimator has reached the minimum variance, but in many cases it is close enough.
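A sketch of this recipe on the Poisson example of the previous slides (j = 5), using a simple finite difference instead of a minimisation package:

```python
# Sketch: MINUIT-style uncertainty from the curvature of -ln L at its
# minimum, computed by a central finite difference.
import numpy as np

j = 5
nll = lambda mu: mu - j * np.log(mu)   # -ln L up to constants

mu_hat, h = float(j), 1e-4
# second derivative of -ln L at the minimum
d2 = (nll(mu_hat + h) - 2 * nll(mu_hat) + nll(mu_hat - h)) / h**2
sigma_hat = 1.0 / np.sqrt(d2)

print(f"sigma_hat = {sigma_hat:.4f}  (analytic sqrt(j) = {np.sqrt(j):.4f})")
```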

25 Approximating the variance (graphical, 1D). Expanding ln L in a Taylor series at the maximum:
$$\ln L(m) = \ln L(\hat m) + \frac{\partial \ln L}{\partial m}\bigg|_{m=\hat m}(m - \hat m) + \frac{1}{2}\frac{\partial^2 \ln L}{\partial m^2}\bigg|_{m=\hat m}(m - \hat m)^2 + \dots$$
The first term is ln L_max; the second is zero. In the third term, use the minimum variance bound to make the replacement
$$\frac{\partial^2 \ln L}{\partial m^2}\bigg|_{m=\hat m} \to -\frac{1}{\hat\sigma^2}$$
and get
$$\ln L(m) \approx \ln L_{\max} - \frac{(m - \hat m)^2}{2\hat\sigma^2}, \qquad \ln L(\hat m \pm \hat\sigma) \approx \ln L_{\max} - \frac{1}{2}$$
The values of the parameter corresponding to a decrease of ln L by 1/2 units approximate the boundaries of the 1-sigma uncertainties. [Figure: likelihood curve with central value and uncertainty interval.]
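A sketch of this graphical rule for the same Poisson example (j = 5): scanning −ln L for the points where it rises by 1/2 above the minimum gives, here, asymmetric 1-sigma bounds.

```python
# Sketch: find the two points where -ln L rises by 1/2 from its minimum.
import numpy as np
from scipy.optimize import brentq

j = 5
nll = lambda mu: mu - j * np.log(mu)
mu_hat, nll_min = float(j), j - j * np.log(j)

crossing = lambda mu: nll(mu) - (nll_min + 0.5)
lo = brentq(crossing, 1e-6, mu_hat)   # lower crossing
hi = brentq(crossing, mu_hat, 50.0)   # upper crossing

print(f"mu = {mu_hat:.2f}  +{hi - mu_hat:.3f} / -{mu_hat - lo:.3f}")
# asymmetric, unlike the symmetric sqrt(j) ~ 2.24 from the curvature
```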

26 Brute force. The safest and most robust way is to look at the width of the distribution of the ML estimator obtained from repeated measurements on independent samples. This implies generating a large number of simulated experiments and repeating the inference in each, to study the distribution of the estimator. This distribution is usually Gaussian for N large enough, and its variance provides a measure of the dispersion. It requires a lot of work and is often not necessary, but when the previous approximations fail this is the only way of getting your estimates right.

27 ML caveats. The ML properties only hold for infinite observations. For finite N, the ML estimator can show biases* and its distribution is unknown. The number of observations N needed to approach the asymptotic regime depends on the likelihood: low-dimensional, regular likelihoods become asymptotic already with O(10) observations; others need many more. The ML estimator will not tell you how good your fit is, that is, whether the assumed p(x|m) is a reasonable model of the observed data or not.
*This is why it is wrong to take the arithmetic mean of results obtained from multiple ML estimators, each based on a very small sample.

28 Simulation! Simulation! Simulation! The variety of issues and pathologies real likelihoods show is so broad that it is unrealistic to devise specific guidelines to address them. Much of this is art/black magic, based on a clear understanding of the fundamentals and on previous experience. The one general recommendation is to make extensive use of simplified simulated experiments ("toy Monte Carlo") to understand the distribution of the ML estimators and their properties before applying them to data. This is of the utmost importance and is usually done by (see the sketch below):
(1) choosing a plausible true value of the relevant parameter m;
(2) feeding it into L(m) and generating several sets of simulated data x from random numbers distributed according to p(x|m);
(3) maximizing the likelihood in each set and looking at the distribution of the estimator;
(4) repeating for a few other choices of the true value m (important and often overlooked).
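A minimal sketch of the four-step recipe for a toy Gaussian-mean fit (true values, sample size, and number of toys are assumed choices; the ML maximisation reduces to the sample mean in this case):

```python
# Sketch of the four-step toy-MC recipe for a Gaussian-mean fit.
import numpy as np

rng = np.random.default_rng(seed=6)
sigma, N, n_toys = 1.0, 30, 5_000   # assumed values

for m_true in (0.0, 2.0, 10.0):            # step (4): several true values
    estimates = np.empty(n_toys)
    for i in range(n_toys):
        x = rng.normal(m_true, sigma, N)   # steps (1)-(2): generate toys
        estimates[i] = x.mean()            # step (3): maximise L (analytic here)
    print(f"m_true = {m_true:5.1f}:  "
          f"<m_hat> = {estimates.mean():7.3f},  std = {estimates.std():.3f}")
```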

29 A standard example: pulls. Each entry is a simulated experiment, generated with the same set of true parameters. Plotted is the distribution of the difference between the ML estimate and the true value of the parameter, divided by the estimated standard deviation. [Figures: pull distributions (x_fit − x_true)/σ_fit and (y_fit − y_true)/σ_fit over many toys; one ML estimator is unbiased with uncertainty OK (σ = 1.01), the other perhaps biased (μ = 0.08 ± 0.04, σ = 1.00) with uncertainty OK.]
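A sketch of how such a pull distribution is built (assumed Gaussian-mean fit; the per-toy uncertainty is the usual s/√N estimate): an unbiased estimator with correct uncertainties gives pulls with mean ≈ 0 and width ≈ 1.

```python
# Sketch of a pull study: (fit - true)/sigma_fit should be approximately
# standard normal if the estimator is unbiased and its uncertainty correct.
import numpy as np

rng = np.random.default_rng(seed=7)
m_true, sigma, N, n_toys = 0.0, 1.0, 100, 10_000   # assumed values

pulls = np.empty(n_toys)
for i in range(n_toys):
    x = rng.normal(m_true, sigma, N)
    m_hat = x.mean()
    sigma_fit = x.std(ddof=1) / np.sqrt(N)   # per-toy uncertainty estimate
    pulls[i] = (m_hat - m_true) / sigma_fit

print(f"pull mean:  {pulls.mean():+.3f}  (expect ~0)")
print(f"pull width: {pulls.std():.3f}  (expect ~1)")
```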
