Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing
Ryan Reece, ryan.reece@gmail.com, http://rreece.github.io/
Insight Data Science - AI Fellows Workshop, Feb 16, 2018
Outline
1. Maximum likelihood estimators
2. Variance of MLEs
3. Confidence intervals
4. Efficiency example
5. A/B testing
Is this significant?
Statistical questions:
- How can we calculate the best-fit estimate of some parameter? (point estimation and confidence intervals)
- How can we be precise and rigorous about how confident we are that a model is wrong? (hypothesis testing)
[Figure: ATLAS H → ZZ(*) → 4l search, m_4l distributions: 2011 data, √s = 7 TeV, ∫L dt = 4.8 fb⁻¹, total background, and an m_H = 125 GeV, 1 x SM signal overlaid; an excess of 3 events with a quoted local p0. arXiv:1207.0319]
Confidence Intervals
A frequentist confidence interval is constructed such that, given the model, if the experiment were repeated, each time creating an interval, 95% (or other CL) of the intervals would contain the true population parameter (i.e. the interval has 95% coverage). They can be:
- One-sided exclusions, e.g. X > 100 at 95% CL
- Two-sided measurements, e.g. X = 15.1 ± 0.2 at 68% CL
- Contours in 2 or more parameters
This is not the same as saying "There is a 95% probability that the true parameter is in my interval." Any probability assigned to a parameter strictly involves a Bayesian prior probability. Bayes' theorem:
P(Theory | Data) ∝ P(Data | Theory) · P(Theory)
(posterior ∝ likelihood × prior)
Maximum likelihood method
Maximum likelihood
Consider a Gaussian distributed measurement:
f₁(x | µ, σ) = (1 / (√(2π) σ)) exp( −(x − µ)² / (2σ²) )
If we repeat the measurement, the joint PDF is just a product:
f(x⃗ | µ, σ) = ∏ᵢ f₁(xᵢ | µ, σ)
The likelihood function is the same function as the PDF, only thought of as a function of the parameters, given the data. The experiment is over.
L(µ, σ | x⃗) = f(x⃗ | µ, σ)
The likelihood principle states that the best estimates of the true parameters are the values that maximize the likelihood.
Maximum likelihood
It is often more convenient to consider the log likelihood, which has the same maximum:
ln L = ln ∏ᵢ f₁ = Σᵢ ln f₁ = Σᵢ [ −ln(√(2π) σ) − (xᵢ − µ)² / (2σ²) ]
Maximize:
∂ln L/∂µ = Σᵢ (xᵢ − µ̂)/σ² = 0  ⇒  Σᵢ (xᵢ − µ̂) = 0  ⇒  µ̂ = (1/N) Σᵢ xᵢ = x̄
which agrees with our intuition that the best estimate of the mean of a Gaussian is the sample mean.
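As a sanity check, a brute-force scan of the likelihood recovers the sample mean. This is a minimal sketch with simulated data; the sample, seed, and σ below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=1000)  # invented Gaussian data

def neg_log_likelihood(mu):
    # -ln L for a Gaussian with known sigma (constants kept for clarity)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (x - mu)**2 / (2 * sigma**2))

# brute-force scan of mu; the minimum of -ln L is the MLE
mus = np.linspace(4.0, 6.0, 2001)
mu_hat = mus[np.argmin([neg_log_likelihood(m) for m in mus])]

print(mu_hat, x.mean())  # agree to within the grid spacing (0.001)
```

In practice one would use a numerical optimizer rather than a grid scan, but the scan makes the "maximize ln L" idea concrete.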
Maximum likelihood
Note that in the case of a Gaussian PDF, maximizing the likelihood is equivalent to minimizing χ²:
ln L = Σᵢ [ −ln(√(2π) σ) − (xᵢ − µ)² / (2σ²) ]
is maximized when
χ² = Σᵢ (xᵢ − µ)² / σ²
is minimized. This was a simple example of what statisticians call point estimation. Now we would like to quantify our error on this estimate.
Variance of MLEs
Variance
True parameters: µ, σ = ?
Maximum Likelihood Estimator (MLE): µ̂ = (1/N) Σᵢ xᵢ = x̄
Assuming this model, how confident are we that µ̂ is close to µ? What is the variance of µ̂?
Variance
One would think that if the likelihood function varies rather slowly near the peak, then there is a wide range of parameter values consistent with the data, and thus the estimate should have a large error. To see the behavior of the likelihood function near the peak, consider the Taylor expansion of a general ln L of some parameter θ near its maximum likelihood estimate θ̂:
ln L(θ) = ln L(θ̂) + (∂ln L/∂θ)|_θ̂ (θ − θ̂) + (1/2!) (∂²ln L/∂θ²)|_θ̂ (θ − θ̂)² + ...
The first-derivative term vanishes at the maximum, and we write (∂²ln L/∂θ²)|_θ̂ = −1/s². Dropping the remaining terms would imply that
L(θ) = L(θ̂) exp( −(θ − θ̂)² / (2s²) )
i.e. ln L(θ) is parabolic.
Variance
So if this were a good approximation, we would expect the variance of θ̂ to be given by
V[θ̂] ≈ s² = ( −∂²ln L/∂θ² |_θ̂ )⁻¹
It turns out that there is more truth to this than you would think, given by an important theorem in statistics, the Cramér-Rao inequality:
V[θ̂] ≥ (1 + ∂b/∂θ)² / E[ −∂²ln L/∂θ² ]
where b is the bias of the estimator. An estimator's efficiency is defined to measure to what extent this inequality is an equality:
ε[θ̂] ≡ E[ −∂²ln L/∂θ² ]⁻¹ / V[θ̂]
Variance
It can be shown that in the large-sample limit, maximum likelihood estimators are unbiased and 100% efficient. Therefore, in principle, one can calculate the variance of an ML estimator with
V[θ̂] = E[ −∂²ln L/∂θ² ]⁻¹
Calculating the expectation value would involve an analytic integration over the PDFs of all our possible measurements, or a Monte Carlo simulation of it. In practice, one usually uses the observed maximum likelihood estimate in place of the expectation:
V[θ̂] = ( −∂²ln L/∂θ² |_θ̂ )⁻¹
Variance
Let's go back to our simple example of a Gaussian likelihood to test this method of calculating the ML estimator's variance.
V[µ̂] = ( −∂²ln L/∂µ² |_µ̂ )⁻¹
∂²ln L/∂µ² = ∂/∂µ Σᵢ (xᵢ − µ)/σ² = −N/σ²
⇒ V[µ̂] = σ²/N  ⇒  σ_µ̂ = σ/√N
which many of you will recognize as the proper error on the sample mean. If you are unfamiliar with it, we can actually derive it analytically in this case.
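The σ²/N result can also be checked by simulation. This is a toy sketch; the true µ, σ, N, and number of trials below are invented.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true, sigma, N, trials = 10.0, 3.0, 50, 100_000

# each row is one repeated "experiment"; the sample mean is the MLE for mu
mu_hats = rng.normal(mu_true, sigma, size=(trials, N)).mean(axis=1)

empirical = mu_hats.var()   # spread of the MLE over repeated experiments
predicted = sigma**2 / N    # = 0.18, the formula from the slide

print(empirical, predicted)
```

The empirical variance of the estimator over many repeats lands on σ²/N, which is exactly what "variance of the MLE" means operationally.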
Variance
Analytic variance of a Gaussian:
f₁(x | µ, σ) = (1/(√(2π) σ)) exp( −(x − µ)²/(2σ²) )
V[x̄] = E[x̄²] − E[x̄]²
     = E[ ((1/N) Σᵢ xᵢ)² ] − µ²
     = (1/N²) E[ Σ_{i≠j} xᵢxⱼ + Σᵢ xᵢ² ] − µ²
     = (1/N²) ( Σ_{i≠j} E[xᵢ]E[xⱼ] + Σᵢ E[xᵢ²] ) − µ²
To find E[x²], consider
V[x] = σ² = E[x²] − E[x]² = E[x²] − µ²  ⇒  E[x²] = σ² + µ²
Variance
⇒ V[x̄] = (1/N²) ( Σ_{i≠j} µ² + Σᵢ (σ² + µ²) ) − µ²
        = (1/N²) ( (N² − N) µ² + N(σ² + µ²) ) − µ²
        = σ²/N
which verifies the result we got from calculating derivatives of the likelihood function:
V[θ̂] = ( −∂²ln L/∂θ² |_θ̂ )⁻¹
In practice, one usually doesn't calculate this analytically, but instead:
- calculates the derivatives numerically, or
- uses the Δln L or Δχ² method, described now.
Variance by Δln L
Back to our Taylor expansion of ln L:
ln L(θ) = ln L(θ̂) + (1/2!) (∂²ln L/∂θ²)|_θ̂ (θ − θ̂)² + ...,  with (∂²ln L/∂θ²)|_θ̂ = −1/σ_θ̂²
Let
Δln L(θ) ≡ ln L(θ̂) − ln L(θ) ≃ (θ − θ̂)² / (2σ_θ̂²)
Evaluating at θ = θ̂ ± n σ_θ̂:
Δln L(θ̂ ± n σ_θ̂) = (n σ_θ̂)² / (2σ_θ̂²) = n²/2
Variance by Δln L
Δln L(θ̂ ± n σ_θ̂) = n²/2
This is the most common definition of the 68% and 95% confidence intervals:
68% / 1σ: Δln L = 1/2
95% / 2σ: Δln L = 2
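The Δln L = 1/2 rule can be demonstrated by scanning the Gaussian log likelihood directly. A sketch with invented data and seed; the crossing should land σ/√N away from µ̂.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, N = 2.0, 100
x = rng.normal(5.0, sigma, size=N)  # invented data
mu_hat = x.mean()

def ln_L(mu):
    # Gaussian log likelihood with known sigma, constants dropped
    return -np.sum((x - mu)**2) / (2 * sigma**2)

# scan upward from mu_hat until Delta ln L crosses 1/2
mus = np.linspace(mu_hat, mu_hat + 1.0, 10001)
delta = ln_L(mu_hat) - np.array([ln_L(m) for m in mus])
upper = mus[np.searchsorted(delta, 0.5)]  # first point with Delta ln L >= 1/2

print(upper - mu_hat, sigma / np.sqrt(N))  # both ~0.2
```

The same scan-and-cross procedure is what MINUIT-style profile-likelihood errors do in real fits, just with a smarter root finder.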
Variance by Δχ²
Recall that in the case that the PDF is Gaussian, −2 ln L is just the χ² statistic (up to a constant):
−2 ln L = χ² + const,  χ² = Σᵢ (xᵢ − µ)²/σᵢ²
Δχ²(θ̂ ± n σ_θ̂) ≡ χ²(θ̂ ± n σ_θ̂) − χ²_min = 2 Δln L(θ̂ ± n σ_θ̂) = n²
68% / 1σ: Δχ² = 1
95% / 2σ: Δχ² = 4 (3.84 for exact 95%)
Variance by Δχ²
Multi-dimensional case: the threshold Q_c such that P(Δχ² ≤ Q_c) = c depends on the number of fitted parameters n:

c       n = 1   n = 2   n = 3
0.683   1.00    2.30    3.53
0.95    3.84    5.99    7.82
0.99    6.63    9.21    11.3
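Assuming SciPy is available, this table can be reproduced with the χ² quantile function; the values match the table to rounding.

```python
from scipy.stats import chi2

# Q_c = chi2.ppf(c, df=n): Delta chi^2 threshold with coverage c
# for n simultaneously estimated parameters
for c in (0.683, 0.95, 0.99):
    print(c, [round(chi2.ppf(c, df=n), 2) for n in (1, 2, 3)])
```

This is why the 1-parameter 95% threshold is 3.84 and not 4: 2σ of a Gaussian is 95.45% coverage, while `chi2.ppf(0.95, df=1)` gives exactly 95%.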
Example: measuring an efficiency & A/B testing
Efficiency
Out of n trials, I measure k conversions. Estimate the conversion rate and its precision/confidence. Without a precision, we cannot know if observed changes are significant.
Efficiency
Binomial distribution:
f(k; n, p) = C(n, k) p^k (1 − p)^(n−k),  σ_k = √(n p (1 − p))
MLE: ε̂ = k/n,  σ_ε = σ_k/n = √(p(1 − p)/n)
For p ≈ 0.01: σ_ε = √((0.01)(0.99)/n) ≈ √(p/n) = 1/(10√n)

For p = 0.01:
n          σ_ε      σ_ε/p
400        0.0050   50%
1,000      0.0032   32%
2,000      0.0022   22%
10,000     0.0010   10%
10⁶        0.0001   1%
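The precision table follows directly from the σ_ε ≈ √(p/n) approximation above; a minimal sketch:

```python
import math

p = 0.01  # assumed true conversion rate, as on the slide
for n in (400, 1_000, 2_000, 10_000, 1_000_000):
    sigma_eps = math.sqrt(p / n)  # binomial error, (1 - p) ~ 1 dropped
    print(f"{n:>9,}  {sigma_eps:.4f}  {sigma_eps / p:.0%}")
```

Note the slow √n scaling: to know a 1% conversion rate to 1% relative precision takes a million trials.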
A/B testing
Example taken from here: https://www.blitzresults.com/en/ab-tests/
The data are like a 1-dim, 4-bin histogram: (converted, not converted) for each of A and B.
Assume the conversion rate is unchanged from A to B:
ε = N′_A/N_A ≃ N′_B/N_B ≃ (N′_A + N′_B)/(N_A + N_B)
Construct
χ² = Σᵢ (xᵢ − µᵢ)²/σᵢ² = (N′_A − N_A ε)²/(N_A ε) + ((N_A − N′_A) − N_A(1 − ε))²/(N_A(1 − ε)) + (A → B)
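The four-term χ² above takes only a few lines to compute. The counts in the usage line are invented for illustration (the slide's actual N values are not reproduced here).

```python
def ab_chi2(k_a, n_a, k_b, n_b):
    """Four-term chi^2 for an A/B test: k = conversions, n = trials."""
    eps = (k_a + k_b) / (n_a + n_b)  # pooled rate under "no change"
    total = 0.0
    for k, n in ((k_a, n_a), (k_b, n_b)):
        total += (k - n * eps)**2 / (n * eps)                    # converted bin
        total += ((n - k) - n * (1 - eps))**2 / (n * (1 - eps))  # non-converted bin
    return total

# hypothetical example: 100/1000 conversions in A vs 140/1000 in B
print(ab_chi2(100, 1000, 140, 1000))  # ~7.58, above the 99% CL threshold
```

Each bin contributes (observed − expected)²/expected, with the expected counts computed from the pooled rate ε, i.e. under the null hypothesis that A and B convert identically.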
A/B testing
Plugging in the above values gives: χ² = 7.5
Remember:
1σ (68%): 1
2σ (95%): 3.84
99%: 6.63
⇒ a >99% CL significant improvement in the B model
Hypothesis test
Take-aways
- MLEs are the best. Don't just calculate the MLE; find its variance!
- Quantify significance with a confidence interval.
- Under many common assumptions, 2 Δln L = Δχ², with χ² = Σᵢ (xᵢ − µᵢ)²/σᵢ², and one can calculate Δχ² to determine confidence intervals/contours:
  1σ (68%): 1
  2σ (95%): 3.84
  99%: 6.63
- In the simplest case of measuring an efficiency, A/B testing amounts to a 4-term χ² that can be calculated by hand. If the χ² is large, the change is significant!
Backup slides
Efficiency
Other examples with numbers:

Method    Numerator  Denominator  Mean (Mode)     Variance  Uncertainty σ
Poisson   1          45           0.022           0.00050   0.046
Binomial  1          45           0.022           0.00048   0.0197
Bayesian  1          45           0.0455 (0.022)  0.00085   0.0913

Method    Numerator  Denominator  Mean (Mode)     Variance  Uncertainty σ
Poisson   100        106          0.9433          0.0179    0.13151
Binomial  100        106          0.9433          0.00050   0.044
Bayesian  100        106          0.935 (0.9433)  0.00056   0.0358
Hypothesis testing
- Null hypothesis, H0: the SM
- Alternative hypothesis, H1: some new physics
- Type-I error: false positive rate (α)
- Type-II error: false negative rate (β)
- Power: 1 − β
We want to maximize the power for a fixed false positive rate. Particle physics has a tradition of claiming discovery at 5σ, p0 = 2.9 × 10⁻⁷ = 1 in 3.5 million, and presents exclusions with p0 = 5% (95% CL coverage).
Neyman-Pearson lemma (1933): the most powerful test for fixed α is a cut on the likelihood ratio:
L(x | H0) / L(x | H1) > k_α
[Photo: Neyman]
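For reference, the quoted p-values follow from the one-sided Gaussian tail; a minimal sketch using only the standard library:

```python
import math

def p_value(n_sigma):
    # one-sided tail probability of a standard normal beyond n_sigma
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))

print(p_value(5.0))    # ~2.87e-7, i.e. 1 in ~3.5 million
print(p_value(1.645))  # ~0.05, the one-sided 95% CL exclusion convention
```

Going the other way (p-value to σ) is the inverse of this function, e.g. `statistics.NormalDist().inv_cdf(1 - p)` in the Python standard library.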
Power & Significance
Variance by Δln L
What if L(θ) is not Gaussian, i.e. ln L(θ) is not parabolic? Likelihood functions have an invariance property, such that if g(θ) is a monotonic function, then the maximum likelihood estimate of g(θ) is g(θ̂). In principle, one can find a change-of-variables function g(θ) for which ln L(g(θ)) is parabolic as a function of g(θ). Therefore, using the invariance of the likelihood function, one can make inferences about a parameter of a non-Gaussian likelihood function without actually finding such a transformation [James p. 34].
Systematics
from: http://philosophy-in-figures.tumblr.com/