Primer on statistics:


Primer on statistics: MLE, Confidence Intervals, and Hypothesis Testing. ryan.reece@gmail.com, http://rreece.github.io/. Insight Data Science - AI Fellows Workshop, Feb 16, 2018

Outline: 1. Maximum likelihood estimators 2. Variance of MLEs 3. Confidence intervals 4. Efficiency example 5. A/B testing

Is this significant? Statistical questions: How can we calculate the best-fit estimate of some parameter? (Point estimation and confidence intervals.) How can we be precise and rigorous about how confident we are that a model is wrong? (Hypothesis testing.)
[Figure: ATLAS H→ZZ(*)→4l search, events / 5 GeV vs m_llll [GeV]; Data 2011, √s = 7 TeV, ∫L dt = 4.8 fb⁻¹, total background, and m_H = 125 GeV (1× SM) signal expectation; the excess of ~3 events near 125 GeV has a local p0 of 2% [arXiv:1207.0319].]

Confidence Intervals: A frequentist confidence interval is constructed such that, given the model, if the experiment were repeated, each time creating an interval, 95% (or other CL) of the intervals would contain the true population parameter (i.e. the interval has 95% coverage). They can be one-sided exclusions (e.g. X > 100 at 95% CL), two-sided measurements (e.g. X = 15.1 ± 0.2 at 68% CL), or contours in 2 or more parameters. This is not the same as saying "there is a 95% probability that the true parameter is in my interval." Any probability assigned to a parameter strictly involves a Bayesian prior probability. Bayes' theorem: $P(\mathrm{Theory} \mid \mathrm{Data}) \propto P(\mathrm{Data} \mid \mathrm{Theory}) \, P(\mathrm{Theory})$, i.e. posterior ∝ likelihood × prior.
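To make the coverage statement concrete, here is a minimal Python sketch (a toy model of Gaussian measurements with known sigma, my own assumption rather than something from the slides) that repeats an experiment many times and counts how often the 1-sigma interval around the sample mean contains the true parameter:

```python
# Minimal sketch (assumption: Gaussian measurements with known sigma) illustrating
# frequentist coverage: ~68% of the intervals xbar +/- sigma/sqrt(N) contain mu_true.
import numpy as np

rng = np.random.default_rng(42)
mu_true, sigma, N, n_experiments = 10.0, 2.0, 25, 10_000

covered = 0
for _ in range(n_experiments):
    x = rng.normal(mu_true, sigma, size=N)
    xbar = x.mean()
    half_width = sigma / np.sqrt(N)          # 1-sigma (68% CL) interval
    if xbar - half_width <= mu_true <= xbar + half_width:
        covered += 1

print(f"observed coverage: {covered / n_experiments:.3f}  (expect ~0.683)")
```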

Maximum likelihood method

Maximum likelihood: Consider a Gaussian distributed measurement: $f_1(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$. If we repeat the measurement, the joint PDF is just a product: $f(\vec{x} \mid \mu, \sigma) = \prod_i f_1(x_i \mid \mu, \sigma)$. The likelihood function is the same function as the PDF, only thought of as a function of the parameters, given the data (the experiment is over): $L(\mu, \sigma \mid \vec{x}) = f(\vec{x} \mid \mu, \sigma)$. The maximum likelihood principle states that the best estimates of the true parameters are the values which maximize the likelihood.

Maximum likelihood: It is often more convenient to consider the log-likelihood, which has the same maximum: $\ln L = \ln \prod_i f_1 = \sum_i \ln f_1 = \sum_i \left[ -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$. Maximize: $\frac{\partial \ln L}{\partial \mu} = \sum_i \frac{x_i - \hat{\mu}}{\sigma^2} = 0 \;\Rightarrow\; \sum_i (x_i - \hat{\mu}) = 0 \;\Rightarrow\; \hat{\mu} = \frac{1}{N}\sum_i x_i = \bar{x}$, which agrees with our intuition that the best estimate of the mean of a Gaussian is the sample mean.
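As a quick sanity check of this result, here is a minimal Python sketch (the toy data and the use of scipy.optimize.minimize_scalar are my own assumptions, not from the slides) that maximizes the Gaussian log-likelihood numerically and confirms that the MLE of mu coincides with the sample mean:

```python
# Minimal sketch: maximize the Gaussian log-likelihood numerically and check that
# the MLE of mu equals the sample mean (sigma treated as known here).
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 1.5
x = rng.normal(3.0, sigma, size=200)

def neg_log_likelihood(mu):
    # -ln L for Gaussian data with known sigma
    return 0.5 * np.sum((x - mu)**2) / sigma**2 + len(x) * np.log(np.sqrt(2 * np.pi) * sigma)

res = minimize_scalar(neg_log_likelihood)
print(res.x, x.mean())   # the two agree to numerical precision
```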

Maximum likelihood: Note that in the case of a Gaussian PDF, maximizing the likelihood is equivalent to minimizing $\chi^2$: $\ln L = \sum_i \left[ -\tfrac{1}{2}\ln(2\pi\sigma^2) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$ is maximized when $\chi^2 = \sum_i \frac{(x_i - \mu)^2}{\sigma^2}$ is minimized. This was a simple example of what statisticians call point estimation. Now we would like to quantify our error on this estimate.

Variance of MLEs

Variance: True parameters: $\mu, \sigma$. Maximum Likelihood Estimator (MLE): $\hat{\mu} = \frac{1}{N}\sum_i x_i = \bar{x}$. Assuming this model, how confident are we that $\hat{\mu}$ is close to $\mu$? What is the variance of $\hat{\mu}$?

Variance: One would think that if the likelihood function varies rather slowly near the peak, then there is a wide range of values of the parameters that are consistent with the data, and thus the estimate should have a large error. To see the behavior of the likelihood function near the peak, consider the Taylor expansion of a general $\ln L$ of some parameter $\theta$, near its maximum likelihood estimate $\hat{\theta}$: $\ln L(\theta) = \ln L(\hat{\theta}) + \underbrace{\frac{\partial \ln L}{\partial \theta}\bigg|_{\hat{\theta}}}_{=\,0}(\theta - \hat{\theta}) + \frac{1}{2!}\underbrace{\frac{\partial^2 \ln L}{\partial \theta^2}\bigg|_{\hat{\theta}}}_{\equiv\, -1/s^2}(\theta - \hat{\theta})^2 + \cdots$ Dropping the remaining terms would imply that $L(\theta) = L(\hat{\theta})\exp\!\left(-\frac{(\theta - \hat{\theta})^2}{2 s^2}\right)$, i.e. that $\ln L(\theta)$ is parabolic.

Variance: So if this were a good approximation, we would expect that the variance of $\hat{\theta}$ would be given by $V[\hat{\theta}] \approx \hat{\sigma}^2_{\hat{\theta}} = s^2 = \left(-\frac{\partial^2 \ln L}{\partial \theta^2}\bigg|_{\hat{\theta}}\right)^{-1}$. It turns out that there is more truth to this than you would think, given by an important theorem in statistics, the Cramér-Rao inequality: $V[\hat{\theta}] \geq \left(1 + \frac{\partial b}{\partial \theta}\right)^{2} \Big/ \, E\!\left[-\frac{\partial^2 \ln L}{\partial \theta^2}\right]$, where $b$ is the bias of the estimator. An estimator's efficiency is defined to measure to what extent this inequality is saturated: $\varepsilon[\hat{\theta}] \equiv \frac{\left(E\!\left[-\frac{\partial^2 \ln L}{\partial \theta^2}\right]\right)^{-1}}{V[\hat{\theta}]}$

Variance: It can be shown that in the large-sample limit, maximum likelihood estimators are unbiased and 100% efficient. Therefore, in principle, one can calculate the variance of an ML estimator with $V[\hat{\theta}] = \left(E\!\left[-\frac{\partial^2 \ln L}{\partial \theta^2}\right]\right)^{-1}$. Calculating the expectation value would involve an analytic integration over the PDFs of all our possible measurements, or a Monte Carlo simulation of it. In practice, one usually evaluates the derivative at the observed maximum likelihood estimate instead of taking the expectation: $V[\hat{\theta}] \approx \left(-\frac{\partial^2 \ln L}{\partial \theta^2}\bigg|_{\hat{\theta}}\right)^{-1}$

Variance: Let's go back to our simple example of a Gaussian likelihood to test this method of calculating the ML estimator's variance: $V[\hat{\mu}] = \left(-\frac{\partial^2 \ln L}{\partial \mu^2}\bigg|_{\hat{\mu}}\right)^{-1}$, with $\frac{\partial^2 \ln L}{\partial \mu^2} = \frac{\partial}{\partial \mu}\sum_i \frac{x_i - \mu}{\sigma^2} = -\frac{N}{\sigma^2}$, so $V[\hat{\mu}] = \frac{\sigma^2}{N}$ and $\sigma_{\hat{\mu}} = \frac{\sigma}{\sqrt{N}}$, which many of you will recognize as the proper error on the sample mean. If you are unfamiliar with it, we can actually derive it analytically in this case.
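A short numerical illustration of the same recipe (toy Gaussian data and a finite-difference second derivative; my own sketch, not from the slides): estimate V[mu_hat] from the curvature of ln L at its maximum and compare it to sigma^2/N:

```python
# Minimal sketch: estimate V[mu_hat] from the curvature of ln L at its maximum
# (numerical second derivative) and compare to the analytic result sigma^2 / N.
import numpy as np

rng = np.random.default_rng(1)
sigma, N = 2.0, 500
x = rng.normal(0.0, sigma, size=N)

def log_likelihood(mu):
    return -0.5 * np.sum((x - mu)**2) / sigma**2   # constant terms dropped

mu_hat = x.mean()
h = 1e-4
second_deriv = (log_likelihood(mu_hat + h) - 2 * log_likelihood(mu_hat)
                + log_likelihood(mu_hat - h)) / h**2
var_from_curvature = -1.0 / second_deriv

print(var_from_curvature, sigma**2 / N)   # both ~ sigma^2 / N
```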

Variance: Analytic variance of the sample mean of a Gaussian, $f_1(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$: $V[\bar{x}] = E[\bar{x}^2] - E[\bar{x}]^2 = E\!\left[\left(\frac{1}{N}\sum_i x_i\right)^{\!2}\right] - \mu^2 = \frac{1}{N^2}E\!\left[\sum_{i \neq j} x_i x_j + \sum_i x_i^2\right] - \mu^2 = \frac{1}{N^2}\left(\sum_{i \neq j} E[x]^2 + \sum_i E[x^2]\right) - \mu^2$. To find $E[x^2]$, consider $V[x] = \sigma^2 = E[x^2] - E[x]^2 = E[x^2] - \mu^2 \;\Rightarrow\; E[x^2] = \sigma^2 + \mu^2$.

Variance: $\Rightarrow\; V[\bar{x}] = \frac{1}{N^2}\left(\sum_{i \neq j}\mu^2 + \sum_i(\sigma^2 + \mu^2)\right) - \mu^2 = \frac{1}{N^2}\left[(N^2 - N)\mu^2 + N(\sigma^2 + \mu^2)\right] - \mu^2 = \frac{\sigma^2}{N}$, which verifies the result we got from calculating derivatives of the likelihood function: $V[\hat{\theta}] = \left(-\frac{\partial^2 \ln L}{\partial \theta^2}\bigg|_{\hat{\theta}}\right)^{-1}$. In practice, one usually doesn't calculate this analytically, but instead calculates the derivatives numerically, or uses the $\Delta\ln L$ (or $\Delta\chi^2$) method, described now.

Variance by $\Delta\ln L$: Back to our Taylor expansion of $\ln L$: $\ln L(\theta) = \ln L(\hat{\theta}) + \frac{1}{2!}\underbrace{\frac{\partial^2 \ln L}{\partial \theta^2}\bigg|_{\hat{\theta}}}_{\equiv\, -1/\hat{\sigma}^2_{\hat{\theta}}}(\theta - \hat{\theta})^2 + \cdots$ Let $\Delta\ln L(\theta) \equiv \ln L(\theta) - \ln L(\hat{\theta}) \simeq -\frac{(\theta - \hat{\theta})^2}{2\,\hat{\sigma}^2_{\hat{\theta}}}$. Evaluating at $\theta = \hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}$ gives $\Delta\ln L(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) = -\frac{(\pm n\,\hat{\sigma}_{\hat{\theta}})^2}{2\,\hat{\sigma}^2_{\hat{\theta}}} = -\frac{n^2}{2}$

Variance by $\Delta\ln L$: $\Delta\ln L(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) = -\frac{n^2}{2}$. This is the most common definition of the 68% and 95% confidence intervals: 68% / $1\,\hat{\sigma}$: $\Delta\ln L = -1/2$; 95% / $2\,\hat{\sigma}$: $\Delta\ln L = -2$.

Variance by $\Delta\chi^2$: Recall that in the case that the PDF is Gaussian, $-2\ln L$ is just the $\chi^2$ statistic up to a constant: $-2\ln L = \chi^2 + \text{const}$, with $\chi^2 = \sum_i \frac{(x_i - \mu)^2}{\sigma^2}$. Then $\Delta\ln L(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) = \ln L(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) - \ln L_{\max} = -\frac{n^2}{2}$ implies $\Delta\chi^2(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) = \chi^2(\hat{\theta} \pm n\,\hat{\sigma}_{\hat{\theta}}) - \chi^2_{\min} = n^2$. So: 68% / $1\,\hat{\sigma}$: $\Delta\chi^2 = 1$; 95% / $2\,\hat{\sigma}$: $\Delta\chi^2 = 4$ (3.84 for exactly 95% CL).
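The Delta chi^2 = 1 prescription can be applied directly by scanning the parameter, as in this minimal sketch (toy Gaussian data with known sigma, my own assumption):

```python
# Minimal sketch: scan the chi^2 (= -2 ln L up to a constant) as a function of mu
# and read off the 68% interval from Delta(chi^2) = 1 (Delta(chi^2) = 3.84 gives 95%).
import numpy as np

rng = np.random.default_rng(2)
sigma, N = 1.0, 50
x = rng.normal(5.0, sigma, size=N)

mu_scan = np.linspace(x.mean() - 1, x.mean() + 1, 2001)
chi2 = np.array([np.sum((x - mu)**2) / sigma**2 for mu in mu_scan])
delta_chi2 = chi2 - chi2.min()

inside = mu_scan[delta_chi2 <= 1.0]                  # 68% CL interval
print(inside.min(), inside.max())                    # compare to mu_hat +/- sigma/sqrt(N)
print(x.mean() - sigma / np.sqrt(N), x.mean() + sigma / np.sqrt(N))
```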

Variance by $\Delta\chi^2$, multi-dimensional case. Thresholds $Q_c$ such that $\Delta\chi^2 \le Q_c$ defines a confidence region with coverage $c$ over $n$ parameters:
c = 0.683: Q_c = 1.00 (n=1), 2.30 (n=2), 3.53 (n=3)
c = 0.95: Q_c = 3.84 (n=1), 5.99 (n=2), 7.82 (n=3)
c = 0.99: Q_c = 6.63 (n=1), 9.21 (n=2), 11.3 (n=3)
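For reference, the thresholds in this table can be reproduced with scipy; chi2.ppf is the inverse CDF of the chi-squared distribution with the given degrees of freedom:

```python
# Minimal sketch: reproduce the Delta(chi^2) thresholds Q_c in the table above.
from scipy.stats import chi2

for cl in (0.683, 0.95, 0.99):
    thresholds = [chi2.ppf(cl, df=n) for n in (1, 2, 3)]
    print(cl, [f"{q:.2f}" for q in thresholds])
# Expect approximately 1.00 / 2.30 / 3.53, 3.84 / 5.99 / 7.81, 6.63 / 9.21 / 11.34
```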

Example: measuring an efficiency & A/B testing

Efficiency: Out of n trials I measure k conversions. Estimate the conversion rate and its precision/confidence. Without a precision, we cannot know if observed changes are significant.

Efficiency: Binomial distribution: $f(k; n, p) = \binom{n}{k} p^k (1-p)^{n-k}$, with $\sigma_k = \sqrt{n\,p\,(1-p)}$. MLE: $\hat{\varepsilon} = \frac{k}{n}$, with uncertainty $\sigma_\varepsilon = \frac{\sigma_k}{n} = \sqrt{\frac{p(1-p)}{n}}$. For $p \approx 0.01$: $\sigma_\varepsilon = \sqrt{\frac{(0.01)(0.99)}{n}} \approx \frac{1}{10\sqrt{n}}$.
For p = 0.01:
n = 400: σ_ε = 0.0050 (σ_ε/p = 50%)
n = 1,000: σ_ε = 0.0031 (31%)
n = 2,000: σ_ε = 0.0022 (22%)
n = 10,000: σ_ε = 0.0010 (10%)
n = 10^6: σ_ε = 0.0001 (1%)
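A minimal sketch reproducing the rows of this table from the binomial formula (straightforward arithmetic, no data assumed):

```python
# Minimal sketch: uncertainty on a conversion rate (binomial efficiency) for p = 0.01,
# reproducing the rows of the table above.
import numpy as np

p = 0.01
for n in (400, 1_000, 2_000, 10_000, 1_000_000):
    sigma_eps = np.sqrt(p * (1 - p) / n)    # sigma_k / n with sigma_k = sqrt(n p (1-p))
    print(f"n = {n:>9,d}   sigma_eps = {sigma_eps:.4f}   sigma_eps / p = {sigma_eps / p:.0%}")
```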

A/B testing (example taken from https://www.blitzresults.com/en/ab-tests/): The data are like a 1-dimensional, 4-bin histogram (converted / not converted, for A and B). Assume the conversion rate is unchanged from A to B (the null hypothesis): $\varepsilon = \frac{N'_A}{N_A} \simeq \frac{N'_B}{N_B} \simeq \frac{N'_A + N'_B}{N_A + N_B}$. Construct $\chi^2 = \sum \frac{(x - \mu)^2}{\sigma^2} = \frac{(N'_A - N_A\,\varepsilon)^2}{N_A\,\varepsilon} + \frac{\big((N_A - N'_A) - N_A(1-\varepsilon)\big)^2}{N_A(1-\varepsilon)} + (A \to B)$

A/B testing: Plugging in the above values gives $\chi^2 = 7.5$. Remember: 1σ (68% CL): $\Delta\chi^2 = 1$; 2σ (95% CL): 3.84; 3σ (99% CL): 6.63. So B is a significant improvement over A at > 99% CL.
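Here is a minimal sketch of the 4-term chi^2 described above; the counts below are hypothetical placeholders, since the actual numbers from the blitzresults.com example are not reproduced on the slide:

```python
# Minimal sketch of the 4-term chi^2 for an A/B test on conversion counts.
# N_A, conv_A, N_B, conv_B are hypothetical placeholders; swap in your own counts.
N_A, conv_A = 10_000, 100     # trials and conversions in variant A (hypothetical)
N_B, conv_B = 10_000, 140     # trials and conversions in variant B (hypothetical)

eps = (conv_A + conv_B) / (N_A + N_B)   # pooled rate under the null "no change"

def chi2_terms(N, conv):
    expected_conv = N * eps
    expected_fail = N * (1 - eps)
    return ((conv - expected_conv)**2 / expected_conv
            + ((N - conv) - expected_fail)**2 / expected_fail)

chi2 = chi2_terms(N_A, conv_A) + chi2_terms(N_B, conv_B)
print(f"chi2 = {chi2:.2f}")   # compare to 1 (68%), 3.84 (95%), 6.63 (99%)
```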

Hypothesis test

Take-aways: MLEs are the best. Don't just calculate the MLE, find its variance! Quantify significance with a confidence interval. Under many common assumptions $-2\ln L = \chi^2 + \text{const}$, with $\chi^2 = \sum \frac{(x - \mu)^2}{\sigma^2}$, and one can calculate $\Delta\chi^2$ to determine confidence intervals/contours: 1σ (68% CL): $\Delta\chi^2 = 1$; 2σ (95% CL): 3.84; 3σ (99% CL): 6.63. In the simplest case of measuring an efficiency, A/B testing amounts to a 4-term $\chi^2$ that can be calculated by hand. If the $\chi^2$ is large, the change is significant!

Back up slides

Efficiency, other examples with numbers:
Method | Numerator | Denominator | Mean (Mode) | Variance | Uncertainty σ
Poisson | 1 | 45 | 0.0222 | 0.00050 | 0.0225
Binomial | 1 | 45 | 0.0222 | 0.00048 | 0.0220
Bayesian | 1 | 45 | 0.0426 (0.0222) | 0.00085 | 0.0291
Poisson | 100 | 106 | 0.9434 | 0.0173 | 0.1315
Binomial | 100 | 106 | 0.9434 | 0.00050 | 0.0224
Bayesian | 100 | 106 | 0.9352 (0.9434) | 0.00056 | 0.0236

Hypothesis testing: Null hypothesis, H0: the SM. Alternative hypothesis, H1: some new physics. Type-I error: false positive rate (α). Type-II error: false negative rate (β). Power: 1 − β. We want to maximize the power for a fixed false positive rate. Particle physics has a tradition of claiming discovery at 5σ, p0 = 2.9 × 10⁻⁷ = 1 in 3.5 million, and presents exclusions with p0 = 5% (95% CL coverage). Neyman-Pearson lemma (1933): the most powerful test at fixed α rejects H0 when the likelihood ratio $\frac{L(x \mid H_1)}{L(x \mid H_0)} > k_\alpha$.
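As an illustration of the Neyman-Pearson construction, here is a minimal sketch (two simple Gaussian hypotheses of my own choosing, not from the slides) that fixes the false positive rate with toy experiments generated under H0 and then measures the power under H1:

```python
# Minimal sketch of a Neyman-Pearson likelihood-ratio test for two simple
# hypotheses: x ~ N(0, 1) under H0 vs x ~ N(1, 1) under H1 (toy assumptions).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, alpha = 20, 0.05

def test_statistic(x):
    # log of L(x|H1) / L(x|H0); reject H0 when it is large
    return np.sum(norm.logpdf(x, loc=1.0) - norm.logpdf(x, loc=0.0))

# Critical value k_alpha from toys generated under H0 (fixes the false positive rate)
toys_h0 = np.array([test_statistic(rng.normal(0.0, 1.0, N)) for _ in range(20_000)])
k_alpha = np.quantile(toys_h0, 1 - alpha)

# Power: fraction of H1 toys that exceed k_alpha
toys_h1 = np.array([test_statistic(rng.normal(1.0, 1.0, N)) for _ in range(20_000)])
print(f"power at alpha = {alpha}: {(toys_h1 > k_alpha).mean():.3f}")
```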

Power & Significance

Variance by $\Delta\ln L$: What if $L(\theta)$ is not Gaussian, i.e. $\ln L(\theta)$ is not parabolic? Likelihood functions have an invariance property, such that if $g(\theta)$ is a monotonic function, then the maximum likelihood estimate of $g(\theta)$ is $g(\hat{\theta})$. In principle, one can find a change-of-variables function $g(\theta)$ for which $\ln L(g(\theta))$ is parabolic as a function of $g(\theta)$. Therefore, using the invariance of the likelihood function, one can make inferences about a parameter of a non-Gaussian likelihood function without actually finding such a transformation [James, p. 34].

Systematics. [Figure from: http://philosophy-in-figures.tumblr.com/]