Statistics for Particle Physics. Kyle Cranmer. New York University. Kyle Cranmer (NYU) CERN Academic Training, Feb 2-5, 2009



Hypothesis Testing

Hypothesis testing

One of the most common uses of statistics in particle physics is hypothesis testing. Assume one has a pdf for the data under two hypotheses:
- Null hypothesis, H0: e.g. background-only
- Alternate hypothesis, H1: e.g. signal-plus-background

One makes a measurement and then needs to decide whether to reject or accept H0.

[Figure: probability distributions for the number of events observed (50-180) under the two hypotheses.]

Hypothesis testing

Before we can make much progress with statistics, we need to decide what it is that we want to do. First, let us define a few terms:
- Rate of Type I error: α
- Rate of Type II error: β
- Power = 1 − β

Treat the two hypotheses asymmetrically; the null is special:
- Fix the rate of Type I error and call it the size of the test.

Now one can state a well-defined goal: maximize the power for a fixed rate of Type I error.
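These definitions can be made concrete with a simple counting experiment. A minimal sketch, assuming made-up rates (b = 100 expected background events, s = 50 expected signal events) and a made-up cut at 120 observed events:

```python
import math

def pois_tail(n_cut, mu):
    """P(n >= n_cut) for a Poisson distribution with mean mu."""
    return 1.0 - sum(math.exp(-mu) * mu ** n / math.factorial(n)
                     for n in range(n_cut))

b, s = 100.0, 50.0   # hypothetical background and signal expectations
n_cut = 120          # reject H0 when we observe >= 120 events

alpha = pois_tail(n_cut, b)            # Type I error rate = size of the test
beta = 1.0 - pois_tail(n_cut, b + s)   # Type II error rate
power = 1.0 - beta                     # power of the test
```

Raising n_cut lowers α but also lowers the power; the Neyman-Pearson goal stated above is to choose the rejection region so that the power is as large as possible at fixed α.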

Hypothesis testing

The idea of a 5σ discovery criterion for particle physics is really a conventional way to specify the size of the test. Usually 5σ corresponds to α = 2.87 × 10⁻⁷, i.e. a very small chance that we wrongly reject the standard model.

In the simple case of number counting it is obvious what region is sensitive to the presence of a new signal, but in higher dimensions it is not so easy.

[Figures: the number-counting distributions again, and a two-dimensional example in the (x1, x2) plane with a decision boundary separating the H0 and H1 regions; from G. Cowan.]
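The conversion between a z-value and the size of the test is just a one-sided Gaussian tail integral; a quick sketch:

```python
import math

def z_to_alpha(z):
    """One-sided Gaussian tail probability for a z-sigma threshold."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# the conventional 5-sigma discovery threshold
alpha_5sigma = z_to_alpha(5.0)   # ~2.87e-7
```

The same function gives the familiar α ≈ 0.05 for a one-sided 1.64σ cut.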

The Neyman-Pearson Lemma

In 1928-1938 Neyman & Pearson developed a theory in which one must consider competing hypotheses:
- the null hypothesis H0 (background-only)
- the alternate hypothesis H1 (signal-plus-background)

Given some probability α that we wrongly reject the null hypothesis,

α = P(x ∉ W | H0)

(convention: if the data falls in W, then we accept H0), find the region W that minimizes the probability of wrongly accepting H0 (when H1 is true),

β = P(x ∈ W | H1)

The Neyman-Pearson Lemma

The region W that minimizes the probability of wrongly accepting H0 is just a contour of the likelihood ratio:

P(x|H1) / P(x|H0) > k_α

Any other region of the same size will have less power.

The likelihood ratio is an example of a test statistic, i.e. a real-valued function that summarizes the data in a way relevant to the hypotheses being tested.
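As a toy illustration (the two Gaussian hypotheses here are made up, not from the slides): for unit-width Gaussians the log likelihood ratio is monotone in x, so the likelihood-ratio cut reduces to a simple cut on x, and its size can be checked with pseudo-experiments:

```python
import math
import random

random.seed(1)

# Hypothetical 1-D example: H0 is N(0,1), H1 is N(1,1).
def log_lr(x, mu0=0.0, mu1=1.0):
    """log of P(x|H1)/P(x|H0) for unit-width Gaussian hypotheses."""
    return -0.5 * ((x - mu1) ** 2 - (x - mu0) ** 2)

# For equal-width Gaussians log_lr(x) = x - 0.5 is monotone in x,
# so cutting on the likelihood ratio is the same as cutting on x.
k = log_lr(1.6449)                                     # threshold for alpha ~ 5%
toys = [random.gauss(0.0, 1.0) for _ in range(10000)]  # pseudo-data under H0
alpha_mc = sum(log_lr(x) > k for x in toys) / len(toys)
```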

A short proof of Neyman-Pearson

Consider the contour of the likelihood ratio, P(x|H1)/P(x|H0) > k_α, that has a given size (e.g. the probability under H0 of the acceptance region W is 1 − α; its complement is W^C).

A short proof of Neyman-Pearson

Now consider a variation on the contour that has the same size (i.e. the same probability under H0): the region ΔW₊ that is gained and the region ΔW₋ that is lost satisfy

P(ΔW₊ | H0) = P(ΔW₋ | H0)

Because the new area ΔW₊ is outside the contour of the likelihood ratio, P(x|H1)/P(x|H0) < k_α there, so we have the inequality

P(ΔW₊ | H1) < P(ΔW₊ | H0) k_α

And for the region ΔW₋ we lost, P(x|H1)/P(x|H0) > k_α, so we also have

P(ΔW₋ | H1) > P(ΔW₋ | H0) k_α

Together they give

P(ΔW₊ | H1) < P(ΔW₋ | H1)

The new region has less power.

Decision Theory

One of the deficiencies of the Neyman-Pearson approach is that one must specify the size α of the test. But where does α come from? Is it purely conventional, or is there a reason?

A great deal of literature related to statistics (and economics, etc.) is devoted to making decisions; one needs to consider the utility or risk of the different outcomes. In the context of decision and utility theory there can be a justification for a particular α, but this is rarely done in particle physics.

An explicit likelihood ratio

In that case:

Q = L(x|H1) / L(x|H0) = ∏_i^{N_chan} [ Pois(n_i | s_i + b_i) ∏_j^{n_i} (s_i f_s(x_ij) + b_i f_b(x_ij)) / (s_i + b_i) ] / [ Pois(n_i | b_i) ∏_j^{n_i} f_b(x_ij) ]

q = ln Q = −s_tot + ∑_i^{N_chan} ∑_j^{n_i} ln( 1 + s_i f_s(x_ij) / (b_i f_b(x_ij)) )

[Figures: the LEP Higgs search probability density of −2 ln Q at m_H = 115 GeV/c², showing the observed value against the expectations for background and for signal-plus-background; and −2 ln Q as a function of m_H from 106 to 120 GeV/c².]
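A single-channel toy version of this statistic can be sketched as follows (the yields s and b and the pdf shapes are invented for illustration; the LEP analysis combined many channels):

```python
import math
import random

random.seed(42)

s, b = 5.0, 20.0   # hypothetical expected signal and background yields

def f_s(x):
    """Hypothetical signal pdf: Gaussian peak at x = 2 with width 0.5."""
    return math.exp(-0.5 * ((x - 2.0) / 0.5) ** 2) / (0.5 * math.sqrt(2.0 * math.pi))

def f_b(x):
    """Hypothetical background pdf: flat on [0, 4]."""
    return 0.25

def q_statistic(events):
    """q = ln Q = -s + sum_j ln(1 + s f_s(x_j) / (b f_b(x_j)))."""
    return -s + sum(math.log(1.0 + s * f_s(x) / (b * f_b(x))) for x in events)

# one background-like pseudo-experiment with 20 uniform events
bkg_events = [random.uniform(0.0, 4.0) for _ in range(20)]
q_b = q_statistic(bkg_events)
```

Each event shifts q by the log of its per-event likelihood ratio, and the −s term accounts for the Poisson normalization; q ≥ −s always holds since each event's contribution is non-negative.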

Kernel Estimation

Kernel estimation is the generalization of average shifted histograms: "the data is the model".

f̂₁(x) = ∑_i^n 1/(n h(x_i)) K( (x − x_i) / h(x_i) ),  with h(x_i) = (4/3)^{1/5} √( σ / f̂₀(x_i) ) n^{−1/5}

Adaptive kernel estimation puts wider kernels in regions of low probability. Used at LEP for describing pdfs from Monte Carlo (KEYS).

K. Cranmer, Comput. Phys. Commun. 136 (2001) [hep-ex/0011057]

[Figure: probability density of a neural network output described by the kernel estimate.]
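A minimal sketch of the adaptive idea, following the form shown on the slide: build a fixed-bandwidth pilot estimate f̂₀, then give each point a bandwidth that widens where the pilot density is small (the Gaussian toy dataset is made up):

```python
import math
import random

random.seed(7)
data = [random.gauss(0.0, 1.0) for _ in range(500)]   # toy sample
n = len(data)

# Fixed-bandwidth pilot estimate with the Gaussian rule-of-thumb bandwidth
mean = sum(data) / n
sigma = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
h0 = (4.0 / 3.0) ** 0.2 * sigma * n ** -0.2

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def f_pilot(x):
    return sum(gauss_kernel((x - xi) / h0) for xi in data) / (n * h0)

# Adaptive step: per-point bandwidths, wider where the pilot density is low
h = [(4.0 / 3.0) ** 0.2 * math.sqrt(sigma / f_pilot(xi)) * n ** -0.2
     for xi in data]

def f_adaptive(x):
    return sum(gauss_kernel((x - xi) / hi) / hi
               for xi, hi in zip(data, h)) / n
```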

2 discriminating variables

Often one uses the output of a neural network or multivariate algorithm in place of a true likelihood ratio. That's fine, but what do you do with it? If you have a fixed cut for all events, this is what you are doing:

q = ln Q = −s + ln( 1 + s f_s(x, y) / (b f_b(x, y)) )

[Figure: two events (x₁, y₁) and (x₂, y₂) in the plane of the discriminating variables, and the resulting distribution of q.]

Experiments vs. Events

Ideally, you want to cut on the likelihood ratio for your experiment, which is equivalent to a sum of log-likelihood ratios:

q₁₂ = q₁ + q₂

It is easy to see that this includes experiments where one event had a high likelihood ratio and the other one was relatively small.

[Figure: the two events in the plane of the discriminating variables, and the distributions of q₁, q₂, and q₁₂.]

Decision Theory

[Figure: the decision-theoretic view of the choice of α; from Fred James' lectures.]

Decisions: Bayesian & Frequentist

The structure of P(x|H0) and P(x|H1) puts limits on the allowable ranges of α and β.

Bayesians want to minimize the expected risk, based on priors and the risk/utility of the outcomes. Frequentists don't have priors to work with, so they only have risk/utility in two situations:
- the minimax approach aims to minimize the maximum risk
- it is the most conservative choice: "paranoid" for games against nature (F. James, Ch. 6)

The frequentist choice of α, interpreted in a Bayesian framework, implies a threshold built from the losses l₀, l₁ and the prior µ: accept H0 when

l₁ (1 − µ) P(X|H1) < l₀ µ P(X|H0),  i.e.  P(X|H1) / P(X|H0) < l₀ µ / (l₁ (1 − µ))
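In this simple loss-based version, the implied likelihood-ratio threshold is just the ratio of prior-weighted losses; a two-line sketch with made-up numbers for l₀, l₁, and µ:

```python
# Losses and prior are made-up numbers for illustration.
l0 = 1.0    # loss for wrongly rejecting H0 (Type I error)
l1 = 10.0   # loss for wrongly accepting H0 (Type II error)
mu = 0.5    # prior probability of H0

# Accept H0 whenever P(X|H1) / P(X|H0) < k_bayes:
k_bayes = (l0 * mu) / (l1 * (1.0 - mu))
```

With an equal prior and a tenfold larger loss for missing a real signal, the decision rule rejects H0 already at a likelihood ratio of 0.1.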

A Few Slides on Multivariate Algorithms

Use of Multivariate Methods

Multivariate methods are now ubiquitous in high-energy physics. The nagging problem is that most multivariate techniques are borrowed from other fields, and they optimize some heuristic that physicists aren't interested in (like a score, or an ad hoc training error). The difference can be quite large when systematic uncertainties are taken into account.

A few recent developments:
- Evolutionary techniques
- Matrix Element techniques

[Figure: statistical uncertainty (GeV/c²) for several selections: Heuristic 11.1 ± 0.3, Binary-C 10.1 ± 0.4, Multi-class 10.0 ± 0.5, Binary-M 9.1 ± 0.4, NEAT classes 7.3 ± 0.3, NEAT features 7.1 ± 0.2; from Whiteson & Whiteson, hep-ex/0607012.]

The Neyman-Pearson Lemma

The region W that minimizes the probability of wrongly accepting H0 is just a contour of the likelihood ratio:

L(x|H1) / L(x|H0) > k_α

This is the goal! The problem is we don't have access to L(x|H0) and L(x|H1).

Matrix Element Techniques

Instead of using generic machine learning algorithms, some members of the Tevatron experiments are starting to attack this convolution numerically: the likelihood L(x|H0) is written as a phase-space integral of the squared matrix element convolved with detector transfer functions.

[Diagram: L(x|H0) decomposed into a phase-space integral, a matrix element, and transfer functions.]

Matrix Element Techniques for Theorists

A few years ago, I realized that phenomenologists doing sensitivity studies can use the Neyman-Pearson lemma directly:
- directly integrate the likelihood ratio
- model detector effects with transfer functions (numerically much easier than the experimental situation, because one generates hypothetical data)
- just as one computes a cross-section for a new signal, one can compute a maximum significance (at leading order)

Experimental (x ~ observable):

Q(x) = L(x|H1) / L(x|H0) = [ Pois(n | s + b) ∏_j^n f_{s+b}(x_j) ] / [ Pois(n | b) ∏_j^n f_b(x_j) ]

q(x) = ln Q(x) = −s + ∑_{j=1}^n ln( 1 + s f_s(x_j) / (b f_b(x_j)) )

Theoretical (r ~ phase space):

q(r) = −σ_tot,s L + ln( 1 + dσ_s(r) / dσ_b(r) )

Cranmer, Plehn. EPJ & hep-ph/0605268


Learning Machines

Examples of Learning Machines

Cuts can be viewed as learning machines:

f(x, y) = 1 if x₁ < x < x₂ and y₃ < y < y₄, 0 else

Neural nets can also be viewed as learning machines; the weights & biases make up the parameters.

[Diagram: a network with input units, hidden layers of processing units, and an output unit.]
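The rectangular cut above, written out as code (the boundary values are hypothetical):

```python
def cut_classifier(x, y, x1=-1.0, x2=1.0, y3=0.0, y4=2.0):
    """A rectangular cut viewed as a learning machine: returns 1 inside
    the box and 0 outside. The boundary values x1, x2, y3, y4 are
    hypothetical; they are the 'parameters' of this learning machine."""
    return 1 if (x1 < x < x2 and y3 < y < y4) else 0
```

The four boundaries play the same role for the cut that the weights and biases play for a neural net: they are the parameters a training procedure would adjust.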

Statistical Learning Theory & Hypothesis Testing

Limits on Risk

[Plot: the VC confidence term

√( ( h (log(2l/h) + 1) − log(η/4) ) / l )

as a function of h/l = VC dimension / sample size, for a sample size of 10,000 at 95% confidence level.]

Support Vector Machines aim to minimize the limit on risk by balancing R_emp and the complexity of the learning machine, characterized by h.
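The confidence term in the plot can be evaluated directly; a sketch with η = 0.05 for 95% confidence and hypothetical values of h and l:

```python
import math

def vc_confidence(h, l, eta=0.05):
    """VC confidence term sqrt((h (log(2l/h) + 1) - log(eta/4)) / l)
    that is added to the empirical risk in the Vapnik bound."""
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0)
                      - math.log(eta / 4.0)) / l)

# hypothetical values: VC dimension 1000, sample size 10,000
term = vc_confidence(h=1000, l=10000)
```

The term grows with h at fixed l, which is the sense in which a more complex learning machine pays a larger penalty in the risk bound.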

Some personal history

Archbishop of Canterbury Thomas Cranmer (born 1489, executed 1556) was the author of the Book of Common Prayer. Two centuries later (when this book had become an official prayer book of the Church of England), Thomas Bayes was a non-conformist minister (Presbyterian) who refused to use Cranmer's book.

VC Dimension

Importance of VC Dimension

Because we usually have an independent testing set, the limit on true risk is often not very useful in practice.

Genetic Programming

R. S. Bowman and I brought a technique called Genetic Programming to HEP: a program that actually writes programs to search for the Higgs. Comput. Phys. Commun. [physics/0402030]

The FOCUS collaboration has recently used Genetic Programming to study the doubly Cabibbo-suppressed decay D⁺ → K⁺π⁻π⁺ relative to the Cabibbo-favored D⁺ → K⁻π⁺π⁺. hep-ex/0503007

[Figures: an example program tree built from operators (XOR, AND, <=>, min, >, NOT) acting on inputs such as Iso1, Iso2, POT, OoT, σ_m; and the selected CF and DCS mass distributions, with yields 62441 ± 255 and 466 ± 36 events.]

Remaining Lectures

Lecture 3:
- The Neyman construction (illustrated)
- Inverted hypothesis tests: a dictionary for limits (intervals)
- Coverage as a calibration for our statistical device
- Compound hypotheses, nuisance parameters, & similar tests
- Systematics, systematics, systematics

Lecture 4:
- Generalizing our procedures to include systematics
- Eliminating nuisance parameters: profiling and marginalization
- Introduction to ancillary statistics & conditioning
- High-dimensional models, Markov Chain Monte Carlo, and hierarchical Bayes
- The look-elsewhere effect and false discovery rate