Statistics for Particle Physics
Kyle Cranmer, New York University
Hypothesis Testing
Hypothesis testing

One of the most common uses of statistics in particle physics is hypothesis testing.
- Assume one has the pdf for the data under two hypotheses:
  - Null hypothesis H0: e.g. background-only
  - Alternate hypothesis H1: e.g. signal-plus-background
- One makes a measurement and then needs to decide whether to reject or accept H0.

[Figure: probability vs. events observed, showing the distributions of the number of events under the two hypotheses.]
Hypothesis testing

Before we can make much progress with statistics, we need to decide what it is that we want to do. First, let us define a few terms:
- Rate of Type I error (rejecting H0 when it is true): alpha
- Rate of Type II error (accepting H0 when H1 is true): beta
- Power = 1 - beta

Treat the two hypotheses asymmetrically; the null is special.
- Fix the rate of Type I error alpha, called the size of the test.

Now one can state a well-defined goal: maximize the power for a fixed rate of Type I error.
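These definitions can be made concrete with a toy number-counting example (all numbers hypothetical): approximate the event counts under each hypothesis as Gaussians and read off size and power for a given cut.

```python
# Toy number-counting sketch of size, Type II error, and power
# (hypothetical numbers; Gaussian approximation to Poisson counts).
# H0: N ~ Gaussian(b, sqrt(b)); H1: N ~ Gaussian(s + b, sqrt(s + b)).
import math

def gauss_sf(x, mu, sigma):
    """Survival function P(X > x) for a Gaussian."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

b, s = 100.0, 30.0
cut = 120.0  # reject H0 if the observed count exceeds this cut

alpha = gauss_sf(cut, b, math.sqrt(b))               # Type I error rate (size)
beta = 1.0 - gauss_sf(cut, s + b, math.sqrt(s + b))  # Type II error rate
power = 1.0 - beta

print(f"size alpha = {alpha:.4f}, beta = {beta:.4f}, power = {power:.4f}")
```

Moving the cut trades alpha against beta; fixing alpha and then choosing the test to maximize power is the goal stated above.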
Hypothesis testing

The idea of a 5σ discovery criterion for particle physics is really a conventional way to specify the size of the test.
- Usually 5σ corresponds to alpha = 2.87 × 10⁻⁷
  - i.e. a very small chance that we wrongly reject the standard model.

In the simple case of number counting it is obvious which region is sensitive to the presence of a new signal, but in higher dimensions it is not so easy.

[Figures: the number-counting distributions under the two hypotheses; scatter plots of events in two dimensions (x1, x2); and possible decision boundaries separating the H0 and H1 regions. From G. Cowan.]
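The conversion between the 5σ convention and the test size alpha is just the one-sided upper tail of a unit Gaussian:

```python
# Converting the z-sigma discovery convention to a test size alpha:
# the one-sided upper-tail probability of a unit Gaussian beyond z.
import math

def z_to_alpha(z):
    """One-sided p-value for a z-sigma deviation."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

alpha_5sigma = z_to_alpha(5.0)
print(f"alpha(5 sigma) = {alpha_5sigma:.3e}")  # about 2.87e-7
```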
The Neyman-Pearson Lemma

In 1928-1938 Neyman & Pearson developed a theory in which one must consider competing hypotheses:
- the null hypothesis H0 (background-only)
- the alternate hypothesis H1 (signal-plus-background)

Given some probability that we wrongly reject the null hypothesis,

  alpha = P(x ∉ W | H0)

(convention: if the data fall in the acceptance region W, then we accept H0), find the region W that minimizes the probability of wrongly accepting H0 (when H1 is true),

  beta = P(x ∈ W | H1).
The Neyman-Pearson Lemma

The acceptance region W that minimizes the probability of wrongly accepting H0 is just a contour of the likelihood ratio:

  P(x|H1) / P(x|H0) > k_alpha

Any other region of the same size will have less power.

The likelihood ratio is an example of a test statistic, i.e. a real-valued function that summarizes the data in a way relevant to the hypotheses being tested.
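For a single Poisson count the contour is easy to see explicitly (a minimal sketch with hypothetical s and b): the likelihood ratio P(n|s+b)/P(n|b) = exp(-s)(1 + s/b)^n is monotonically increasing in n, so the Neyman-Pearson region {LR > k} is simply a one-sided cut {n > n_cut}.

```python
# Likelihood ratio for a single Poisson count: monotone in n, so the
# Neyman-Pearson contour {LR > k} reduces to a cut on the count itself.
import math

def poisson_pmf(n, mu):
    return math.exp(-mu) * mu**n / math.factorial(n)

b, s = 3.0, 5.0  # hypothetical expected background and signal
lr = [poisson_pmf(n, s + b) / poisson_pmf(n, b) for n in range(20)]

# the ratio increases with n, confirming the one-sided cut
assert all(lr[i] < lr[i + 1] for i in range(len(lr) - 1))
print([round(v, 3) for v in lr[:5]])
```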
A short proof of Neyman-Pearson

Consider the contour of the likelihood ratio,

  P(x|H1) / P(x|H0) > k_alpha,

that has a given size (e.g. the probability under H0 of the acceptance region W is 1 - alpha; W^C denotes its complement).
A short proof of Neyman-Pearson

Now consider a variation of the contour that has the same size (i.e. the same probability under H0): remove a region ΔW− from inside the contour and add a region ΔW+ outside it, with

  P(ΔW+ | H0) = P(ΔW− | H0).
A short proof of Neyman-Pearson

Because the added region ΔW+ lies outside the contour of the likelihood ratio, P(x|H1)/P(x|H0) < k_alpha there, and we have an inequality:

  P(ΔW+ | H1) < k_alpha P(ΔW+ | H0).
A short proof of Neyman-Pearson

And for the region we lost, where P(x|H1)/P(x|H0) > k_alpha, we also have an inequality:

  P(ΔW− | H1) > k_alpha P(ΔW− | H0).

Together they give...
A short proof of Neyman-Pearson

Combining the two inequalities with P(ΔW+ | H0) = P(ΔW− | H0):

  P(ΔW+ | H1) < k_alpha P(ΔW+ | H0) = k_alpha P(ΔW− | H0) < P(ΔW− | H1),

so

  P(ΔW+ | H1) < P(ΔW− | H1):

the new region has less power.
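The proof can be checked numerically in a simple case (a sketch with hypothetical distributions): with H0 ~ N(0,1) and H1 ~ N(2,1) the likelihood ratio is monotone in x, so the Neyman-Pearson region of size alpha is {x > z_cut}; any other region with the same probability under H0, e.g. an interval starting lower down, has strictly less power.

```python
# Numeric check: the NP region {x > z_cut} beats an equal-size interval.
# H0 ~ N(0,1), H1 ~ N(2,1); both choices of alpha and interval are hypothetical.
import math

def gauss_sf(x, mu=0.0, sigma=1.0):
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

alpha = 0.05
z_cut = 1.6449  # approx. upper 5% point of the unit Gaussian

power_np = gauss_sf(z_cut, mu=2.0)  # power of the NP region {x > z_cut}

# alternative region: interval [a, b] with the same probability under H0;
# find b by bisection so that P(a < x < b | H0) = alpha
a = 1.0
lo, hi = a, 10.0
for _ in range(60):
    b = 0.5 * (lo + hi)
    if gauss_sf(a) - gauss_sf(b) < alpha:
        lo = b  # interval too small: extend upward
    else:
        hi = b
power_alt = gauss_sf(a, mu=2.0) - gauss_sf(b, mu=2.0)

assert abs((gauss_sf(a) - gauss_sf(b)) - alpha) < 1e-9  # same size under H0
assert power_alt < power_np                              # less power under H1
print(f"NP power = {power_np:.3f}, alternative power = {power_alt:.3f}")
```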
Decision Theory

One of the deficiencies of the Neyman-Pearson approach is that one must specify the size alpha of the test. But where does alpha come from? Is it purely conventional, or is there a reason?

A great deal of literature related to statistics (and economics, etc.) is devoted to making decisions; one needs to consider the utility or risk of different outcomes. In the context of decision and utility theory there can be a justification for a particular alpha, but this is rarely done in particle physics.
An explicit likelihood ratio

At LEP, the Higgs search used an explicit likelihood ratio over N_chan channels, each with n_i events and a discriminating variable x_ij per event:

  Q = L(x|H1) / L(x|H0)
    = prod_i^{N_chan} [ Pois(n_i | s_i + b_i) prod_j^{n_i} (s_i f_s(x_ij) + b_i f_b(x_ij)) / (s_i + b_i) ]
      / [ Pois(n_i | b_i) prod_j^{n_i} f_b(x_ij) ]

  q = ln Q = -s_tot + sum_i^{N_chan} sum_j^{n_i} ln( 1 + s_i f_s(x_ij) / (b_i f_b(x_ij)) )

[Figures: (a) the distribution of -2 ln Q at m_H = 115 GeV/c², LEP observed vs. expected for background and for signal plus background; (b) -2 ln Q as a function of m_H from 106 to 120 GeV/c².]
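A toy version of this statistic for a single channel (all shapes and numbers hypothetical): with a flat background pdf and a rising signal pdf on [0, 1], the sum over events rewards events that land where f_s/f_b is large.

```python
# Toy single-channel LEP-style test statistic:
# q = ln Q = -s + sum_j ln(1 + s f_s(x_j) / (b f_b(x_j)))
# with hypothetical pdf shapes on [0, 1].
import math

def f_b(x):
    return 1.0          # background pdf (assumed flat)

def f_s(x):
    return 2.0 * x      # signal pdf (assumed rising; integral of 2x on [0,1] is 1)

def lnQ(events, s, b):
    return -s + sum(math.log(1.0 + s * f_s(x) / (b * f_b(x))) for x in events)

s, b = 3.0, 10.0
signal_like = [0.9, 0.8, 0.95]  # events at high x, where f_s/f_b is large
bkg_like = [0.1, 0.2, 0.15]     # events at low x

print(f"signal-like q = {lnQ(signal_like, s, b):.3f}")
print(f"background-like q = {lnQ(bkg_like, s, b):.3f}")
```

A signal-like experiment yields a larger q than a background-like one, which is exactly what the observed vs. expected -2 ln Q comparison exploits.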
Kernel Estimation

Kernel estimation is the generalization of average shifted histograms; the data is the model:

  f̂(x) = sum_i 1/(n h(x_i)) K( (x - x_i) / h(x_i) )

with adaptive bandwidth

  h(x_i) = (4/3)^{1/5} sqrt( sigma / f̂0(x_i) ) n^{-1/5}

Adaptive kernel estimation puts wider kernels in regions of low probability. Used at LEP for describing pdfs from Monte Carlo (KEYS).

K. Cranmer, Comput. Phys. Commun. 136 (2001) [hep-ex/0011057]

[Figure: probability density of a neural network output, data points overlaid with the kernel estimate.]
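A minimal sketch of the two-pass adaptive idea (not the KEYS implementation; the normalization of the per-point bandwidths by the geometric mean of the pilot estimate is one common choice and an assumption here): a fixed-bandwidth pilot pass f̂0 sets per-point bandwidths that widen where f̂0 is small.

```python
# Sketch of adaptive kernel density estimation: a fixed-bandwidth pilot
# pass sets per-point bandwidths, wider where the pilot density is low.
import math

def gauss_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def pilot_kde(x, data, h0):
    """First pass: fixed-bandwidth estimate f0."""
    return sum(gauss_kernel((x - xi) / h0) for xi in data) / (len(data) * h0)

def adaptive_bandwidths(data):
    """Per-point bandwidths h_i = h0 * sqrt(gmean(f0) / f0(x_i))."""
    n = len(data)
    mean = sum(data) / n
    sigma = math.sqrt(sum((xi - mean) ** 2 for xi in data) / n)
    h0 = (4.0 / 3.0) ** 0.2 * sigma * n ** -0.2  # rule-of-thumb bandwidth
    f0 = [pilot_kde(xi, data, h0) for xi in data]
    gmean = math.exp(sum(math.log(f) for f in f0) / n)
    return [h0 * math.sqrt(gmean / f) for f in f0]

def adaptive_kde(x, data, h):
    """Second pass: kernel sum with per-point bandwidths."""
    return sum(gauss_kernel((x - xi) / hi) / hi
               for xi, hi in zip(data, h)) / len(data)
```

Because each term is a normalized Gaussian, the resulting estimate still integrates to one, while isolated points in sparse regions get smoother, wider kernels.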
2 discriminating variables

Often one uses the output of a neural network or multivariate algorithm in place of a true likelihood ratio. That's fine, but what do you do with it? If you have a fixed cut for all events, this is what you are doing:

  q = ln Q = -s + ln( 1 + s f_s(x, y) / (b f_b(x, y)) )

[Figure: event distributions in the discriminating variables (x1, y1) and (x2, y2), and the resulting distribution of q.]
Experiments vs. Events

Ideally, you want to cut on the likelihood ratio for your experiment, which is equivalent to cutting on a sum of log likelihood ratios over the events:

  q_12 = q_1 + q_2

It is easy to see that this includes experiments where one event had a high likelihood ratio and the other one was relatively small.

[Figure: distributions of q_1, q_2 for two events and of the combined q_12.]
Decision Theory

[Figure: from Fred James' lectures, illustrating the trade-off between alpha and beta.]
Decisions: Bayesian & Frequentist

The structure of P(x|H0) and P(x|H1) puts limits on the allowable ranges of alpha and beta.
- Bayesians want to minimize the expected risk, based on priors and the risk/utility of the outcomes.
- Frequentists don't have priors to work with, so they only have risk/utility in two situations.
  - The minimax approach aims to minimize the maximum risk: the most conservative choice, "paranoid" for games against nature (F. James, Ch. 6).

A frequentist choice of alpha, interpreted in a Bayesian framework with prior mu = P(H0) and losses l_0, l_1, implies accepting H0 when

  l_1 (1 - mu) P(X|H1) < l_0 mu P(X|H0),

i.e. a cut on the likelihood ratio at

  P(X|H1) / P(X|H0) < l_0 mu / (l_1 (1 - mu)).
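The implied cut value is a one-liner; the numbers below are purely illustrative.

```python
# Decision-theoretic likelihood-ratio threshold implied by a prior
# mu = P(H0) and losses l0 (wrongly rejecting H0) and l1 (wrongly
# accepting H0): accept H1 when P(x|H1)/P(x|H0) > l0*mu / (l1*(1-mu)).
def bayes_threshold(mu, l0, l1):
    return (l0 * mu) / (l1 * (1.0 - mu))

# symmetric losses and equal priors -> cut at LR = 1
print(bayes_threshold(0.5, 1.0, 1.0))
# strong prior for H0 and costly false discovery -> a much larger cut,
# qualitatively like the stringent 5-sigma convention
print(bayes_threshold(0.99, 10.0, 1.0))
```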
A Few Slides on Multivariate Algorithms
Use of Multivariate Methods

Multivariate methods are now ubiquitous in high-energy physics. The nagging problem is that most multivariate techniques are borrowed from other fields, and they optimize some heuristic that physicists aren't interested in (like a score, or ad hoc training error); the difference can be quite large when systematic uncertainties are taken into account.

A few recent developments:
- Evolutionary techniques
- Matrix element techniques

[Figure: statistical uncertainty in GeV/c² by method, from Whiteson & Whiteson, hep-ex/0607012: Heuristic 11.1 ± 0.3, Binary-C 10.1 ± 0.4, Multi-class 10.0 ± 0.5, Binary-M 9.1 ± 0.4, NEAT classes 7.3 ± 0.3, NEAT features 7.1 ± 0.2.]
The Neyman-Pearson Lemma

The region W that minimizes the probability of wrongly accepting H0 is just a contour of the likelihood ratio:

  L(x|H1) / L(x|H0) > k_alpha

This is the goal! The problem is that we don't have access to L(x|H0) and L(x|H1).
Matrix Element Techniques

Instead of using generic machine learning algorithms, some members of the Tevatron experiments are starting to attack this convolution numerically:

  L(x|H0) = [phase-space integral] of [matrix element] × [transfer functions]

i.e. the likelihood is built by integrating the squared matrix element over phase space, convolved with detector transfer functions.
Matrix Element Techniques for Theorists

A few years ago, I realized that phenomenologists doing sensitivity studies can use the Neyman-Pearson lemma directly:
- directly integrate the likelihood ratio
- model detector effects with transfer functions
  - numerically much easier than the experimental situation, because one generates hypothetical data
- just as one computes a cross-section for a new signal, one can compute a maximum significance (at leading order)

Experimental (x ~ observables):

  Q(x) = L(x|H1) / L(x|H0) = [ Pois(n | s + b) prod_j^n f_{s+b}(x_j) ] / [ Pois(n | b) prod_j^n f_b(x_j) ]

  q(x) = ln Q(x) = -s + sum_{j=1}^n ln( 1 + s f_s(x_j) / (b f_b(x_j)) )

Theoretical (r ~ phase space):

  q(r) = -sigma_{tot,s} L + ln( 1 + dsigma_s(r) / dsigma_b(r) )

Cranmer & Plehn, Eur. Phys. J. C [hep-ph/0605268]
Learning Machines
Examples of Learning Machines

Cuts can be viewed as learning machines:

  f(x, y) = 1 if x1 < x < x2 and y3 < y < y4, 0 else

Neural nets can be viewed as learning machines: input units feed hidden layers of processing units, which feed an output unit; the weights and biases make up the parameters.

[Figure: a rectangular cut in the (x, y) plane, and a feed-forward neural network diagram with input units, hidden layers, and an output unit.]
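Both machines from the slide can be written as tiny functions (all parameter values hypothetical): the cut's parameters are the box edges, the network's parameters are its weights and biases.

```python
# Two "learning machines": a rectangular cut and a one-hidden-layer
# neural network; in both cases the parameters are what gets tuned.
import math

def cut_classifier(x, y, x1=0.0, x2=1.0, y3=0.0, y4=1.0):
    """f = 1 if x1 < x < x2 and y3 < y < y4, else 0."""
    return 1 if (x1 < x < x2 and y3 < y < y4) else 0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def nn_classifier(inputs, w_hidden, b_hidden, w_out, b_out):
    """Forward pass: inputs -> hidden layer -> single output unit."""
    hidden = [sigmoid(sum(w * v for w, v in zip(ws, inputs)) + b)
              for ws, b in zip(w_hidden, b_hidden)]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)

print(cut_classifier(0.5, 0.5))  # inside the box -> 1
print(cut_classifier(1.5, 0.5))  # outside the box -> 0
out = nn_classifier([0.2, 0.7],
                    [[1.0, -1.0], [0.5, 0.5]],  # hidden weights (toy values)
                    [0.0, -0.2],                # hidden biases
                    [2.0, -1.0], 0.1)           # output weights and bias
print(f"nn output = {out:.3f}")
```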
Statistical Learning Theory & Hypothesis Testing
Limits on Risk

The VC confidence term in the bound on the true risk is

  sqrt( ( h (log(2l/h) + 1) - log(eta/4) ) / l )

[Figure: VC confidence vs. h/l = VC dimension / sample size, for a sample size of 10,000 at 95% confidence level.]

Support Vector Machines aim to minimize the limit on risk by balancing R_emp and the complexity of the learning machine, characterized by h.
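The VC confidence term is easy to evaluate directly (a sketch reproducing the plot's settings, taking the logarithms as natural logs): sample size l = 10,000 and 95% confidence, i.e. eta = 0.05.

```python
# VC confidence term sqrt((h*(ln(2l/h) + 1) - ln(eta/4)) / l),
# evaluated for l = 10,000 at 95% confidence (eta = 0.05).
import math

def vc_confidence(h, l, eta):
    return math.sqrt((h * (math.log(2.0 * l / h) + 1.0)
                      - math.log(eta / 4.0)) / l)

l, eta = 10_000, 0.05
for ratio in (0.1, 0.5, 1.0):
    h = ratio * l
    print(f"h/l = {ratio:.1f}: VC confidence = {vc_confidence(h, l, eta):.3f}")
```

The term grows monotonically with h/l, which is why minimizing the bound pushes toward learning machines of lower VC dimension for a fixed sample size.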
Some personal history

Archbishop of Canterbury Thomas Cranmer (born 1489, executed 1556) was the author of the Book of Common Prayer. Two centuries later (when this book had become an official prayer book of the Church of England), Thomas Bayes was a non-conformist minister (Presbyterian) who refused to use Cranmer's book.

Kyle Cranmer (NYU), CERN Academic Training, Feb 2-5, 2009
VC Dimension
Importance of VC Dimension

Because we usually have an independent testing set, the limit on the true risk is often not very useful in practice.
Genetic Programming

R. S. Bowman and I brought a technique called Genetic Programming to HEP: it's a program that actually writes programs to search for the Higgs. Comput. Phys. Commun. [physics/0402030]

The FOCUS collaboration has recently used Genetic Programming to study the doubly Cabibbo suppressed decay D+ → K+ pi+ pi- relative to the Cabibbo favored D+ → K- pi+ pi+. [hep-ex/0503007]

[Figures: an evolved expression tree built from operators (XOR, AND, <=>, min, >, NOT) and kinematic inputs (Iso1, Iso2, POT, sigma_m, p, OoT); and the selected (a) CF and (b) DCS mass peaks, with yields 62441 ± 255 and 466 ± 36 events respectively.]
Remaining Lectures

Lecture 3:
- The Neyman construction (illustrated)
- Inverted hypothesis tests: a dictionary for limits (intervals)
- Coverage as a calibration for our statistical device
- Compound hypotheses, nuisance parameters, & similar tests
- Systematics, systematics, systematics

Lecture 4:
- Generalizing our procedures to include systematics
- Eliminating nuisance parameters: profiling and marginalization
- Introduction to ancillary statistics & conditioning
- High-dimensional models, Markov Chain Monte Carlo, and hierarchical Bayes
- The look-elsewhere effect and false discovery rate