Statistical Methods for LHC Physics

RHUL Physics, www.pp.rhul.ac.uk/~cowan. Particle Physics Seminar, University of Glasgow, 26 February 2009. 1

Outline: quick overview of physics at the Large Hadron Collider (LHC); new multivariate methods for event selection (boosted decision trees, support vector machines); some thoughts on Bayesian methods; statistical combination of search channels; outlook for data analysis at the LHC. 2

Data analysis at the LHC. The LHC experiments are expensive, ~ $10^10 (accelerator and experiments), the competition is intense (ATLAS vs. CMS, and LHC vs. the Tevatron) and the stakes are high: a 4 sigma effect is not a 5 sigma effect. So there is a strong motivation to extract all possible information from the data. 3

The Large Hadron Collider. Counter-rotating proton beams in a 27 km circumference ring; pp centre-of-mass energy 14 TeV (10 TeV for 2009/10). Detectors at 4 pp collision points: ATLAS and CMS (general purpose), LHCb (b physics), ALICE (heavy ion physics). 4

The ATLAS detector: 2100 physicists, 37 countries, 167 universities/labs; 25 m diameter, 46 m length, 7000 tonnes, ~10^8 electronic channels. 5

The Standard Model of particle physics: matter... + gauge bosons... photon (γ), W±, Z, gluon (g) + relativity + quantum mechanics + symmetries... = Standard Model. 25 free parameters (masses, coupling strengths, ...). Includes the Higgs boson (not yet seen). Almost certainly incomplete (e.g. no gravity). Agrees with all experimental observations so far. Many candidate extensions to the SM (supersymmetry, extra dimensions, ...). 6

A simulated SUSY event in ATLAS: high pT jets of hadrons, high pT muons, missing transverse energy. 7

Background events. This event from Standard Model ttbar production also has high pT jets and muons, and some missing transverse energy, so it can easily mimic a SUSY event. 8

LHC data. At the LHC there are ~10^9 pp collision events per second, mostly uninteresting, so we do quick sifting and record ~200 events/sec, with a single event ~ 1 Mbyte. With 1 year ≈ 10^7 s, that gives ~10^16 pp collisions / year and ~2×10^9 events recorded / year (~2 Pbyte / year). For new/rare processes the rates at the LHC can be vanishingly small, e.g. the number of Higgs bosons detectable per year could be ~10^3: a 'needle in a haystack'. For Standard Model and (many) non-SM processes we can generate simulated data with Monte Carlo programs (including simulation of the detector). 9
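
The quoted rates are order-of-magnitude arithmetic; a quick check of the slide's numbers (values taken from the slide, rounded):

```python
collision_rate = 1e9        # pp collision events per second
record_rate    = 200        # events recorded per second after the trigger
event_size_B   = 1e6        # ~1 Mbyte per recorded event
seconds_per_yr = 1e7        # effective running time in one year

print(f"collisions / year : {collision_rate * seconds_per_yr:.0e}")                      # ~1e16
print(f"events recorded   : {record_rate * seconds_per_yr:.0e}")                         # ~2e9
print(f"data volume (PB)  : {record_rate * seconds_per_yr * event_size_B / 1e15:.1f}")   # ~2
```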

A simulated event: PYTHIA Monte Carlo, pp → gluino gluino, ... 10

Multivariate event selection. Suppose for each event we measure a set of numbers x = (x1, ..., xn), e.g. x1 = jet pT, x2 = missing energy, x3 = particle i.d. measure, ... x follows some n-dimensional joint probability density, which depends on the type of event produced, i.e., was it pp → ttbar, pp → gluino gluino, ... E.g. hypotheses (class labels) H0, H1, ..., often simply 'signal' and 'background', with densities p(x|H0), p(x|H1). We want to separate (classify) the event types in a way that exploits the information carried in many variables. 11

Finding an optimal decision boundary. Maybe select events with cuts: xi < ci and xj < cj. Or maybe use some other type of decision boundary (linear or nonlinear) dividing the space into H0 and H1 regions. The goal of multivariate analysis is to do this in an optimal way. 12

Test statistics. The decision boundary is a surface in the n-dimensional space of input variables, e.g., y(x) = const. We can treat y(x) as a scalar test statistic or discriminating function, and try to define this function so that its distributions f(y|H0), f(y|H1) have the maximum possible separation between the event types. The decision boundary is now effectively a single cut y_cut on y(x), dividing x-space into two regions: R0 (accept H0) and R1 (reject H0). 13

Constructing a test statistic. The Neyman-Pearson lemma states: to obtain the highest background rejection for a given signal efficiency (highest power for a given significance level), choose the acceptance region for signal such that p(x|s) / p(x|b) > c, where c is a constant that determines the signal efficiency. Equivalently, the optimal discriminating function is given by the likelihood ratio y(x) = p(x|s) / p(x|b). N.B. any monotonic function of this is just as good. 14
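
As a small illustration (my own, not from the talk): if the class densities p(x|s) and p(x|b) were known explicitly, say two multivariate Gaussians with assumed parameters, the optimal statistic is simply their ratio:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical 2D class densities with assumed (known) parameters, for illustration only.
p_s = multivariate_normal(mean=[1.0, 1.0], cov=[[1.0, 0.3], [0.3, 1.0]])
p_b = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, -0.2], [-0.2, 1.0]])

def likelihood_ratio(x):
    """Optimal discriminant y(x) = p(x|s) / p(x|b) (Neyman-Pearson)."""
    return p_s.pdf(x) / p_b.pdf(x)

# Selecting events with y(x) > c gives the highest power for a given significance level;
# the constant c fixes the signal efficiency (working point).
x = np.array([0.8, 1.2])
print(likelihood_ratio(x))
```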

Neyman-Pearson doesn't always help. The problem is that we usually don't have explicit formulae for the pdfs p(x|s), p(x|b), so for a given x we can't evaluate the likelihood ratio. Instead we have Monte Carlo models for signal and background processes, so we can produce simulated training data of known type: generate x ~ p(x|s) giving x1, ..., x_Ns, and x ~ p(x|b) giving x1, ..., x_Nb. Naive try: enter each (s, b) event into an n-dimensional histogram, using e.g. M bins for each of the n dimensions, a total of M^n cells. Since n is potentially large, this gives a prohibitively large number of cells to populate; we can't generate enough training data. 15

General considerations. In all multivariate analyses we must consider, e.g., the choice of variables to use, the functional form of the decision boundary (type of classifier), computational issues, the trade-off between sensitivity and complexity, and the trade-off between statistical and systematic uncertainty. Our choices can depend on the goals of the analysis, e.g., event selection for further study, or searches for new event types. 16

Decision boundary flexibility. The decision boundary will be defined by some free parameters that we adjust using training data (of known type) to achieve the best separation between the event types. The goal is to determine the boundary using a finite amount of training data so as to best separate the event types for an unseen data sample. (Illustrations: boundary too rigid, good trade-off, overtraining.) 17

Some standard multivariate methods. Cuts on individual variables: simple, intuitive, in general not optimal. Linear discriminant (e.g. Fisher): simple, optimal if the event types are Gaussian distributed with equal covariance, otherwise not optimal. Probability density estimation based methods: try to estimate p(x|s), p(x|b), then use y(x) = p(x|s) / p(x|b); in principle best, but difficult to estimate p(x) for high dimension. Neural networks: can produce an arbitrary decision boundary (in principle optimal), but can be difficult to train and the result is non-intuitive. 18

Decision trees. In a decision tree repeated cuts are made on a single variable until some stop criterion is reached. The decision as to which variable is used is based on the best achieved improvement in signal purity, P = Σ_signal w_i / (Σ_signal w_i + Σ_background w_i), where w_i is the weight of the ith event. Iterate until the stop criterion is reached, based e.g. on purity and the minimum number of events in a node. Example by the MiniBooNE experiment, B. Roe et al., NIM A 543 (2005) 577. 19
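
A minimal sketch of the weighted-purity criterion for choosing a cut on a single variable (my own illustration; real BDT implementations typically optimise a gain criterion such as the Gini index and also impose a minimum number of events per node):

```python
import numpy as np

def purity(w_sig, w_bkg):
    """Signal purity P = sum_s w_i / (sum_s w_i + sum_b w_i)."""
    s, b = np.sum(w_sig), np.sum(w_bkg)
    return s / (s + b) if s + b > 0 else 0.0

def best_cut(x_sig, w_sig, x_bkg, w_bkg, cuts):
    """Scan candidate cut values on one variable and return the cut whose
    'signal-like' node (x > cut) has the largest weighted purity."""
    return max(cuts, key=lambda c: purity(w_sig[x_sig > c], w_bkg[x_bkg > c]))

rng = np.random.default_rng(1)
x_s, x_b = rng.normal(1.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)
w_s, w_b = np.ones(1000), np.ones(1000)
print(best_cut(x_s, w_s, x_b, w_b, cuts=np.linspace(-2.0, 3.0, 51)))
```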

Decision trees (2). The terminal nodes (leaves) are classified as signal or background depending on a majority vote (or e.g. signal fraction greater than a specified threshold). This classifies every point in input variable space as either signal or background: a decision tree classifier, with discriminant function f(x) = +1 if x is in a signal region, −1 otherwise. Decision trees tend to be very sensitive to statistical fluctuations in the training sample. Methods such as boosting can be used to stabilize the tree. 20

Boosting. Boosting is a general method of creating a set of classifiers which can be combined to achieve a new classifier that is more stable and has a smaller error than any individual one. It is often applied to decision trees, but can be applied to any classifier. Suppose we have a training sample T consisting of N events with x1, ..., xN event data vectors (each x multivariate), y1, ..., yN true class labels (+1 for signal, −1 for background), and w1, ..., wN event weights. Now define a rule to create from this an ensemble of training samples T1, T2, ..., derive a classifier from each and average them. 21

AdaBoost. A successful boosting algorithm is AdaBoost (Freund & Schapire, 1997). First initialize the training sample T1 using the original x1, ..., xN event data vectors, y1, ..., yN true class labels (+1 or −1), and w1^(1), ..., wN^(1) event weights, with the weights equal and normalized such that Σ_{i=1}^N w_i^(1) = 1. Train the classifier f1(x) (e.g. a decision tree) using the weights w^(1) so as to minimize the classification error rate, ε_1 = Σ_{i=1}^N w_i^(1) I(y_i f_1(x_i) ≤ 0), where I(X) = 1 if X is true and zero otherwise. 22

Updating the event weights (AdaBoost). Assign a score to the kth classifier based on its error rate, α_k = ln((1 − ε_k)/ε_k). Define the training sample for step k+1 from that of step k by updating the event weights according to w_i^(k+1) = w_i^(k) e^(−α_k f_k(x_i) y_i / 2) / Z_k, where i is the event index and k the training sample index, and Z_k normalizes so that Σ_i w_i^(k+1) = 1. Iterate K times; the final classifier is f(x) = Σ_{k=1}^K α_k f_k(x, T_k). 23
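
A compact sketch of the AdaBoost recipe above, using decision stumps from scikit-learn as the weak classifiers f_k(x); the library choice and the toy data are mine, while the weight update and the score α_k follow the formulas on the slide:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, K=50):
    """y must be +-1. Returns a list of (alpha_k, tree_k); f(x) = sum_k alpha_k f_k(x)."""
    N = len(y)
    w = np.full(N, 1.0 / N)                         # w_i^(1): equal, normalised weights
    ensemble = []
    for _ in range(K):
        tree = DecisionTreeClassifier(max_depth=1)  # weak learner (decision stump)
        tree.fit(X, y, sample_weight=w)
        f = tree.predict(X)
        eps = np.sum(w[y * f <= 0])                 # weighted misclassification rate
        if eps <= 0.0 or eps >= 0.5:                # stop if perfect or no better than chance
            break
        alpha = np.log((1.0 - eps) / eps)           # score alpha_k = ln((1 - eps_k)/eps_k)
        w = w * np.exp(-alpha * f * y / 2.0)        # weight update from the slide
        w /= w.sum()                                # the Z_k normalisation
        ensemble.append((alpha, tree))
    return ensemble

def boosted_output(ensemble, X):
    return sum(a * t.predict(X) for a, t in ensemble)

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3)); X[:1000] += 0.7     # toy signal (shifted) and background
y = np.r_[np.ones(1000), -np.ones(1000)]
model = adaboost(X, y, K=20)
print(np.mean(np.sign(boosted_output(model, X)) == y))   # training accuracy
```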

BDT example from MiniBooNE: ~200 input variables for each event (ν interaction producing e, μ or π). Each individual tree is relatively weak, with a misclassification error rate ~ 0.4 to 0.45. B. Roe et al., NIM A 543 (2005) 577. 24

Monitoring overtraining From MiniBooNE example 25

Boosted decision tree summary. An advantage of the boosted decision tree is that it can handle a large number of inputs: those that provide little/no separation are rarely used as tree splitters and are effectively ignored. It is easy to deal with inputs of mixed types (real, integer, categorical, ...). If a tree has only a few leaves it is easy to visualize (but one rarely uses only a single tree). There are a number of boosting algorithms, which differ primarily in the rule for updating the weights (AdaBoost, ε-Boost, LogitBoost, ...). Other ways of combining weaker classifiers: Bagging (Bootstrap Aggregating) generates the ensemble of classifiers by random sampling with replacement from the full training sample. 26

Support Vector Machines. Support Vector Machines (SVMs) are an example of a kernel-based classifier, which exploits a nonlinear mapping of the input variables onto a higher-dimensional feature space. The SVM finds a linear decision boundary in the higher-dimensional space; but thanks to the kernel trick one does not even have to write down explicitly the feature space transformation. Some references for kernel methods and SVMs: the books mentioned in www.pp.rhul.ac.uk/~cowan/mainz_lectures.html; C. Burges, A Tutorial on Support Vector Machines for Pattern Recognition, research.microsoft.com/~cburges/papers/svmtutorial.pdf; N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press, 2000; the TMVA manual (!). 27

Linear SVMs. Consider a training data set consisting of x1, ..., xN event data vectors and y1, ..., yN true class labels (+1 or −1). Suppose the classes can be separated by a hyperplane defined by a normal vector w and scalar offset b (the "bias"). We have x_i·w + b ≥ +1 for all y_i = +1, and x_i·w + b ≤ −1 for all y_i = −1, or equivalently y_i (x_i·w + b) − 1 ≥ 0 for all i. (Bishop Ch. 7) 28

Margin and support vectors. The distance between the hyperplanes defined by y(x) = x·w + b = +1 and y(x) = −1 is called the margin, which is margin = 2/|w|. If the training data are perfectly separated then there are no points inside the margin. Suppose there are points on the margin (this is equivalent to defining the scale of w); these points are called support vectors. 29

Linear SVM classifier. We can define the classifier using y(x) = x·w + b, which is > 0 for points on one side of the hyperplane and < 0 on the other. The best classifier should have a large margin, so to maximize margin = 2/|w| we can minimize |w|^2 subject to the constraints y_i (x_i·w + b) − 1 ≥ 0 for all i. 30

Lagrangian formulation. This constrained minimization problem can be reformulated using a Lagrangian L = (1/2)|w|^2 − Σ_{i=1}^N α_i [y_i (x_i·w + b) − 1], with positive Lagrange multipliers α_i. We need to minimize L with respect to w and b and maximize it with respect to the α_i. There is an α_i for every training point; those that lie on the margin (the support vectors) have α_i > 0, all others have α_i = 0. The solution can be written w = Σ_i α_i y_i x_i (the sum only contains support vectors). 31

Dual formulation. The classifier output is thus y(x) = x·w + b = Σ_i α_i y_i x_i·x + b. It can be shown that one finds the same solution α by maximizing the dual Lagrangian L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i·x_j. So both the classifier function and the Lagrangian only involve dot products of vectors in the input variable space. 32

Nonseparable data. If the training data points cannot be separated by a hyperplane, one can redefine the constraints by adding slack variables ξ_i: y_i (x_i·w + b) + ξ_i − 1 ≥ 0 with ξ_i ≥ 0 for all i. Thus the training point x_i is allowed to be up to a distance ξ_i on the wrong side of the boundary, with ξ_i = 0 at or on the right side of the boundary. For an error to occur we have ξ_i > 1, so Σ_i ξ_i is an upper bound on the number of training errors. 33

Cost function for the nonseparable case. To limit the magnitudes of the ξ_i we can define the error function that we minimize to determine w to be E(w) = (1/2)|w|^2 + C Σ_i ξ_i^k, where C is a cost parameter we must choose that limits the amount of misclassification. It turns out that for k = 1 or 2 this is a quadratic programming problem, and furthermore for k = 1 it corresponds to maximizing the same dual Lagrangian L_D = Σ_i α_i − (1/2) Σ_{i,j} α_i α_j y_i y_j x_i·x_j, where the constraints on the α_i become 0 ≤ α_i ≤ C. 34

Nonlinear SVM. So far we have only reformulated a way to determine a linear classifier, which we know is useful only in limited circumstances. But the important extension to nonlinear classifiers comes from first transforming the input variables to feature space: φ(x) = (φ_1(x), ..., φ_m(x)). These will behave just as our new input variables. Everything about the mathematical formulation of the SVM will look the same as before, except with φ(x) appearing in the place of x. 35

Only dot products. Recall the SVM problem was formulated entirely in terms of dot products of the input variables; e.g., the classifier output is y(x) = Σ_i α_i y_i x_i·x + b, so in the feature space this becomes y(x) = Σ_i α_i y_i φ(x_i)·φ(x) + b. 36

The kernel trick. How do the dot products help? It turns out that a broad class of kernel functions can be written in the form K(x, x') = φ(x)·φ(x'). Functions having this property must satisfy Mercer's condition: ∫ K(x, x') g(x) g(x') dx dx' ≥ 0 for any function g where ∫ g^2(x) dx is finite. So we don't even need to find explicitly the feature space transformation φ(x); we only need a kernel. 37

Finding kernels. There are a number of techniques for finding kernels, e.g., constructing new ones from known ones according to certain rules (cf. Bishop Ch. 6). Frequently used kernels to construct classifiers are, e.g., K(x, x') = (x·x' + θ)^p (polynomial), K(x, x') = exp(−|x − x'|^2 / 2σ^2) (Gaussian), K(x, x') = tanh(κ x·x' + θ) (sigmoidal). 38
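
The same three kernels written as plain functions, together with a finite-sample stand-in for Mercer's condition: the Gram matrix of the Gaussian kernel on a random point set should have no negative eigenvalues. The parameter names (θ, p, σ, κ) are as above; the numerical values are arbitrary choices of mine:

```python
import numpy as np

def poly_kernel(x, xp, theta=1.0, p=2):
    return (np.dot(x, xp) + theta) ** p

def gauss_kernel(x, xp, sigma=1.0):
    return np.exp(-np.sum((x - xp) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, xp, kappa=1.0, theta=-1.0):
    return np.tanh(kappa * np.dot(x, xp) + theta)

# Gram matrix of the Gaussian kernel on random points: eigenvalues should all be >= 0
# (up to numerical precision), a finite-sample check in the spirit of Mercer's condition.
pts = np.random.default_rng(0).normal(size=(20, 3))
K = np.array([[gauss_kernel(a, b) for b in pts] for a in pts])
print(np.linalg.eigvalsh(K).min() >= -1e-10)
```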

Using an SVM. To use an SVM the user must as a minimum choose: a kernel function (e.g. Gaussian); any free parameters in the kernel (e.g. the σ of the Gaussian); and the cost parameter C (plays the role of a regularization parameter). The training is relatively straightforward because, in contrast to neural networks, the function to be minimized has a single global minimum. Furthermore, evaluating the classifier only requires that one retain and sum over the support vectors, a relatively small number of points. 39
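
In practice one rarely solves the quadratic programme by hand. A sketch using scikit-learn's SVC with a Gaussian (RBF) kernel, where the kernel width and the cost parameter C are exactly the user choices listed above; the toy data set and the library choice are mine, not from the talk:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X_sig = rng.normal(loc=1.0, scale=1.0, size=(500, 3))
X_bkg = rng.normal(loc=0.0, scale=1.5, size=(500, 3))
X = np.vstack([X_sig, X_bkg])
y = np.hstack([np.ones(500), -np.ones(500)])     # +1 signal, -1 background

clf = SVC(kernel="rbf", gamma=0.5, C=1.0)        # gamma = 1/(2 sigma^2); C = cost parameter
clf.fit(X, y)

# Only the support vectors are retained and summed over when classifying:
print(len(clf.support_), "support vectors out of", len(X), "training points")
print(clf.decision_function(X[:3]))              # y(x) before taking the sign
```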

SVM in HEP. SVMs are very popular in the Machine Learning community but have yet to find wide application in HEP. Here is an early example from a CDF top quark analysis (A. Vaiciulis, contribution to PHYSTAT02). (Figure shows signal efficiency.) 40

Software for multivariate analysis. TMVA (Höcker, Stelzer, Tegenfeldt, Voss, Voss, physics/0703039): from tmva.sourceforge.net, also distributed with ROOT; variety of classifiers; good manual. StatPatternRecognition (I. Narsky, physics/0507143): further info from www.hep.caltech.edu/~narsky/spr.html; also a wide variety of methods, many complementary to TMVA; currently the project appears to be no longer supported. 43

Bayesian vs. frequentist methods. Two schools of statistics use different interpretations of probability. I. Relative frequency (frequentist statistics): probability is the limiting frequency of an outcome in repeatable observations. II. Subjective probability (Bayesian statistics): probability is a degree of belief that a hypothesis is true. In particle physics the frequency interpretation is most used, but subjective probability can be more natural for non-repeatable phenomena: systematic uncertainties, the probability that the Higgs boson exists, ... 45

Statistical vs. systematic errors. Statistical errors: how much would the result fluctuate upon repetition of the measurement? This implies some set of assumptions to define the probability of the outcome of the measurement. Systematic errors: what is the uncertainty in my result due to uncertainty in my assumptions, e.g., model (theoretical) uncertainty, modelling of the measurement apparatus? These sources of error do not vary upon repetition of the measurement; they often result from the uncertain values of, e.g., calibration constants, efficiencies, etc. 48

Example: posterior pdf from MCMC. Sample the posterior pdf from the previous example with MCMC, and summarize the pdf of the parameter of interest with, e.g., mean, median, standard deviation, etc. Although the numerical values of the answer here are the same as in the frequentist case, the interpretation is different (sometimes unimportant?). 56

Statistical combination of Higgs channels in ATLAS. The status of the Higgs search was recently published in the ATLAS "CSC Book", Expected Performance of the ATLAS Experiment: Detector, Trigger and Physics, arXiv:0901.0512, CERN-OPEN-2008-020. Several statistical methods are under study for the combination of channels; here I show the method using the profile likelihood. For now, concentrate on combining the "discovery modes": H → γγ; H → W+W− → eνµν; H → ZZ(*) → 4l (l = e, µ); H → τ+τ− → ll, lh. Some modes are not yet included, e.g., WW → eνeν, µνµν, lνqq; the sensitivity will improve. 60

Higgs production at the LHC 61

Statistical combination of channels. Treat systematics by means of the profile likelihood method. Considers a fixed mass hypothesis; the "look elsewhere effect" must be studied separately. NB the γγ and WW analyses also study floating-mass fits. Approximate methods for discovery/exclusion significance are valid for L > 2 fb⁻¹; for lower L, MC methods are needed. Other developments are ongoing (Bayesian, CLs, look elsewhere effect, ...). 62

Statistical model. Bin i of a given channel has n_i events, with expectation value E[n_i] = µ s_i + b_i, where µ is a global strength parameter common to all channels: µ = 0 means background only, µ = 1 is the SM hypothesis. The expected signal and background in each bin are obtained from the total rates and the shape pdfs, s_i = s_tot ∫_bin i f_s(x; θ_s) dx and b_i = b_tot ∫_bin i f_b(x; θ_b) dx, where b_tot, θ_s, θ_b are nuisance parameters. 63

The likelihood function. The single-channel likelihood function uses a Poisson model for the events in the signal and control histograms (data in the signal histogram and data in the control histogram). Here the signal rate µ is the only parameter of interest, and θ represents all nuisance parameters, e.g., background rate, shapes. There is a likelihood L_i(µ, θ_i) for each channel, i = 1, ..., N, and the full likelihood function is L(µ, θ) = ∏_{i=1}^N L_i(µ, θ_i). 64

Profile likelihood ratio. To test a hypothesized value of µ, construct the profile likelihood ratio λ(µ) = L(µ, θ̂_µ) / L(µ̂, θ̂), where the numerator is the maximized L for the given µ (conditional maximum-likelihood estimate θ̂_µ of the nuisance parameters) and the denominator is the globally maximized L. Equivalently use q_µ = −2 ln λ(µ): if the data agree well with the hypothesized µ, q_µ is small; if the data disagree with the hypothesized µ, q_µ is large. The distribution of q_µ under the assumption of µ is related to a chi-square distribution (Wilks' theorem; the approximation is valid for roughly L > 2 fb⁻¹). 65
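
A toy sketch of the profile likelihood ratio for a single counting channel, with n ~ Poisson(µs + b) in the signal region and m ~ Poisson(τb) in a control region constraining the background; this simplified model and its numbers are my own illustration, not the ATLAS combination:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

s, tau = 10.0, 1.0   # assumed expected signal and control-to-signal background ratio

def nll(mu, b, n, m):
    # -ln L for n ~ Pois(mu*s + b) in the signal region, m ~ Pois(tau*b) in the control region
    return -(poisson.logpmf(n, mu * s + b) + poisson.logpmf(m, tau * b))

def profiled_nll(mu, n, m):
    # Conditional fit: minimise -ln L over the nuisance parameter b for fixed mu
    res = minimize_scalar(lambda b: nll(mu, b, n, m),
                          bounds=(1e-6, 10.0 * (n + m + 1)), method="bounded")
    return res.fun

def q_mu(mu, n, m):
    # q_mu = -2 ln lambda(mu) = 2 [ -ln L(mu, b_hat_mu) + ln L(mu_hat, b_hat) ]
    nll_cond = profiled_nll(mu, n, m)
    nll_glob = min(profiled_nll(mu_h, n, m) for mu_h in np.linspace(0.0, 5.0, 201))
    return 2.0 * (nll_cond - nll_glob)

print(q_mu(0.0, n=25, m=12))   # q_0 for an example observation; a large value disfavours mu = 0
```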

p-value / significance of hypothesized µ. Test the hypothesized µ by giving the p-value, the probability to see data with equal or lesser compatibility with µ compared to the data observed, p_µ = ∫_{q_µ,obs}^∞ f(q_µ | µ) dq_µ. Equivalently use the significance Z, defined as the equivalent number of sigmas for a Gaussian fluctuation in one direction: Z = Φ⁻¹(1 − p). 66
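
The p-value to significance conversion is just the standard normal quantile, e.g.:

```python
from scipy.stats import norm

p = 2.87e-7
print(norm.isf(p))   # Z = Phi^{-1}(1 - p) ~ 5.0, the conventional discovery threshold
```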

Sensitivity. Discovery: generate data under the s+b (µ = 1) hypothesis; test the hypothesis µ = 0 → p-value → Z. Exclusion: generate data under the background-only (µ = 0) hypothesis; test the hypothesis µ = 1; if µ = 1 has p-value < 0.05, exclude mh at 95% CL. Estimate the median significance by setting the data equal to the expectation values (Asimov data) or by using MC. For the median, one can combine the significances of the individual channels; for the significance of e.g. a real data set, a global fit is needed. A sketch of the Asimov estimate for a simple counting experiment is given below. 67
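
For the simplest case of a single counting experiment with known background, evaluating q_0 on the Asimov data set n = s + b gives a closed-form estimate of the median discovery significance; a sketch under that simplifying assumption (the inputs are illustrative):

```python
import numpy as np

def median_discovery_Z(s, b):
    """Median significance from Asimov data n = s + b with known background b:
       Z = sqrt(q0_Asimov) = sqrt(2 * ((s + b) * ln(1 + s/b) - s))."""
    return np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s))

print(median_discovery_Z(s=10.0, b=5.0))   # compare with the naive s / sqrt(b)
print(10.0 / np.sqrt(5.0))
```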

ATLAS combined discovery significance 68

ATLAS combined 95% CL exclusion limits 69

Summary of the ATLAS Higgs combination. The study shown only uses the Higgs channels most relevant to discovery; some modes are not yet included, so the sensitivity will improve. Systematic uncertainties are treated using the profile likelihood. Other developments are ongoing (Bayesian, CLs, look elsewhere effect, ...). Due to the approximations used, the combined results are valid for L > 2 fb⁻¹. For L = 2 fb⁻¹ in ATLAS: expected discovery of 5σ or more in 143 < mh < 179 GeV; the expected 95% CL exclusion reaches down to mh = 115 GeV. Discussions are underway on ATLAS/CMS combinations. Software for the global fit and full combination is under development as part of the RooStats project (joint ATLAS/CMS/RooFit/ROOT), available as part of ROOT 5.22. 70

Outlook for data analysis at the LHC. Recent developments from machine learning provide some new tools for event classification with a number of advantages over those methods commonly used in HEP, e.g., boosted decision trees and support vector machines. Bayesian methods can allow for a more natural treatment of non-repeatable phenomena, e.g., model uncertainties; MCMC can be used to marginalize posterior probabilities. There is active development of combination methods for searches: profile likelihood, Bayesian, LEP-style, ... Software for these methods is now much more easily available; expect rapid development as the LHC begins to produce real results. 71

Quotes I like Keep it simple. As simple as possible. Not any simpler. A. Einstein If you believe in something you don't understand, you suffer,... Stevie Wonder 72

Extra slides 73

Probability quick review. Frequentist (A = outcome of a repeatable observation): P(A) is the limiting frequency of A. Subjective (A = hypothesis): P(A) is the degree of belief that A is true. Conditional probability: P(A|B) = P(A ∩ B) / P(B). Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B). 74

MCMC basics: Metropolis-Hastings algorithm. Goal: given an n-dimensional pdf p(θ), generate a sequence of points θ1, θ2, θ3, ..., using a proposal density q(θ; θ0), e.g. a Gaussian centred about θ0. 1) Start at some point θ0. 2) Generate a proposed point θ ~ q(θ; θ0). 3) Form the Hastings test ratio α = min[1, p(θ) q(θ0; θ) / (p(θ0) q(θ; θ0))]. 4) Generate u uniform in [0, 1]. 5) If u ≤ α, move to the proposed point; else the old point is repeated. 6) Iterate. 75

Metropolis-Hastings (continued). This rule produces a correlated sequence of points (note how each new point depends on the previous one). For our purposes this correlation is not fatal, but the statistical errors are larger than for independent points. The proposal density can be (almost) anything, but choose it so as to minimize the autocorrelation. Often the proposal density is taken to be symmetric, q(θ; θ0) = q(θ0; θ); the test ratio is then (Metropolis-Hastings) α = min[1, p(θ)/p(θ0)]. I.e. if the proposed step is to a point of higher p(θ), take it; if not, only take the step with probability p(θ)/p(θ0). If the proposed step is rejected, hop in place. 76
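
A minimal Metropolis implementation of the steps above, using a symmetric Gaussian proposal so that the Hastings ratio reduces to p(θ)/p(θ0); the target pdf here is a placeholder two-dimensional Gaussian, not a real posterior:

```python
import numpy as np

def metropolis(log_target, theta0, n_steps=10000, step=0.5, seed=0):
    """Generate a correlated sequence of points that follows the target pdf."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    chain = [theta.copy()]
    logp = log_target(theta)
    for _ in range(n_steps):
        proposal = theta + rng.normal(scale=step, size=theta.shape)  # symmetric proposal
        logp_prop = log_target(proposal)
        # Accept with probability min(1, p(proposal)/p(current))
        if np.log(rng.uniform()) < logp_prop - logp:
            theta, logp = proposal, logp_prop
        chain.append(theta.copy())               # if rejected, "hop in place"
    return np.array(chain)

# Placeholder target: a correlated 2D Gaussian (log-density up to a constant).
def log_target(t):
    return -0.5 * (t[0] ** 2 + (t[1] - 0.8 * t[0]) ** 2 / 0.36)

chain = metropolis(log_target, theta0=[0.0, 0.0])
burned = chain[2000:]                            # discard a burn-in period
print(burned.mean(axis=0), burned.std(axis=0))   # summarize the posterior sample
```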

Metropolis-Hastings caveats. Actually one can only prove that the sequence of points follows the desired pdf in the limit where it runs forever. There may be a burn-in period where the sequence does not initially follow the desired pdf. Unfortunately there are few useful theorems to tell us when the sequence has converged. Look at trace plots and the autocorrelation; check the result with a different proposal density; try using a number of different starting points and look at the spread in the results. 77

The "look elsewhere effect" Look for Higgs at many mh values probability of seeing a large fluctuation for some mh increased. Combined significance shown here relates to fixed mh. False discovery prob enhanced by ~ mass region explored / σm For H γγ and H WW, studied by allowing mh to float in fit: Η γγ 78

Expected discovery significance 79

Expected p value of SM vs mh 80

Example from validation exercise: ZZ(*) → 4l. Distributions of q0 for 2 and 10 fb⁻¹ from MC compared to ½χ². (One minus) cumulative distributions; band gives 68% CL limits; 5σ level indicated. 81
