Statistical Tools in Collider Experiments. Multivariate analysis in high energy physics

Pauli Lectures - 06/0/01 - Nicolas Chanon - ETH Zürich

Main goals of these lessons
- Understand what multivariate analyses are
- Understand how they are used in high energy physics
- Answer the questions: what is a neural network? A boosted decision tree? Which multivariate methods are currently used in HEP?
- Become familiar with the problems related to the training and application of multivariate methods
- Be aware of the systematic uncertainties related to multivariate techniques
- Be able to understand the results of new physics searches at the Tevatron or the LHC in the form in which they are usually presented, and how they were produced

Introductory comments
- In these lectures, examples will mainly be taken from Higgs boson searches at the LHC
- The focus is on multivariate methods commonly used in the high energy physics community
- Theory will be addressed as a tool for practical usage

Exercises
- The proposed exercises follow the progress of the lectures
- The problem is inspired by Higgs searches in the H->photons channel at the LHC
- Goal: be able to estimate the sensitivity of a search for a small peak over a huge background, using multivariate methods
- 3 exercises:
  - Setting up the ROOT and TMVA environment, TMVA basics
  - Using an MVA method inside the analysis
  - Estimating the analysis sensitivity

Outline
1. Introduction
2. Multivariate methods
3. Optimization of MVA methods
4. Application of MVA methods in HEP
5. Understanding Tevatron and LHC results

Lecture 1. Introduction

Content of this lecture
- Introduction
  - Experimental problems in high energy physics
  - The problem: how to distinguish signal from background?
- Multivariate analysis examples in HEP
  - At the Tevatron
  - At the LHC
- Presentation of commonly used multivariate methods

Searching for rare signals
- Higgs and new physics cross-sections are small...
[Figure: examples of backgrounds to H -> ZZ searches; annotation: 5 orders of magnitude]

Over huge backgrounds
- To achieve a discovery, a huge background reduction rate is needed
- Example of H -> γγ: typically 9 orders of magnitude below the QCD jets background
  - Reducible background: jet-jet, photon-jet. Jets can be misidentified as photons => can be suppressed by tight photon identification criteria
  - Irreducible background: photon-photon. Non-resonant diphoton continuum => can be discriminated using kinematic properties
- Other new physics (NP)?
[Figure label: LHC (14 TeV)]

With a given detector (here, CMS)
- HCAL: |η| < 5
- ECAL: |η| < 3.0
- Tracker: |η| < 2.5
- Muons: |η| < 2.4
- Measurement made within the tracker acceptance: |η| < 2.5

Experimental issues
Experimental challenges:
- Detector calibration
- Identification of the tracks / energy deposits in the sub-detectors
- Particle reconstruction
- Particle identification
- Finding the vertex of the hard interaction among all pile-up vertices
- Discriminating the signal process against all other background processes
- ...
Multivariate methods can help with all of these.
[Figure: collision with 0 pile-up events recorded with the ATLAS detector]

Multivariate analysis: Definitions
MultiVariate Analysis:
- Set of statistical analysis methods that simultaneously analyze multiple measurements (variables) on the object studied
- Variables can be dependent or correlated in various ways
Classification / regression:
- Classification: discriminant analysis to separate classes of events, given already known results on a training sample
- Regression: analysis which estimates a continuous output variable, taking into account the correlations of the input variables
Statistical learning:
- Supervised learning: the multivariate method is trained on a sample where the result is known (e.g. Monte-Carlo simulation of signal and background)
- Unsupervised learning: no prior knowledge is required. The algorithm clusters the events in an optimal way

Event classification
- Focus here on supervised learning for classification
- Use case in particle physics: signal/background discrimination
- Assume we have two populations (signal and background) and two variables (X1 and X2)
- How to decorrelate the variables, and what decision boundary (on X1 and X2) to choose, to decide whether an event is signal or background?

Event classification
- Possible solutions: rectangular cuts, linear (Fisher) discriminant, non-linear contour
[Figure: three panels showing the decision boundary for rectangular cuts, linear (Fisher) and non-linear methods]

Multivariate analyses in HEP
- Signal/background discrimination:
  - Object reconstruction: discriminate against instrumental background (electronic noise...)
  - Object identification: e.g. electron or bottom quark identification, to improve the rejection of other objects resembling them (e.g. jets)
  - Discriminating a physics process against physics backgrounds. Many examples, e.g. single top against W+jets, H->WW against WW background...
- Improving the energy measurement via regression. This narrows the reconstructed mass peak and improves the resolution.
- Estimating the sensitivity of the analysis:
  - Sensitivity to signal exclusion or discovery: likelihood of the data to be consistent with the background-only or the signal+background hypothesis
  - Combination of many channels => exclusion limits or discoveries

MVA examples in HEP: Tevatron
Single top discovery (PhysRevLett.98.18180)
- When published, very controversial
- 36 boosted decision trees used to discriminate signal from background
- First measurement of the single top cross-section, today well established
[Figure: (a), (b) single top production diagrams; event yield versus tb+tqb decision tree output in e+jets events with 1 b-tag, for H_T < 175 GeV and H_T > 300 GeV; event yield versus M(W,b) [GeV]]

MVA examples in HEP: Tevatron
ZH -> llbb searches at CDF (PRL 5, 5180 (0))
- b-jet energy estimated with a regression neural network, to improve the dijet mass resolution
- b-tagging with neural networks, used to compute the final limits
[Figure: dijet mass distribution (GeV/c^2) with total background and ZH signal shown before and after; projection of the NN output for data, backgrounds (Z+h.f., Z+l.f., diboson, misid. Z, tt) and ZH signal; expected and observed 95% C.L. upper limit / SM versus M_H (GeV/c^2) with ± 1σ and ± 2σ bands]

MVA examples in HEP: Tevatron
Photon identification at D0 and applications (arxiv:0.4917v3)
- Neural network for photon identification based on calorimeter energy deposit and track variables in an isolation cone around the photon
- Used to identify photons and measure the diphoton+X cross-section
[Figure: fraction of events versus neural network output O_NN for Z -> l+l-γ data, Z -> l+l-γ MC (l = e, µ) and jet MC; dσ/dM_γγ (pb/GeV) versus M_γγ (GeV) for data compared with RESBOS, DIPHOX and PYTHIA, and ratio to RESBOS with PDF and scale uncertainties]

MVA examples in HEP: Tevatron
H -> γγ searches at D0 (DØ Note 6177-CONF)
- Photons are identified with the neural network (reduces fake-photon processes)
- Boosted decision tree with kinematic variables to improve the sensitivity against the diphoton continuum (+30%)
- The BDT includes the invariant mass of the diphoton system as input
[Figure: MVA output distribution for data, background and signal (x 50), DØ preliminary; 95% C.L. limits on σ x BR(γγ) relative to the SM prediction as a function of the Higgs mass [GeV], with observed limit, expected limit under the background-only hypothesis and ± 1 and ± 2 s.d. bands]

MVA examples in HEP: LHC
H -> WW -> llνν searches in CMS (CMS-PAS-HIG1-04)
- 3 channels: 0-jet, 1-jet, 2-jet
- Electron identification with a multivariate technique: 50% more background rejection for the same signal efficiency
- Boosted decision tree in the 0-jet and 1-jet channels, using kinematic variables
- Limits improved by using the BDT
[Figure: BDT output for data, signal and backgrounds (WW, W+jets, Z+jets, top, WZ/ZZ); 95% CL limit on σ/σ_SM versus Higgs mass [GeV] for the cut-based and the BDT-based H -> WW analyses (median expected, expected ± 1σ, expected ± 2σ, observed), CMS preliminary, L = 4.6 fb^-1]

MVA examples in HEP: LHC
H -> bb searches in CMS (CMS-PAS-HIG1-031)
- Searches for VH, H -> bb
- 5 channels: W -> eν, μν; Z -> ee, μμ; Z -> νν
- B-tagging selection on a likelihood discriminant (track impact parameter + secondary vertex information)
- Boosted decision trees for the kinematics
[Figure: BDT output in the W(µν)H(bb) channel for data and simulated backgrounds; expected and observed 95% CL combined upper limits on σ/σ_SM for VH(bb) versus Higgs mass [GeV], for the BDT (left) and M(jj) (right) analyses, CMS preliminary, sqrt(s) = 7 TeV, L = 4.7 fb^-1. The median expected limit and the 1σ and 2σ bands are obtained with the LHC CLs method as implemented in RooStats, as are the observed limits at each mass point.]

MVA examples in HEP: LHC
H -> γγ searches in CMS (CMS-PAS-HIG1-030)
- Hard interaction vertex identified with a BDT using diphoton kinematics and track variables
- Photon energy estimated with a BDT regression from geometry and energy deposit variables (% improvement on the limit)
[Figure: diphoton invariant mass m_γγ (GeV/c^2) for all categories combined, with data, background model and signal (5 x SM); 95% CL limit on σ(H -> γγ)/σ(H -> γγ)_SM versus m_H (GeV/c^2), with observed and median expected CLs limits and ± 1σ, ± 2σ expected bands, CMS preliminary, sqrt(s) = 7 TeV, L = 4.76 fb^-1]

MVA examples in HEP: LHC
Combination of all channels in CMS (CMS-PAS-HIG1-03)
- The combination can be seen as a grand multivariate analysis
- Limits are set with the CLs method
- Exclusion at 95% confidence level: 127-600 GeV
[Figure: 95% CL limit on σ/σ_SM versus Higgs boson mass (GeV/c^2), CMS preliminary, sqrt(s) = 7 TeV, L_int = 4.6-4.7 fb^-1; left: combined limit and individual channels H -> bb (4.7 fb^-1), H -> ττ (4.6 fb^-1), H -> γγ (4.7 fb^-1), H -> WW (4.6 fb^-1), H -> ZZ (4.7 fb^-1); right: combined observed limit with expected ± 1σ and ± 2σ bands]

Plenty of multivariate methods...
Examples of MVA methods:
- Rectangular cut optimization
- Fisher
- Likelihood
- Neural network
- Decision tree
- Support Vector Machine
- ...
Characteristics:
- Level of complexity and transparency
- Performance in terms of background rejection
- Way of dealing with non-linear correlations
- Speed of training
- Robustness when increasing the number of input variables
- Robustness against overtraining

Rectangular cuts
- Simplest multivariate method, very intuitive
- All HEP analyses use rectangular cuts, not always completely optimized
- Define the signal region: a1 < x1 < a2, b1 < x2 < b2, ...
Rectangular cut optimization:
- Grid search, Monte-Carlo sampling
- Genetic algorithm
- Simulated annealing
Characteristics:
- Difficult to discriminate signal from background if there are non-linear correlations
- Optimization difficult to handle with a high number of variables
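As an illustration of the grid search option, the sketch below scans all rectangular signal regions a1 < x1 < a2, b1 < x2 < b2 whose edges lie on a coarse grid and keeps the one with the best figure of merit. The toy Gaussian samples, the grid and the choice of s/sqrt(s+b) as figure of merit are assumptions made for this example, not taken from the lectures.

```python
import random

def significance(n_sig, n_bkg):
    """Figure of merit s / sqrt(s + b); returns 0 when no event passes."""
    total = n_sig + n_bkg
    return n_sig / total ** 0.5 if total > 0 else 0.0

def count_in_box(events, a1, a2, b1, b2):
    """Number of events with a1 < x1 < a2 and b1 < x2 < b2."""
    return sum(1 for x1, x2 in events if a1 < x1 < a2 and b1 < x2 < b2)

def grid_search(signal, background, edges):
    """Scan all boxes whose edges lie on the grid; return (best FoM, cuts)."""
    best_fom, best_cuts = 0.0, None
    for i, a1 in enumerate(edges):
        for a2 in edges[i + 1:]:
            for j, b1 in enumerate(edges):
                for b2 in edges[j + 1:]:
                    s = count_in_box(signal, a1, a2, b1, b2)
                    b = count_in_box(background, a1, a2, b1, b2)
                    fom = significance(s, b)
                    if fom > best_fom:
                        best_fom, best_cuts = fom, (a1, a2, b1, b2)
    return best_fom, best_cuts

# Toy samples (assumed Gaussian shapes, for illustration only)
rng = random.Random(1)
signal = [(rng.gauss(1.0, 0.5), rng.gauss(1.0, 0.5)) for _ in range(200)]
background = [(rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(200)]
edges = [-3.0 + 0.5 * k for k in range(13)]  # grid from -3.0 to 3.0

best_fom, best_cuts = grid_search(signal, background, edges)
print("best s/sqrt(s+b):", round(best_fom, 2), "cuts:", best_cuts)
```

The number of boxes grows very quickly with the number of variables and grid points, which is why the slide lists Monte-Carlo sampling, genetic algorithms and simulated annealing as alternatives.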

Fisher discriminant
Fisher method:
- Cut on a linear combination of the input variables: y < a.x1 + b.x2
- This corresponds to a hyperplane in the phase space of the variables
- Very efficient if the correlations are linear
- Again, difficult to handle non-linear correlations
- More easily trained than rectangular cuts
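A minimal sketch of how the coefficients a and b could be obtained for two variables, assuming the standard Fisher prescription: the coefficient vector is W^-1 (mean_signal - mean_background), where W is the summed within-class scatter matrix. The toy samples and helper names are hypothetical.

```python
import random

def mean2(events):
    """Mean of each of the two variables."""
    n = len(events)
    return (sum(e[0] for e in events) / n, sum(e[1] for e in events) / n)

def scatter2(events, mu):
    """Elements (sxx, sxy, syy) of the 2x2 scatter matrix around mu."""
    sxx = sum((e[0] - mu[0]) ** 2 for e in events)
    sxy = sum((e[0] - mu[0]) * (e[1] - mu[1]) for e in events)
    syy = sum((e[1] - mu[1]) ** 2 for e in events)
    return sxx, sxy, syy

def fisher_coefficients(signal, background):
    """Return (a, b) = W^-1 (mu_S - mu_B) for the within-class matrix W."""
    mu_s, mu_b = mean2(signal), mean2(background)
    ss, sb = scatter2(signal, mu_s), scatter2(background, mu_b)
    wxx, wxy, wyy = ss[0] + sb[0], ss[1] + sb[1], ss[2] + sb[2]
    det = wxx * wyy - wxy * wxy
    d1, d2 = mu_s[0] - mu_b[0], mu_s[1] - mu_b[1]
    # Explicit inverse of the symmetric 2x2 matrix W applied to (d1, d2)
    a = (wyy * d1 - wxy * d2) / det
    b = (-wxy * d1 + wxx * d2) / det
    return a, b

def toy(rng, n, shift):
    """Two linearly correlated Gaussian variables centred on (shift, shift)."""
    events = []
    for _ in range(n):
        u, v = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        events.append((u + shift, 0.8 * u + 0.6 * v + shift))
    return events

rng = random.Random(2)
signal, background = toy(rng, 500, 0.5), toy(rng, 500, -0.5)
a, b = fisher_coefficients(signal, background)

def fisher(event):
    """Fisher output a.x1 + b.x2, on which a single cut is placed."""
    return a * event[0] + b * event[1]

mean_sig = sum(fisher(e) for e in signal) / len(signal)
mean_bkg = sum(fisher(e) for e in background) / len(background)
print("a, b =", a, b, " mean output: signal", mean_sig, "background", mean_bkg)
```

Training only requires means and covariances, with no scan over cut values, which is consistent with the remark that the method is more easily trained than rectangular cuts.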

Likelihood estimator
- The likelihood ratio is defined by

  y_L(i) = \frac{L_S(i)}{L_S(i) + L_B(i)}, \qquad L_{S(B)}(i) = \prod_{k=1}^{n_{var}} p_{S(B),k}(x_k(i))

  where L_{S(B)}(i) is the product of the probability density functions p_{S(B),k} of each input variable x_k, evaluated for event i.
- Optimal when there is no correlation between the variables
- This likelihood method does not take correlations into account and is therefore sub-optimal in their presence
- Refinements exist to take the correlations into account
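The likelihood ratio can be sketched as follows. As a simplifying assumption, each one-dimensional probability density is modelled here by a Gaussian fitted to the training sample; in practice the densities are usually estimated from histograms or smoothed shapes. Sample shapes and function names are hypothetical.

```python
import math
import random

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fit_1d(values):
    """Mean and standard deviation of one input variable."""
    n = len(values)
    mu = sum(values) / n
    var = sum((v - mu) ** 2 for v in values) / (n - 1)
    return mu, math.sqrt(var)

def fit_class(events):
    """One (mu, sigma) pair per input variable: the assumed p_k for this class."""
    n_var = len(events[0])
    return [fit_1d([e[k] for e in events]) for k in range(n_var)]

def likelihood(event, params):
    """L(i): product over the variables of the one-dimensional densities."""
    result = 1.0
    for x, (mu, sigma) in zip(event, params):
        result *= gaussian_pdf(x, mu, sigma)
    return result

def likelihood_ratio(event, sig_params, bkg_params):
    """y_L(i) = L_S / (L_S + L_B); 0.5 if both likelihoods underflow to zero."""
    l_s = likelihood(event, sig_params)
    l_b = likelihood(event, bkg_params)
    total = l_s + l_b
    return l_s / total if total > 0.0 else 0.5

# Toy training samples (assumed uncorrelated Gaussians)
rng = random.Random(4)
signal = [(rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)) for _ in range(500)]
background = [(rng.gauss(-1.0, 1.0), rng.gauss(-1.0, 1.0)) for _ in range(500)]
sig_params, bkg_params = fit_class(signal), fit_class(background)

print(likelihood_ratio((1.0, 1.0), sig_params, bkg_params))
print(likelihood_ratio((-1.0, -1.0), sig_params, bkg_params))
```

Because only one-dimensional densities enter the product, any correlation between x1 and x2 is ignored, which is the limitation stated on the slide.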

Neural network
- Most commonly used: the multi-layer perceptron
- Composed of neurons taking as input a linear combination of the outputs of the neurons in the previous layer
- An activation function (usually tanh) transforms the linear combination
- The weights of each neuron are found during the training phase by minimizing the error on the neural network output
- Neural networks are universal approximators: they take advantage of correlations
- Quite stable against overtraining and against an increasing number of variables
[Figure: multi-layer perceptron with an input layer (x1 to x4 plus a bias node), a hidden layer (5 neurons plus a bias node) and an output layer (y_ANN), connected by weights w]
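The response of such a network for one event can be sketched as below, for 4 inputs, 5 hidden neurons and 1 output as in the figure. All weight and bias values are arbitrary placeholders (in a real application they come from the training), and applying tanh to the output neuron as well is an assumption of this example.

```python
import math

def mlp_output(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """Forward pass of a one-hidden-layer perceptron with tanh activations."""
    hidden = []
    for weights, bias in zip(hidden_weights, hidden_biases):
        # Each hidden neuron sees a linear combination of the inputs plus a bias
        activation = bias + sum(w * xi for w, xi in zip(weights, x))
        hidden.append(math.tanh(activation))
    # The output neuron combines the hidden-layer outputs in the same way
    return math.tanh(output_bias + sum(w * h for w, h in zip(output_weights, hidden)))

# Placeholder weights: 5 hidden neurons, each with 4 input weights
hidden_weights = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.8, 0.0, 0.2],
    [0.1, 0.1, -0.6, 0.4],
    [0.0, -0.5, 0.3, 0.7],
    [0.9, 0.0, -0.1, -0.4],
]
hidden_biases = [0.0, 0.1, -0.1, 0.2, 0.0]
output_weights = [1.0, -0.7, 0.4, 0.3, -0.2]
output_bias = 0.05

event = (0.2, -1.0, 0.5, 1.5)  # hypothetical values of x1..x4
y_ann = mlp_output(event, hidden_weights, hidden_biases, output_weights, output_bias)
print("y_ANN =", y_ann)
```

Training would adjust the 31 weights and biases of this layout so that y_ANN is close to the target value for signal and background training events.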

Decision tree
- A decision tree is a binary tree: a sequence of cuts partitioning the phase space of the input variables
- Repeated yes/no decisions on single variables are taken for an event until a stop criterion is fulfilled
- Trained to maximize the purity of signal nodes (or the impurity of background nodes)
- Decision trees are extremely sensitive to statistical fluctuations of the training sample, and therefore prone to overtraining
- To stabilize their performance, different techniques are used:
  - Boosting
  - Bagging
  - Random forests
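A single training step of a decision tree, the choice of the cut at one node, can be sketched as follows. The Gini index p(1-p) is assumed here as the purity criterion (other criteria exist), and the four events are made up for the example. A full tree would apply the same search recursively to the two daughter nodes until a stop criterion is met; boosting, bagging and random forests are not shown.

```python
def weighted_gini(labels):
    """n * p * (1 - p) for a node, with p the signal purity (labels: 1 = signal)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return n * p * (1.0 - p)

def best_split(events, labels):
    """Return (score, variable index, cut value) minimizing the summed impurity."""
    best = None
    n_var = len(events[0])
    for k in range(n_var):
        values = sorted(set(e[k] for e in events))
        # Candidate cuts: midpoints between consecutive observed values
        for lo, hi in zip(values, values[1:]):
            cut = 0.5 * (lo + hi)
            left = [l for e, l in zip(events, labels) if e[k] < cut]
            right = [l for e, l in zip(events, labels) if e[k] >= cut]
            score = weighted_gini(left) + weighted_gini(right)
            if best is None or score < best[0]:
                best = (score, k, cut)
    return best

# Hypothetical training events (x1, x2) with labels 1 = signal, 0 = background
events = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = [0, 0, 1, 1]

score, variable, cut = best_split(events, labels)
print("split on variable", variable, "at", cut, "with impurity", score)
```

In this example the first variable separates the two classes perfectly, so the search selects it; with fluctuating training samples the selected cuts can change considerably, which is the instability mentioned above.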

Support Vector Machine
- Idea: build a hyperplane that separates signal and background vectors (events) using only a subset of all training vectors (the support vectors)
- The position of the hyperplane is found by maximizing the margin between it and the support vectors
- Non-linear separation is obtained by mapping the inputs into a higher-dimensional space with a non-linear transformation, using kernel functions such as the Gaussian basis
- SVM can be competitive with NN and BDT but is often less discriminating: data are often non-separable, which makes the result sensitive to all the SVM parameters
- In some cases this method performs very well
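Once an SVM is trained, its response for an event depends only on the support vectors. The sketch below evaluates the usual decision function, a weighted sum of Gaussian kernels centred on the support vectors. The support vectors, coefficients, bias and kernel width gamma are made-up values; finding them (the margin maximization) is not shown.

```python
import math

def gaussian_kernel(u, v, gamma):
    """exp(-gamma * |u - v|^2): similarity between two events."""
    dist2 = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-gamma * dist2)

def svm_decision(x, support_vectors, coefficients, bias, gamma):
    """Sum of coefficient_i * K(sv_i, x) + bias; positive means signal-like.

    coefficients[i] stands for alpha_i * y_i, with y_i = +1 for a signal
    support vector and -1 for a background support vector.
    """
    return bias + sum(
        c * gaussian_kernel(sv, x, gamma)
        for sv, c in zip(support_vectors, coefficients)
    )

# Hypothetical result of a training: 2 signal and 2 background support vectors
support_vectors = [(1.0, 1.0), (1.5, 0.5), (-1.0, -1.0), (-0.5, -1.5)]
coefficients = [0.8, 0.6, -0.9, -0.5]
bias = 0.0
gamma = 0.5

signal_like = svm_decision((1.2, 0.8), support_vectors, coefficients, bias, gamma)
background_like = svm_decision((-1.0, -1.2), support_vectors, coefficients, bias, gamma)
print(signal_like, background_like)
```

The kernel width and the other training parameters have to be tuned for each problem, which relates to the sensitivity to the SVM parameters noted above.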

Training and application
Training / test samples
- For all multivariate methods, two samples are used:
  - Training sample
  - Test sample
- This is mandatory to check that the training has converged to a solution which does not depend on the statistical fluctuations of the training sample
- Generally speaking, an MVA should be applied (or tested) on events where the response is not known
- Training is time-consuming, especially when increasing the number of variables (and depending on the method)
- Application is usually faster: it only uses the set of weights entering the computation of the MVA output
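The need for an independent test sample can be illustrated with a deliberately overtrained classifier. The sketch below uses a 1-nearest-neighbour rule, chosen only because it memorizes the training sample; the toy Gaussian samples are assumptions of this example.

```python
import random

def make_sample(rng, n):
    """Labelled toy events: label 1 = signal, 0 = background (overlapping Gaussians)."""
    events = []
    for _ in range(n):
        events.append(((rng.gauss(0.5, 1.0), rng.gauss(0.5, 1.0)), 1))
        events.append(((rng.gauss(-0.5, 1.0), rng.gauss(-0.5, 1.0)), 0))
    return events

def nearest_neighbour_label(x, training):
    """Label of the training event closest to x (1-nearest-neighbour)."""
    def dist2(item):
        p = item[0]
        return (p[0] - x[0]) ** 2 + (p[1] - x[1]) ** 2
    return min(training, key=dist2)[1]

def error_rate(sample, training):
    """Fraction of events in `sample` that are misclassified."""
    wrong = sum(1 for x, label in sample
                if nearest_neighbour_label(x, training) != label)
    return wrong / len(sample)

rng = random.Random(3)
training = make_sample(rng, 100)  # 200 events used for the training
test = make_sample(rng, 100)      # 200 independent events

train_error = error_rate(training, training)
test_error = error_rate(test, training)
print("error on training sample:", train_error)
print("error on test sample:    ", test_error)
```

Evaluated on its own training sample the classifier looks perfect, because every event is its own nearest neighbour. Only the test sample reveals the real performance, and a large difference between the two numbers is the signature of overtraining.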

Which method to choose?
Table from the TMVA manual.
MVA methods (columns): Cuts, Likelihood, PDE-RS / k-NN, PDE-Foam, H-Matrix, Fisher / LD, MLP, BDT, RuleFit, SVM
Criteria (rows):
- Performance: no or linear correlations; non-linear correlations
- Speed: training; response
- Robustness: overtraining; weak variables
- Curse of dimensionality
- Transparency