Optimization of tau identification in ATLAS experiment using multivariate tools

Marcin Wolter (a), Andrzej Zemła (a,b)
(a) Institute of Nuclear Physics PAN, Kraków
(b) Jagiellonian University, Kraków

ACAT 2007, Amsterdam, 23-27 April 2007

Physics processes with τ's

Physics processes with τ's are important for the first physics data analyses.

Higgs processes:
- Standard Model Higgs (VBF, ttH): qqH → qqττ, ttH → ttττ
- MSSM Higgs (A/H, H±): A/H → ττ, H± → τν

Exotic processes:
- SUSY signatures with τ's in the final state
- extra dimensions, new theories (?)

Standard Model:
- Z → ττ, W → τν

[Diagram: a τ decay into π+ π+ π− plus π0, observed in the detector as a τ jet.]

Observed τ-jet characteristics:
- a well-collimated calorimeter cluster with a small number of associated charged tracks
- main background: QCD jets

Two algorithms used by ATLAS for τ-jet reconstruction:
- TauRec: cluster-based
- Tau1P3P: track-based

Tau1p3p algorithm

Algorithm for hadronic τ-jet reconstruction and identification:
- Starts from a good quality hadronic track (pT > 9 GeV).
- Finds nearby good quality tracks (pT > 1 GeV, ΔR < 0.2):
  - 1 track + n π0: 1-prong candidate
  - 2 tracks + n π0: 2-prong candidate (a 3-prong with one track lost, or a 1-prong with one fake track)
  - 3 tracks + n π0: 3-prong candidate

Signal and background separation:
- Based on tracking and calorimeter information.
- 1-prong: 9 identification variables; 2-prong and 3-prong: 11 identification variables.
- None of them alone provides a good separation, so a smart (multivariate) analysis is needed! A minimal sketch of the candidate classification follows below.
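A minimal sketch of the track-based classification described above, assuming a simple Track record with pT, η and φ. The thresholds (9 GeV seed, 1 GeV associated tracks, ΔR < 0.2) are those quoted on this slide; the data structures are illustrative assumptions, not the ATLAS code:

```python
import math
from dataclasses import dataclass

@dataclass
class Track:
    pt: float   # transverse momentum [GeV]
    eta: float  # pseudorapidity
    phi: float  # azimuthal angle [rad]

def delta_r(a: Track, b: Track) -> float:
    """Angular distance dR = sqrt(deta^2 + dphi^2), with dphi wrapped to [-pi, pi]."""
    dphi = (a.phi - b.phi + math.pi) % (2 * math.pi) - math.pi
    return math.hypot(a.eta - b.eta, dphi)

def prong_multiplicity(tracks: list[Track]) -> int | None:
    """Return 1, 2 or 3 for a tau candidate, or None if there is no valid seed
    (the full algorithm also handles higher track multiplicities)."""
    seeds = [t for t in tracks if t.pt > 9.0]          # seed: good track, pT > 9 GeV
    if not seeds:
        return None
    seed = max(seeds, key=lambda t: t.pt)              # take the leading seed track
    nearby = [t for t in tracks
              if t is not seed and t.pt > 1.0 and delta_r(t, seed) < 0.2]
    n = 1 + len(nearby)
    return n if n in (1, 2, 3) else None

cand = [Track(15.0, 0.10, 0.50), Track(3.0, 0.12, 0.52), Track(2.0, 0.08, 0.48)]
print(prong_multiplicity(cand))  # -> 3 (3-prong candidate)
```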

Multivariate tau identification

- Cuts: optimized with TMVA (Toolkit for MultiVariate Analysis, integrated with ROOT, http://tmva.sf.net). Variables decorrelated → many random cut sets applied → the optimal set of cuts chosen.
- PDE_RS: Probability Density Estimator with Range Searches.
- Neural network: Stuttgart Neural Network Simulator (SNNS), a feed-forward NN with two hidden layers. The trained network is converted to C code, making classification very fast.
- Support Vector Machine: a new tool in the TMVA package (implemented by us), here applied to data analysis for the first time.

Analysis performed on ATLAS MC data:
- Signal: Z → ττ, W → τν (hadronic)
- Background: QCD jet events

PDE_RS method

Method published by T. Carli, B. Koblitz, NIM A 501 (2003) 576-588; used at HERA.

- Parzen estimation (1960s): approximation of the unknown probability density as a sum of kernel functions placed at the points x_n of the training sample.
- To make it faster, count signal (n_s) and background (n_b) events in an N-dimensional hypercube around the event being classified; only a few events from the training sample are needed (PDE_RS). The hypercube dimensions are free parameters to be tuned.
- The discriminant D(x) is given by the signal and background event densities:

  D(x) = n_s / (n_s + n_b)

- Events are stored in a binary tree, allowing easy and fast finding of neighboring events. A minimal range-search sketch follows below.

Implementation developed by E. Richter-Wąs & L. Janyst.
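A minimal sketch of the PDE_RS idea on toy data, with scipy's k-d tree standing in for the binary tree of the original implementation; the two-variable Gaussian samples and the box half-width h are illustrative assumptions:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
sig = rng.normal(+1.0, 1.0, size=(5000, 2))   # toy signal training sample (2 variables)
bkg = rng.normal(-1.0, 1.0, size=(5000, 2))   # toy background training sample

sig_tree, bkg_tree = cKDTree(sig), cKDTree(bkg)

def pde_rs_discriminant(x, h=0.5):
    """D(x) = n_s / (n_s + n_b), counting training events inside an
    N-dimensional hypercube of half-width h centred on x (p=inf -> box norm)."""
    n_s = len(sig_tree.query_ball_point(x, r=h, p=np.inf))
    n_b = len(bkg_tree.query_ball_point(x, r=h, p=np.inf))
    return n_s / (n_s + n_b) if (n_s + n_b) > 0 else 0.5

print(pde_rs_discriminant([+1.0, +1.0]))  # close to 1 -> signal-like
print(pde_rs_discriminant([-1.0, -1.0]))  # close to 0 -> background-like
```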

Support Vector Machine (SVM)

Basic ideas:
- An algorithm that learns from examples.
- Builds a separating hyperplane using a minimal set of vectors from the training sample (the support vectors).
- Is able to model arbitrary functions.
- Functionally similar to neural networks.

History:
- Early 1960s: linear support vector discriminants developed for pattern recognition (Vapnik & Lerner 1963, Vapnik & Chervonenkis 1964).
- Early 1990s: non-linear extension of the SVM method (Boser 1992, Cortes & Vapnik 1995).
- 1995: regression SVM to estimate continuous functions (Vapnik 1995).

Linear SVM classifier

Separable data: the classifier is y = f(x, w, b) = sign(<w, x> - b), with labels y_i = ±1. The points defining the margin are the support vectors.

Margin maximization:
- Intuitively the best choice.
- Not sensitive to small errors.
- The separation is defined by the support vectors only.

The margin constraint and the margin width:

  y_i (<w, x_i> - b) - 1 >= 0,    margin = 2 / ||w||

so maximizing the margin means minimizing (1/2) ||w||^2.

Non-separable data: minimize instead, with an additional penalty term (slack variables ξ_i, cost parameter C):

  (1/2) ||w||^2 + C Σ_i ξ_i

This is a quadratic programming problem with a single minimum. A minimal sketch showing the support vectors of a trained linear SVM follows below.
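A minimal sketch on separable toy data, with scikit-learn's SVC standing in for the TMVA implementation discussed in the talk: only a handful of training points become support vectors, and they alone determine w, b and the margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(+2.0, 0.7, size=(200, 2)),    # y = +1 cloud
               rng.normal(-2.0, 0.7, size=(200, 2))])   # y = -1 cloud
y = np.array([+1] * 200 + [-1] * 200)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Removing any non-support vector from the training sample would leave
# the separating hyperplane unchanged.
print("support vectors:", len(clf.support_vectors_), "of", len(X))
print("w =", clf.coef_[0], " b =", clf.intercept_[0])
print("margin = 2/||w|| =", 2.0 / np.linalg.norm(clf.coef_))
```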

Non-linear extension

A transformation Φ from the input space into a high-dimensional feature space makes LINEAR separation by a hyperplane possible:

  Φ: R^n → R^N,  N > n

The classifier depends only on the scalar products of the vectors, <x_i, x_j>. Therefore we do not need to know the function Φ explicitly; it is sufficient to know the kernel function K:

  <Φ(x_i), Φ(x_j)> = K(x_i, x_j)

Commonly used kernels:
- polynomial: K(x_i, x_j) = (<x_i, x_j> + c)^d
- sigmoid: K(x_i, x_j) = tanh(κ <x_i, x_j> + θ)
- Gaussian: K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))

A minimal kernel-trick sketch follows below.
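A minimal sketch on toy data (scikit-learn as a stand-in, not the TMVA implementation): points labeled by whether they fall inside the unit circle are not linearly separable in R^2, but a Gaussian kernel separates them in its feature space.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.uniform(-2, 2, size=(600, 2))
y = np.where(np.hypot(X[:, 0], X[:, 1]) < 1.0, 1, -1)  # label: inside/outside unit circle

linear = SVC(kernel="linear").fit(X, y)
gauss = SVC(kernel="rbf", gamma=1.0).fit(X, y)         # gamma = 1/(2 sigma^2)

print("linear kernel accuracy:  ", linear.score(X, y))  # no better than the majority class
print("Gaussian kernel accuracy:", gauss.score(X, y))   # near 1: separable in feature space
```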

Why SVM?

Advantages:
- Statistically well motivated.
- The minimization is a quadratic programming problem, so it always finds the minimum.
- Only two parameters to tune (the cost C and the kernel width σ, in the case of a Gaussian kernel), regardless of the dimensionality.
- Insensitive to overtraining.

Limitations:
- Slow training (compared to a neural network) due to the computationally intensive solution of the QP problem. We have used J. Platt's SMO (Sequential Minimal Optimization) algorithm to speed it up.
- For complex problems, even 50% of the input vectors can end up as support vectors.

Discriminant distributions

[Figure (preliminary): discriminant distributions for PDE_RS, Neural Network and SVM, 3-prong events.]

Good signal and background separation.

Bkg. rejection vs. signal efficiency

[Figure (preliminary): ROC curve (identification only), background rejection 1 - eff_bkg vs. signal efficiency for 15 GeV < E_T < 25 GeV, with the cut analysis shown for comparison.]

- Algorithms trained on a subsample of all data, tested on the remaining events.
- NN, PDE_RS and SVM are significantly better than cuts.
- The SVM is trained on a small subsample of all MC events (a few thousand): good performance with a limited training sample.
- Best performance is achieved with the Neural Network.
- All multivariate methods give similar performance: are we close to the Bayesian limit? A minimal ROC sketch follows below.
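A minimal sketch of how such a curve can be computed from classifier outputs; the Gaussian toy scores are illustrative assumptions, not the analysis data:

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(3)
scores = np.concatenate([rng.normal(0.8, 0.3, 5000),   # toy signal discriminant values
                         rng.normal(0.2, 0.3, 5000)])  # toy background discriminant values
labels = np.array([1] * 5000 + [0] * 5000)

# roc_curve scans all thresholds and returns (eff_bkg, eff_sig) pairs.
eff_bkg, eff_sig, _ = roc_curve(labels, scores)
bkg_rejection = 1.0 - eff_bkg

# e.g. background rejection at 50% signal efficiency:
i = np.searchsorted(eff_sig, 0.5)
print(f"1 - eff_bkg at eff_sig = 0.5: {bkg_rejection[i]:.3f}")
```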

Summary

Tau1p3p identification is significantly improved by using multivariate analysis tools. All of the presented classification methods perform well:

- Cuts: fast, robust, transparent for users.
- Neural network: best performance; very fast classification once converted to a C function after training.
- PDE_RS: robust and transparent for users, but large samples of reference candidates are needed, and classification is slower than for the other methods.
- Support Vector Machine: not commonly used in HEP; a new implementation of the algorithm (beta release). It works and gives competitive results. Tests show that only small data samples are needed for training.

Backup slides

ATLAS (A Toroidal LHC ApparatuS)

- Muon detectors: fast response for the trigger, good momentum resolution.
- Inner detector: high-efficiency tracking, good impact parameter resolution.
- Electromagnetic calorimeters: excellent e/γ identification, good energy and angular resolution, response uniformity.
- Hadron calorimeters: good jet and E_T^miss performance.

Energy scale: e/γ ~0.1%, muons ~0.1%, jets ~1%.

40 MHz beam crossing. Readout: 160M channels (3000 km of cables). Raw data = 320 Mbyte/sec (1 TB/hour).

Ntrack spectrum

How candidates are defined: tau1p, tau2p, tau3p (from Elzbieta's talk, 22.08.2006).

- tau1p (Ntrack=1): exactly one leading good quality track, no more good quality tracks within R_core.
- tau2p (Ntrack=2): exactly two good quality tracks within R_core, one of them leading, no more associated tracks within R_core. By definition E_T^eflow is not correct here, as one track is missing for a true three-prong.
- tau3p (Ntrack=3): exactly three good quality tracks within R_core (one of them leading), or exactly two good quality tracks within R_core (one of them leading) plus one non-qualified track within R_core.
- tau3p (Ntrack=4,5): exactly three good quality tracks within R_core, …

Identification variables

A few new quantities were introduced; the full set is given below (from Elzbieta's talk, 22.08.2006).

Tracking part:
- number of associated tracks in the isolation cone (NEW!)
- weighted width of tracks with respect to the tau axis (for tau2p, tau3p only) (NEW!)
- invariant mass of tracks (for tau2p, tau3p only) (NEW!)

Calorimetry part:
- number of hit strips
- energy-weighted width in strips
- fraction of energy in the half-core cone to energy in the core cone
- energy-weighted radius in the EM part (see the sketch after this list)

Calo + tracking part:
- fraction of energy in the isolation cone to energy in the core cone, at EM scale
- invariant mass calculated from energy flow (NEW!)
- ratio of energy in the HAD part to energy of tracks

Common vertex or lifetime information is not used yet. Electron-track and muon-track vetoes are still taken from truth.
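For illustration, a minimal sketch of one such quantity, the energy-weighted radius in the EM calorimeter, under the common definition R_EM = Σ E_i ΔR_i / Σ E_i over cells i around the tau axis (an assumption for illustration; the exact ATLAS definition may differ):

```python
import numpy as np

def em_radius(cell_energies: np.ndarray, cell_dr: np.ndarray) -> float:
    """Energy-weighted radius: small values indicate a collimated tau jet,
    larger values a broader QCD jet."""
    return float(np.sum(cell_energies * cell_dr) / np.sum(cell_energies))

# Toy example: a narrow cluster gives a small radius.
energies = np.array([10.0, 5.0, 2.0, 0.5])  # cell energies [GeV]
dr = np.array([0.02, 0.05, 0.10, 0.18])     # cell distance to the tau axis
print(em_radius(energies, dr))              # ~0.04
```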

Implementation of NN

The trained network is converted into a C function and integrated into the ATHENA package. To get a flat acceptance in E_T^vis and a flat distribution of the NN output, a compensation procedure is applied; the compensation function is added to the NN function code.

Two variables are returned by the function:
- DiscriNN: the raw NN output
- EfficNN: the NN output with compensation (corrected efficiency, corrected NN output)
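One plausible reading of the compensation step (an assumption about the procedure, not the ATHENA code): transform the raw NN output by the empirical CDF of the signal, so that the transformed output is uniform for signal and a cut at any value c keeps the signal fraction 1 - c. Done per E_T^vis bin (a single bin is shown here), this would also give a flat acceptance in E_T^vis:

```python
import numpy as np

def build_compensation(sig_scores: np.ndarray):
    """Return a function mapping raw NN output -> flattened output in [0, 1]."""
    sorted_scores = np.sort(sig_scores)
    def efficnn(raw: float) -> float:
        # empirical CDF: fraction of training signal with output below `raw`
        return np.searchsorted(sorted_scores, raw) / len(sorted_scores)
    return efficnn

rng = np.random.default_rng(4)
sig_train = rng.beta(5, 2, size=10000)  # toy raw NN outputs for signal
flatten = build_compensation(sig_train)
print(flatten(0.8))                     # fraction of signal below a raw output of 0.8
```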

Support Vector Machine

Support Vector Machines

A relatively new algorithm. Basic idea:
- Build a separating hyperplane using the minimal number of vectors from the training sample (the support vectors).
- Should be able to model arbitrary functions.
- Functionally, the algorithm is similar to a neural network.

Support Vector Machine

- Early sixties: the support vector method was developed for pattern recognition using a separating hyperplane (Vapnik and Lerner 1963, Vapnik and Chervonenkis 1964).
- Early 1990s: the method was extended to non-linear separation (Boser 1992, Cortes and Vapnik 1995).
- 1995: further development to perform regression (Vapnik 1995).

Linear classifier

The classifier is f(x, w, b) = sign(<w, x> - b), with labels y_i = ±1; the separating hyperplane is the straight line <w, x> - b = 0. The support vectors are the points defining the margin:

  y_i (<w, x_i> - b) - 1 >= 0,    margin = 2 / ||w||

The maximum-margin linear classifier:
- Intuitively the best choice.
- Not sensitive to errors in the hyperplane position.
- Separation depends on the support vectors only.
- Works in practice!

Maximizing the margin = minimizing ||w||^2.

If data are not separable

Introduce additional slack variables ξ_i:
- ξ_i = 0 if x_i is correctly classified,
- ξ_i = distance to its margin if x_i is incorrectly classified.

And we get the constraints:

  y_i (<w, x_i> - b) >= 1 - ξ_i,    ξ_i >= 0

For the linear classifier (hyperplane <w, x> - b = 0), minimize:

  (1/2) ||w||^2 + C Σ_i ξ_i

where C is a free parameter, the cost variable.

Problem: data separable, but not linearly

Transform to a higher-dimensional feature space, where the data ARE linearly separable. In this example, data separable by an elliptic curve in R^2 are linearly separable in R^3. We need a universal method of transformation to the higher-dimensional space!

Kernel trick

Margin maximization using Lagrange multipliers α_i:

  W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j <x_i, x_j>

The margin depends on the dot products <x_i, x_j> only. A function Φ transforms the input space to a higher-dimensional feature space, where the data are linearly separable:

  Φ: R^n → R^N,  N > n

We can replace <Φ(x_i), Φ(x_j)> = K(x_i, x_j), giving:

  W(α) = Σ_i α_i - (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j)

The resulting algorithm is formally similar, except that every dot product is replaced by a non-linear kernel function. This allows the algorithm to fit the maximum-margin hyperplane in the transformed feature space. The transformation may be non-linear and the transformed space high-dimensional; thus, although the classifier is a hyperplane in the high-dimensional feature space, it may be non-linear in the original input space. There is no need to know the function Φ(x); it is enough to know the kernel K(x_i, x_j).

Commonly used kernels

A kernel must be symmetric: K(x_i, x_j) = K(x_j, x_i). If the kernel used is a radial basis function (Gaussian), the corresponding feature space is a Hilbert space of infinite dimension. Maximum-margin classifiers are well regularized, so the infinite dimension does not spoil the results.

Regression

We have input data X = {(x_1, d_1), ..., (x_N, d_N)}. We want to find f(x) which has small deviation from the targets d and which is maximally smooth. Define the ε-insensitive loss:

  L(f, y) = max(0, |y - f(x)| - ε)

i.e. deviations smaller than ε are not penalized. Minimize the regularized cost, (1/2) ||w||^2 plus C times the total loss, and repeat the kernel trick. A minimal sketch follows below.
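A minimal sketch of ε-insensitive support vector regression on toy data, using scikit-learn's SVR as a stand-in for the implementation discussed here:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(5)
x = np.sort(rng.uniform(0, 2 * np.pi, 200))[:, None]
d = np.sin(x).ravel() + rng.normal(0, 0.1, 200)   # noisy targets around sin(x)

# Deviations below epsilon=0.1 are not penalized; C trades smoothness
# against fidelity; the Gaussian kernel makes f non-linear in x.
model = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(x, d)
print("support vectors used:", len(model.support_))
print("f(pi/2) ~", model.predict([[np.pi / 2]])[0])  # close to sin(pi/2) = 1
```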

Non-linear kernel example

[Figure: example of a non-linear decision boundary obtained with a Gaussian kernel.]

SVM and feed-forward neural network: a comparison

- NN complexity is controlled by the number of nodes; SVM complexity doesn't depend on the dimensionality.
- A NN can fall into local minima; the SVM minimization is a quadratic programming problem and always finds the minimum.
- The SVM discriminating hyperplane is constructed in a high-dimensional space using a kernel function.

(Jianfeng Feng, Sussex University)

SVM strengths

- Statistically well motivated: one can get bounds on the error and use structural risk minimization (a theory which characterizes the generalization abilities of learning machines).
- Finding the weights is a quadratic programming problem, guaranteed to find a minimum of the error surface. Thus the algorithm is efficient, and the SVM generates a near-optimal classification and is quite insensitive to overtraining.
- Good generalization performance is obtained due to the high dimension of the feature space.

(Jianfeng Feng, Sussex University)

SVM weaknesses

- Slow training (compared to a neural network) due to the computationally intensive solution of the QP problem, especially for large amounts of training data: special algorithms are needed.
- Slow classification for the trained SVM.
- Generates complex solutions (normally > 60% of the training points are used as support vectors), especially for large amounts of training data. E.g. from Haykin: an increase in performance of 1.5% over an MLP, but the MLP used 2 hidden nodes while the SVM used 285 support vectors.
- Difficult to incorporate prior knowledge.

(Jianfeng Feng, Sussex University)

Applications in physics

While neural networks are quite commonly used, SVMs are rarely applied:
- LEP compared NN and SVM: "Classifying LEP Data with Support Vector Algorithms", P. Vannerem, K.-R. Müller, B. Schölkopf, A. Smola, S. Söldner-Rembold.
- There were some attempts at D0 and CDF (Tufts group, top quark identification).

SVM works well in other areas (for example handwriting recognition). One should look carefully; the method might be worth trying!