Boosted Top Tagging with Deep Neural Networks
Jannicke Pearkes, University of British Columbia, Engineering Physics
Wojtek Fedorko, Alison Lister, Colin Gay
Inter-Experimental Machine Learning Workshop, March 22nd, 2017
Overview
Introduction
Method: Monte Carlo samples; network architecture & training
Results: preprocessing; pT dependence; pileup dependence
Learning what is being learnt
Next steps
Introduction
[Diagram: at low top pT the decay products (b, W) are resolved as separate jets; at high top pT the decay products merge into a single large-radius jet. Image: Emily Thompson]
Goal: train a deep neural network to discriminate between jets originating from top quarks and those originating from QCD background
Monte Carlo Samples
Signal: Z' → ttbar; background: dijet
Generated with PYTHIA v8.219, NNPDF23 LO AS 0130 QED PDF
DELPHES v3.4.0 using the default CMS card; jets clustered from DELPHES energy-flow objects
Anti-kT jets selected with R = 1.0; trimming performed with the kT algorithm, R = 0.2 and pT frac = 5%
Signal jets are selected where a truth top decays hadronically within ΔR = 0.75 of a large-radius jet
Jets are required to have |η| ≤ 2.0
Jets are subsampled to be flat in pT and signal-matched in η
Looking at jets with pT between 600 and 2500 GeV
~4 million signal jets and ~4 million background jets
Sample divided 80% / 10% / 10% into training, validation and testing sets
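The flat-pT subsampling mentioned above can be sketched as follows. This is a minimal illustration only: the function name, binning and keep-probability scheme are assumptions, not details from the talk.

```python
import numpy as np

def subsample_flat_pt(pt, n_bins=19, pt_range=(600.0, 2500.0), rng=None):
    """Subsample jets so the surviving set is approximately flat in pT.

    Keeps each jet with probability min_bin_count / count(its bin), so every
    pT bin ends up with roughly the same number of jets. `pt` is a 1D array
    of jet pT values in GeV. Returns a boolean keep-mask.
    """
    rng = rng or np.random.default_rng(0)
    counts, edges = np.histogram(pt, bins=n_bins, range=pt_range)
    keep_prob = counts.min() / np.maximum(counts, 1)
    bin_idx = np.clip(np.digitize(pt, edges) - 1, 0, n_bins - 1)
    return rng.random(len(pt)) < keep_prob[bin_idx]
```

In practice one would apply this separately to signal and background so both end up flat over the same 600-2500 GeV range.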
Examples of Jet Images
[Figure: jet pT per pixel in the translated (pseudorapidity, azimuthal angle) plane for three signal jets (pT = 781, 1480, 2358 GeV) and three background jets (pT = 702, 1370, 2376 GeV)]
Jet images are typically very sparse: roughly 5-10% pixel activation on average when using a 0.1 x 0.1 grid [1]
[1] L. de Oliveira, M. Kagan, L. Mackey, B. Nachman, and A. Schwartzman, Jet-images -- deep learning edition, JHEP 07 (2016) 069, arXiv:1511.05190 [hep-ph].
Neural Network Inputs
Use a sequence of jet constituents rather than an image
Advantages:
No loss of information due to pixelization into an image
Inputs are more information-dense: using 120 constituents, the average activation is 30%-50%
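As a sketch of how a variable-length constituent list becomes a fixed-size network input: sort by descending pT, truncate to 120 constituents, and zero-pad. The flat (pT, eta, phi) layout and zero-padding are assumptions for illustration; the talk specifies only the 120-constituent cap (and, as the preprocessing slides describe, actually orders constituents by subjet first).

```python
import numpy as np

def constituents_to_input(constituents, n_max=120):
    """Flatten a variable-length list of (pT, eta, phi) tuples into a
    fixed-length vector: sort by descending pT, truncate to n_max,
    zero-pad the remainder."""
    arr = np.asarray(sorted(constituents, key=lambda c: -c[0]), dtype=float)
    arr = arr[:n_max] if len(arr) else np.zeros((0, 3))
    out = np.zeros((n_max, 3))
    out[: len(arr)] = arr
    return out.ravel()  # shape (3 * n_max,)
```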
Training and Network Architecture
Network type: fully connected
Number of layers: 5, [300, 150, 50, 10, 5, 1]
Number of free parameters: 41,323
Activation function: rectified linear units, sigmoid on output
Optimizer: Adam
Loss: binary cross-entropy
Early stopping: patience of 5
Implemented with Keras
Initially planned on using an LSTM, but ended up using a fully connected network: performance of the two was very similar, but the fully connected networks were much faster to train (~10 times), which allowed faster experimentation with preprocessing techniques and network architectures
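The talk's model was implemented in Keras; as a dependency-free illustration of the layer stack [300, 150, 50, 10, 5, 1] with ReLU hidden units and a sigmoid output, the forward pass can be sketched in plain numpy. The initialization scheme and function names here are illustrative assumptions; training (Adam, binary cross-entropy, early stopping with patience 5) would be done in Keras as stated on the slide.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_net(n_in, widths=(300, 150, 50, 10, 5, 1), rng=None):
    """Build (weight, bias) pairs for a fully connected stack."""
    rng = rng or np.random.default_rng(0)
    params, prev = [], n_in
    for w in widths:
        params.append((rng.normal(0.0, np.sqrt(2.0 / prev), (prev, w)),
                       np.zeros(w)))
        prev = w
    return params

def forward(params, x):
    """ReLU on hidden layers, sigmoid on the output layer."""
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        x = sigmoid(x) if i == len(params) - 1 else relu(x)
    return x  # per-jet top-jet probability
```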
Preprocessing
Preprocessing
Large-radius (R = 1.0) jets are trimmed using R = 0.2 subjets found with the kT algorithm, with pT frac = 5%
Order subjets by subjet pT, and jet constituents by constituent pT within each subjet
We use only the 120 highest-pT jet constituents
Preprocessing uses domain knowledge about the physics at hand
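The subjet-based ordering can be sketched as follows, assuming subjet finding has already been done. The data layout, which carries only constituent pT rather than full four-vectors, is a simplification for illustration.

```python
def order_constituents(subjets, n_max=120):
    """subjets: list of (subjet_pt, [constituent_pt, ...]) pairs.
    Returns up to n_max constituent pTs, ordered by subjet pT first
    (highest-pT subjet's constituents come first), then by constituent
    pT within each subjet."""
    ordered = []
    for _, consts in sorted(subjets, key=lambda s: -s[0]):
        ordered.extend(sorted(consts, reverse=True))
    return ordered[:n_max]
```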
No Preprocessing
[ROC curve: background rejection vs. top tagging efficiency, jet pT = 600-2500 GeV]
Trimming only: AUC = 0.83, R(ε=50%) = 8.85, R(ε=80%) = 3.36
Scale
Scale the pT of all jet constituents by a common factor so that constituent pT lies approximately between 0 and 1
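A minimal sketch of the scaling step, assuming a single global factor tied to the upper end of the jet pT range; the talk does not state the exact constant, so the value 1/2500 GeV^-1 here is an assumption.

```python
import numpy as np

def scale_pt(constituent_pt, scale=1.0 / 2500.0):
    """Scale all constituent pT values by one common factor so they fall
    roughly in [0, 1]. A single global factor (rather than a per-jet one,
    e.g. 1/jet-pT) keeps relative pT information between jets intact."""
    return np.asarray(constituent_pt, dtype=float) * scale
```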
Scale
[ROC curves: trimming only vs. scaling, jet pT = 600-2500 GeV]
Scaling: AUC = 0.900, R(ε=50%) = 21.3, R(ε=80%) = 6.02
Translate
Center the jet about its highest-pT subjet in the (η, φ) plane
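The translation step might look like this, assuming the highest-pT subjet axis (eta0, phi0) is already known. The pseudorapidity difference is a plain shift; the azimuthal difference must be wrapped into (-π, π].

```python
import numpy as np

def translate(eta, phi, eta0, phi0):
    """Center constituents on the highest-pT subjet at (eta0, phi0).
    Wraps azimuthal differences into the interval (-pi, pi]."""
    deta = np.asarray(eta, dtype=float) - eta0
    dphi = np.mod(np.asarray(phi, dtype=float) - phi0 + np.pi, 2 * np.pi) - np.pi
    return deta, dphi
```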
Translate
[ROC curves: trimming only, scale, translation, jet pT = 600-2500 GeV]
Translation: AUC = 0.924, R(ε=50%) = 33.2, R(ε=80%) = 8.48
Rotate
Designed a method of rotation that preserves jet mass:
Transform (pT, η, φ) into (px, py, pz)
Rotate so that the second-highest-pT subjet is aligned with the negative y-axis
Transform (px, py, pz) back to (pT, η, φ)
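A hedged sketch of a mass-preserving rotation: any rigid rotation applied to all constituent momenta preserves each |p| and all opening angles between constituents, and therefore the jet mass. Taking the rotation axis to be the x-axis is an illustrative simplification of the talk's exact construction.

```python
import numpy as np

def rotate_about_x(p, p_sub2):
    """Rotate all constituent momenta (rows of p: px, py, pz) about the
    x-axis so the second-highest-pT subjet momentum p_sub2 lands on the
    negative y-axis of the (y, z) plane."""
    angle = np.arctan2(p_sub2[2], p_sub2[1])  # angle of p_sub2 from +y toward +z
    alpha = np.pi - angle                     # rotate so it ends at angle pi (= -y)
    c, s = np.cos(alpha), np.sin(alpha)
    R = np.array([[1.0, 0.0, 0.0],
                  [0.0, c, -s],
                  [0.0, s, c]])
    return p @ R.T
```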
Rotate
[ROC curves: trimming only, scale, translation, rotation, jet pT = 600-2500 GeV]
Rotation: AUC = 0.932, R(ε=50%) = 42.3, R(ε=80%) = 9.57
Flip
The third subjet is not constrained by the steps above, but can be moved to the right half of the plane:
Flip the jet if its pT-weighted average position lies in the left half of the plane
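The flip can be sketched as follows, taking the horizontal coordinate to be translated pseudorapidity and the average to be pT-weighted; both are assumptions about the slide's exact definition.

```python
import numpy as np

def flip(x, pt):
    """Mirror the jet in the vertical axis if the pT-weighted mean
    horizontal coordinate lies in the left half-plane, so the
    unconstrained third subjet always ends up on the right."""
    x = np.asarray(x, dtype=float)
    if np.average(x, weights=pt) < 0:
        x = -x
    return x
```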
Flip
[ROC curves: trimming only, scale, translation, rotation, flip, jet pT = 600-2500 GeV]
Flip: AUC = 0.933, R(ε=50%) = 44.3, R(ε=80%) = 9.75
Performance on Truth vs Reconstructed Jets
Performance after preprocessing
[ROC curves, jet pT = 600-2500 GeV: DNN (truth), τ32 (truth), DNN (reco), τ32 (reco)]
Performance at 50% overall signal efficiency
[Plots: signal efficiency and background rejection vs. jet pT for truth and reconstructed jets]
Truth jets: AUC = 0.947, R(ε=50%) = 66, R(ε=80%) = 13
Reconstructed jets: AUC = 0.933, R(ε=50%) = 44, R(ε=80%) = 9.7
Pileup
Performance at different levels of pileup
[ROC curves, jet pT = 600-2500 GeV: no pileup, pileup = 23, pileup = 50]
Extremely stable performance with respect to pileup
Performance at different levels of pileup
[Plots: signal efficiency and background rejection vs. jet pT for no pileup, pileup = 23 and pileup = 50]
pT dependence is also stable with respect to pileup
Learning what is being learnt
Jet Mass
[Left: DNN output vs. jet mass for background jets (flat pT distribution, 600 < jet pT < 2500 GeV). Right: P(jet mass | DNN output) for signal and background jets]
Next Steps
Short term:
Revisit LSTMs
Thorough Bayesian hyper-parameter optimization
Longer term:
Both top and W tagging with deep neural networks are now reasonably well established on Monte Carlo, but does it work on data?
Start working towards evaluating the performance of these techniques on data
Investigate the effects of systematics and strategies for mitigating their impact
Thank you! 29
W-tagging performance on truth jets
[Figure from: QCD-Aware Recursive Neural Networks for Jet Physics. Louppe, Cho, Becot, Cranmer. https://arxiv.org/abs/1702.00748]
Zooming
[Figure from: Parton Shower Uncertainties in Jet Substructure Analyses with Deep Neural Networks. Barnard, Dawe, Dolan, Rajcic. https://arxiv.org/pdf/1609.00607v2.pdf]
Performance when trained and tested on different levels of pileup
[Plots: signal efficiency and background rejection vs. jet pT for networks trained at µ = 0, 23 and 50, each tested at µ = 0, 23 and 50]
Examined how a neural network trained at one pileup level performs at another level of pileup
The NN seems relatively robust to the changes in pileup expected at the LHC in the next few years
τ32
[Left: DNN output vs. τ32 (wta) for background jets (flat pT distribution, 600 < jet pT < 2500 GeV). Right: P(τ32 (wta) | DNN output) for signal and background jets]