Statistical Methods in Particle Physics


Statistical Methods in Particle Physics
8. Multivariate Analysis
Prof. Dr. Klaus Reygers (lectures), Dr. Sebastian Neubert (tutorials)
Heidelberg University, WS 2017/18

Multi-Variate Classification

Consider events which can be either signal or background events. Each event is characterized by n observables, the "feature vector" \vec{x} = (x_1, ..., x_n).

Goal: classify events as signal or background in an optimal way. This is usually done by mapping the feature vector to a single variable, i.e., to a scalar test statistic y: R^n \to R, \vec{x} \mapsto y(\vec{x}).

A cut y > c to classify events as signal corresponds to selecting a potentially complicated hyper-surface in feature space. This is in general superior to classical "rectangular" cuts on the x_i.

The problem is closely related to machine learning (pattern recognition, data mining, ...).

Classification: Different Approaches

[Figure: decision boundaries in feature space for rectangular cuts, a linear boundary, and a nonlinear boundary separating signal from the background hypothesis H0.]

Nonlinear boundaries are provided, e.g., by k-nearest-neighbor classifiers, Boosted Decision Trees, Multi-Layer Perceptrons, and Support Vector Machines.

G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html

Signal Probability Instead of Hard Decisions

Example: test statistic y for signal and background from a Multi-Layer Perceptron (MLP).

[Figure: normalized TMVA output distributions for signal and background for the MLP and BDT classifiers (TMVA manual, Fig. 4).]

Instead of a hard yes/no decision one can also define the probability of an event to be a signal event:

P_s(y) \equiv P(S|y) = \frac{p(y|S) f_s}{p(y|S) f_s + p(y|B)(1 - f_s)}, \quad f_s = \frac{n_s}{n_s + n_b}

TMVA tutorial: https://twiki.cern.ch/twiki/bin/view/tmva
An up-to-date reference of all configuration options for the TMVA Factory and the MVA methods: http://tmva.sourceforge.net/optionref.html
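
As an illustration (not part of the lecture material), here is a minimal numpy sketch of this formula: p(y|S) and p(y|B) are estimated from histograms of classifier outputs for labeled test samples. The function name and arguments are chosen for this example only.

```python
import numpy as np

def signal_probability(y, y_sig, y_bkg, n_s, n_b, bins=50, y_range=(0.0, 1.0)):
    """Estimate P(S|y) from histogrammed classifier outputs.

    y        : classifier output(s) at which to evaluate P(S|y)
    y_sig    : classifier outputs for a signal test sample
    y_bkg    : classifier outputs for a background test sample
    n_s, n_b : expected signal and background yields (define f_s)
    """
    f_s = n_s / (n_s + n_b)
    # density estimates p(y|S) and p(y|B) from normalized histograms
    p_s, edges = np.histogram(y_sig, bins=bins, range=y_range, density=True)
    p_b, _     = np.histogram(y_bkg, bins=bins, range=y_range, density=True)
    idx = np.clip(np.digitize(y, edges) - 1, 0, bins - 1)
    num = p_s[idx] * f_s
    den = num + p_b[idx] * (1.0 - f_s)
    return np.where(den > 0, num / den, 0.0)
```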

ROC Curve

The quality of the classification can be characterized by the receiver operating characteristic (ROC curve): background rejection 1 - \epsilon_b versus signal efficiency \epsilon_s, where \epsilon_b is the background efficiency.

[Figure: background rejection versus signal efficiency ("ROC curve") obtained by cutting on the classifier outputs for the events of the test sample, shown for the Fisher, MLP, BDT, PDERS, and Likelihood methods; curves further towards the upper right are better.]
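
A short numpy sketch of how such a ROC curve could be computed from the classifier outputs of labeled signal and background test samples (illustrative only; the function name and signature are my own):

```python
import numpy as np

def roc_curve(y_sig, y_bkg):
    """ROC curve from classifier outputs of signal and background test samples.

    Returns (signal efficiency, background rejection) for all cut values y > c.
    """
    cuts = np.sort(np.concatenate([y_sig, y_bkg]))
    eff_s = np.array([(y_sig > c).mean() for c in cuts])   # signal efficiency
    eff_b = np.array([(y_bkg > c).mean() for c in cuts])   # background efficiency
    return eff_s, 1.0 - eff_b                              # background rejection = 1 - eps_b
```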

Different Approaches to Classification

The Neyman-Pearson lemma states that the likelihood ratio provides an optimal test statistic for classification:

y(\vec{x}) = \frac{p(\vec{x}|S)}{p(\vec{x}|B)}

Problem: the underlying pdfs are almost never known explicitly. Two approaches:

1. Estimate the signal and background pdfs and construct the test statistic based on the Neyman-Pearson lemma, e.g., the Naïve Bayes classifier (= likelihood classifier).
2. Determine the decision boundaries directly without approximating the pdfs (linear discriminants, decision trees, neural networks, ...).
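
To make approach 1 concrete, here is a hedged sketch of a Naïve Bayes (likelihood) classifier that approximates p(\vec{x}|S)/p(\vec{x}|B) as a product of one-dimensional histogram pdfs, ignoring correlations between the variables. This is an illustration under those assumptions, not the lecture's implementation; all names are mine.

```python
import numpy as np

def naive_bayes_ratio(x, sig_train, bkg_train, bins=30):
    """Approximate the likelihood ratio p(x|S)/p(x|B) as a product of 1d histogram pdfs.

    x          : feature vector, shape (n_features,)
    sig_train  : signal training sample, shape (n_sig, n_features)
    bkg_train  : background training sample, shape (n_bkg, n_features)
    """
    ratio = 1.0
    for i in range(len(x)):
        lo = min(sig_train[:, i].min(), bkg_train[:, i].min())
        hi = max(sig_train[:, i].max(), bkg_train[:, i].max())
        p_s, edges = np.histogram(sig_train[:, i], bins=bins, range=(lo, hi), density=True)
        p_b, _     = np.histogram(bkg_train[:, i], bins=bins, range=(lo, hi), density=True)
        j = np.clip(np.digitize(x[i], edges) - 1, 0, bins - 1)
        ratio *= p_s[j] / max(p_b[j], 1e-12)   # guard against empty background bins
    return ratio
```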

Supervised Machine Learning (I)

Supervised machine learning requires labeled training data, i.e., a training sample where for each event it is known whether it is a signal or background event.

The decision boundary is defined by minimizing a loss function ("training").

Bias-variance tradeoff:
- Classifiers with a small number of degrees of freedom are less prone to statistical fluctuations: different training samples would result in similar classification boundaries ("small variance").
- However, if the data contain features that a model with few degrees of freedom cannot describe, a bias is introduced. In this case a classifier with more degrees of freedom would be better.
- The user has to find a good balance.

Supervised Machine Learning (II)

Training, validation, and test sample:
- The decision boundary is fixed with the training sample.
- Performance on the training sample becomes better with more iterations.
- Danger of overtraining: statistical fluctuations of the training sample will be learnt.
- Validation sample = independent labeled data set not used for training, used to check for overtraining.
- Sign of overtraining: performance on the validation sample becomes worse. Stop training when signs of overtraining are observed ("early stopping").
- Performance: apply the classifier to an independent test sample.
- Often: test sample = validation sample (only a small bias).

Supervised Machine Learning (III)

Rule of thumb if training data are not expensive:
- Training sample: 50%
- Validation sample: 25%
- Test sample: 25%
- Often test sample = validation sample, i.e., training : validation/test = 50 : 50.

Cross validation (efficient use of scarce training data), see the sketch after this list:
- Split the training sample into k independent subsets T_k of the full sample T.
- Train on T \ T_k, resulting in k different classifiers.
- For each training event there is one classifier that didn't use this event for training.
- The validation results are then combined.
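
A minimal sketch of k-fold cross validation in numpy-style Python, assuming user-supplied train_fn and error_fn callables (both hypothetical placeholders, not part of the lecture):

```python
import numpy as np

def k_fold_cross_validation(X, y, k, train_fn, error_fn):
    """k-fold cross validation sketch.

    X, y     : features and true labels of the full training sample T
    train_fn : function (X_train, y_train) -> classifier
    error_fn : function (classifier, X_val, y_val) -> error rate
    Returns the validation error averaged over the k folds.
    """
    rng = np.random.default_rng(seed=0)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)          # k independent subsets T_k
    errors = []
    for i in range(k):
        val = folds[i]                      # held-out subset T_k
        train = np.concatenate([folds[j] for j in range(k) if j != i])  # T \ T_k
        clf = train_fn(X[train], y[train])
        errors.append(error_fn(clf, X[val], y[val]))
    return np.mean(errors)                  # combined validation result
```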

Estimating PDFs from Histograms?

Consider a 2d example (G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html): approximate the pdfs by the numbers of signal and background training events, N(x, y|S) and N(x, y|B), in the corresponding cells.

With M bins per variable in d dimensions one has M^d cells: it is hard to generate enough training data (often not practical for d > 1).

In general in machine learning, problems related to a large number of dimensions of the feature space are referred to as the "curse of dimensionality".

k-Nearest Neighbor Method (I)

k-NN classifier:
- Estimates the probability density around the input vector.
- p(\vec{x}|S) and p(\vec{x}|B) are approximated by the number of signal and background events in the training sample that lie in a small volume around the point \vec{x}.

The algorithm finds the k nearest neighbors: k = k_s + k_b.

Probability for the event to be of signal type:

p_s(\vec{x}) = \frac{k_s(\vec{x})}{k_s(\vec{x}) + k_b(\vec{x})}
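
An illustrative numpy sketch of this k-NN signal probability, using the Euclidean distance (the function name and arguments are assumptions for this example):

```python
import numpy as np

def knn_signal_probability(x, X_train, y_train, k=10):
    """k-nearest-neighbor estimate of the signal probability p_s(x).

    x        : feature vector, shape (n_features,)
    X_train  : training feature vectors, shape (n_events, n_features)
    y_train  : training labels, 1 for signal, 0 for background
    """
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to all training events
    nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
    k_s = np.sum(y_train[nearest] == 1)       # signal neighbors
    k_b = k - k_s                             # background neighbors
    return k_s / (k_s + k_b)
```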

k-Nearest Neighbor Method (II)

The simplest choice for the distance measure in feature space is the Euclidean distance:

R = |\vec{x} - \vec{y}|

Better: take correlations between the variables into account ("Mahalanobis distance"):

R = \sqrt{(\vec{x} - \vec{y})^T V^{-1} (\vec{x} - \vec{y})}, \quad V = covariance matrix

[Figure 14 of the TMVA manual (https://root.cern.ch/guides/tmva-manual): example of the k-nearest-neighbor algorithm in two dimensions; full (open) circles are signal (background) events, and the circle marks the nearest neighborhood of the test point.]

The k-NN classifier has best performance when the boundary that separates signal and background events has irregular features that cannot be easily approximated by parametric learning methods.

Kernel Density Estimator (KDE)

[Just mentioned briefly here]

Idea: smear the training data to get an estimate of the PDFs:

\hat{f}_h(\vec{x}) = \frac{1}{n} \sum_{i=1}^{n} K_h(\vec{x} - \vec{x}_i) = \frac{1}{n h^d} \sum_{i=1}^{n} K\!\left(\frac{\vec{x} - \vec{x}_i}{h}\right)

K = kernel, h = bandwidth (smoothing parameter), d = dimension of the feature space.

Use, e.g., a Gaussian kernel:

K(\vec{x}) = \frac{1}{(2\pi)^{d/2}} e^{-|\vec{x}|^2/2}

[Figure: kernels placed around the training points in the (x_1, x_2) plane; lecture by Niklas Berger.]
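
For illustration, a short numpy sketch of a Gaussian KDE evaluated at a single point, following the formula above and assuming one global bandwidth h (names are my own, not from the lecture):

```python
import numpy as np

def gaussian_kde(x, X_train, h):
    """Gaussian kernel density estimate f_hat(x) from a training sample.

    x        : point at which to evaluate the density, shape (d,)
    X_train  : training events, shape (n, d)
    h        : bandwidth (smoothing parameter)
    """
    n, d = X_train.shape
    u = (X_train - x) / h                                        # scaled distances
    kernel = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (d / 2)
    return kernel.sum() / (n * h**d)
```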

Gaussian KDE in 1 Dimension

[Figure: 1d Gaussian KDE examples; G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html]

Adaptive KDE: adjust the width of the kernel according to the PDF (wide where the pdf is low).

Advantage of KDE: no training!
Disadvantage of KDE: one needs to sum over all training events to evaluate the PDF, i.e., the method can be slow.

Fisher Linear Discriminant

A linear discriminant is simple and can still be optimal if the amount of training data is limited.

Ansatz for the test statistic:

y(\vec{x}) = \sum_{i=1}^{n} w_i x_i = \vec{w}^T \vec{x}

Choose the parameters w_i so that the separation between the signal and background distributions f(y|S) and f(y|B) is maximal. One needs to define "separation".

Fisher: maximize

J(\vec{w}) = \frac{(\tau_s - \tau_b)^2}{\Sigma_s^2 + \Sigma_b^2}

where \tau_{s,b} and \Sigma_{s,b} are the means and widths of the y distributions for signal and background (defined on the next slide).

G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html

Fisher Linear Discriminant: Variable Definitions

Mean and covariance for signal and background:

\mu_i^{s,b} = \int x_i \, f(\vec{x}|H_{s,b}) \, d\vec{x}
V_{ij}^{s,b} = \int (x_i - \mu_i^{s,b})(x_j - \mu_j^{s,b}) \, f(\vec{x}|H_{s,b}) \, d\vec{x}

Mean and variance of y(\vec{x}) for signal and background:

\tau_{s,b} = \int y(\vec{x}) \, f(\vec{x}|H_{s,b}) \, d\vec{x} = \vec{w}^T \vec{\mu}_{s,b}
\Sigma_{s,b}^2 = \int (y(\vec{x}) - \tau_{s,b})^2 \, f(\vec{x}|H_{s,b}) \, d\vec{x} = \vec{w}^T V^{s,b} \vec{w}

G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html

Fisher Linear Discriminant: Determining the Coefficients w_i

Numerator of J(\vec{w}):

(\tau_s - \tau_b)^2 = \left( \sum_{i=1}^{n} w_i (\mu_i^s - \mu_i^b) \right)^2
                    = \sum_{i,j=1}^{n} w_i w_j (\mu_i^s - \mu_i^b)(\mu_j^s - \mu_j^b)
                    = \sum_{i,j=1}^{n} w_i w_j B_{ij} = \vec{w}^T B \vec{w}

Denominator of J(\vec{w}):

\Sigma_s^2 + \Sigma_b^2 = \sum_{i,j=1}^{n} w_i w_j (V^s + V^b)_{ij} = \vec{w}^T W \vec{w}

Maximize:

J(\vec{w}) = \frac{\vec{w}^T B \vec{w}}{\vec{w}^T W \vec{w}} = \frac{\text{separation between classes}}{\text{separation within classes}}

G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html

Fisher Linear Discriminant: Determining the Coefficients w_i

Setting \partial J / \partial w_i = 0 gives:

y(\vec{x}) = \vec{w}^T \vec{x} \quad with \quad \vec{w} \propto W^{-1} (\vec{\mu}_s - \vec{\mu}_b)

We obtain linear decision boundaries. The weight vector \vec{w} can be interpreted as a direction in feature space onto which the events are projected.

G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html
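
A compact numpy sketch of this recipe, estimating the class means and covariance matrices from training samples and solving W w = (mu_s - mu_b). This is an illustration under those assumptions, not TMVA's implementation.

```python
import numpy as np

def fisher_weights(sig_train, bkg_train):
    """Fisher linear discriminant: w proportional to W^{-1} (mu_s - mu_b).

    sig_train, bkg_train : training samples, shape (n_events, n_features)
    Returns the weight vector w; the test statistic is y(x) = w @ x.
    """
    mu_s = sig_train.mean(axis=0)
    mu_b = bkg_train.mean(axis=0)
    # within-class matrix W = V_s + V_b (sum of the class covariance matrices)
    W = np.cov(sig_train, rowvar=False) + np.cov(bkg_train, rowvar=False)
    return np.linalg.solve(W, mu_s - mu_b)

# usage sketch: y_test = X_test @ fisher_weights(sig_train, bkg_train)
```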

Fisher Linear Discriminant: Remarks

In case the signal and background pdfs f(\vec{x}|H_s) and f(\vec{x}|H_b) are both multivariate Gaussian with the same covariance but different means, the Fisher discriminant is

y(\vec{x}) \propto \ln \frac{f(\vec{x}|H_s)}{f(\vec{x}|H_b)}

That is, in this case the Fisher discriminant is an optimal classifier according to the Neyman-Pearson lemma (as y(\vec{x}) is a monotonic function of the likelihood ratio).

The test statistic can be written as

y(\vec{x}) = w_0 + \sum_{i=1}^{n} w_i x_i

where events with y > 0 are classified as signal. This is the same functional form as for the perceptron (the prototype of neural networks).

Perceptron

Discriminant:

y(\vec{x}) = h\!\left( w_0 + \sum_{i=1}^{n} w_i x_i \right)

The nonlinear, monotonic function h is called the activation function. Typical choices for h:

\frac{1}{1 + e^{-x}} ("sigmoid"), \quad \tanh x

[Figure: single-layer perceptron with inputs x_1, ..., x_n, the weighted sum, the activation function h, and the output y(\vec{x}).]

The Biological Inspiration: the Neuron

Human brain: about 10^11 neurons, each with on average 7000 synaptic connections.

https://en.wikipedia.org/wiki/neuron

[Figure: schematic of an artificial neuron in layer l: the inputs y_1^{l-1}, ..., y_n^{l-1} are multiplied by the weights w_{1j}^{l-1}, ..., w_{nj}^{l-1}, summed, and passed through an activation function to give the output y_j^l.]

Feedforward Neural Network with One Hidden Layer

Superscripts indicate the layer number.

Hidden layer:

\phi_i(\vec{x}) = h\!\left( w_{i0}^{(1)} + \sum_{j=1}^{n} w_{ij}^{(1)} x_j \right)

Output:

y(\vec{x}) = h\!\left( w_{10}^{(2)} + \sum_{j=1}^{m} w_{1j}^{(2)} \phi_j(\vec{x}) \right)

It is straightforward to generalize this to multiple hidden layers.
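
An illustrative forward pass for this architecture with sigmoid activations; the weight and offset names W1, b1, W2, b2 are assumptions chosen for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a network with one hidden layer (sigmoid activations).

    x  : input feature vector, shape (n,)
    W1 : hidden-layer weights, shape (m, n); b1: hidden-layer offsets, shape (m,)
    W2 : output weights, shape (m,);         b2: output offset (scalar)
    """
    phi = sigmoid(b1 + W1 @ x)        # hidden-layer activations phi_i(x)
    y = sigmoid(b2 + W2 @ phi)        # network output y(x)
    return y, phi
```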

Network Training

\vec{x}_a: training event, a = 1, ..., N
t_a: correct label for training event a (e.g., t_a = 1, 0 for signal and background, respectively)
\vec{w}: vector containing all weights

Error function:

E(\vec{w}) = \frac{1}{2} \sum_{a=1}^{N} \left( y(\vec{x}_a, \vec{w}) - t_a \right)^2 = \sum_{a=1}^{N} E_a(\vec{w})

The weights are determined by minimizing the error function.

Backpropagation

Start with an initial guess \vec{w}^{(0)} for the weights and then update the weights after each training event:

\vec{w}^{(\tau+1)} = \vec{w}^{(\tau)} - \eta \, \nabla E_a(\vec{w}^{(\tau)}), \quad \eta = learning rate

Let's write the network output as follows:

y(\vec{x}) = h(u(\vec{x})) \quad with \quad u(\vec{x}) = \sum_{j=0}^{m} w_{1j}^{(2)} \phi_j(\vec{x}), \quad \phi_j(\vec{x}) = h(v_j(\vec{x})) \quad with \quad v_j(\vec{x}) = \sum_{k=0}^{n} w_{jk}^{(1)} x_k

Here we defined \phi_0 = x_0 = 1, and the sums start from 0 to include the offsets.

Weights from the hidden layer to the output:

E_a = \frac{1}{2} (y_a - t_a)^2 \;\Rightarrow\; \frac{\partial E_a}{\partial w_{1j}^{(2)}} = (y_a - t_a) \, h'(u(\vec{x}_a)) \, \frac{\partial u}{\partial w_{1j}^{(2)}} = (y_a - t_a) \, h'(u(\vec{x}_a)) \, \phi_j(\vec{x}_a)

Weights from the input layer to the hidden layer (further application of the chain rule):

\frac{\partial E_a}{\partial w_{jk}^{(1)}} = (y_a - t_a) \, h'(u(\vec{x}_a)) \, w_{1j}^{(2)} \, h'(v_j(\vec{x}_a)) \, x_{a,k}, \quad \vec{x}_a = (x_{a,1}, ..., x_{a,n})
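
A sketch of one stochastic gradient-descent step implementing these gradients for the one-hidden-layer network above. A sigmoid activation is assumed, so h'(u) = h(u)(1 - h(u)); the variable names are my own.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_update(x_a, t_a, W1, b1, W2, b2, eta=0.1):
    """One stochastic gradient step for E_a = 1/2 (y_a - t_a)^2."""
    # forward pass
    v = b1 + W1 @ x_a
    phi = sigmoid(v)                                 # hidden-layer activations
    u = b2 + W2 @ phi
    y = sigmoid(u)                                   # network output

    # backward pass
    delta_out = (y - t_a) * y * (1.0 - y)            # (y_a - t_a) h'(u)
    grad_W2 = delta_out * phi                        # dE_a / dw^(2)_{1j}
    grad_b2 = delta_out
    delta_hid = delta_out * W2 * phi * (1.0 - phi)   # propagate through hidden layer
    grad_W1 = np.outer(delta_hid, x_a)               # dE_a / dw^(1)_{jk}
    grad_b1 = delta_hid

    # gradient-descent update with learning rate eta
    return (W1 - eta * grad_W1, b1 - eta * grad_b1,
            W2 - eta * grad_W2, b2 - eta * grad_b2)
```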

Neural Network Output and Decision Boundaries

P. Bhat, Multivariate Analysis Methods in Particle Physics, inspirehep.net/record/879273

[Figure: (a) network architecture with input layer x_1, ..., x_d, hidden layer, and output layer; (b) neural network output distributions for signal and background; (c) decision boundaries in the (x_1, x_2) plane for different cuts on the NN output; (d) signal probability p(S|x_1, x_2).]

Example of Overtraining

Too many neurons/layers make a neural network too flexible, leading to overtraining.

[Figure: decision boundary on the training sample vs. the test sample; G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html]

The network "learns" features that are merely statistical fluctuations in the training sample.

Monitoring Overtraining

Monitor the fraction of misclassified events (or the error function).

[Figure (G. Cowan: https://www.pp.rhul.ac.uk/~cowan/stat_course.html): error rate versus classifier flexibility (e.g., number of nodes/layers) for the training and test samples. The optimum is the minimum of the error rate for the test sample; overtraining shows up as an increase of the test-sample error rate while the training-sample error rate keeps decreasing.]

Example: Identification of Events with Top Quarks

t\bar{t} \to W^+ b \, W^- \bar{b} \to \ell \nu b \, q \bar{q} \bar{b}

[Figure (D0 experiment, plot from inspirehep.net/record/879273): signal and background distributions (shaded = signal) of the input variables x_1 = missing E_T, x_2 = aplanarity, x_3, x_4, and of the discriminants D_LB (likelihood method) and D_NN (neural net).]

Deep Neural Networks

Deep networks: many hidden layers with a large number of neurons.

Challenges:
- Hard to train ("vanishing gradient problem")
- Training slow
- Risk of overtraining

Big progress in recent years:
- Interest in neural networks had waned before ca. 2006
- Milestone: paper by G. Hinton et al. (2006), "A fast learning algorithm for deep belief nets"
- Image recognition, AlphaGo, ...
- Soon: self-driving cars, ...

http://neuralnetworksanddeeplearning.com

Fun with Neural Nets in the Browser

http://playground.tensorflow.org

Decision Trees (I)

[Figure (arXiv:physics/0508045v1): example decision tree from MiniBooNE. Starting from the root node (S/B = 52/48), events are split successively on "PMT Hits? < 100 or >= 100", "Energy? < 0.2 GeV or >= 0.2 GeV", and "Radius? < 500 cm or >= 500 cm" into branch nodes (nodes with further branching) and leaf nodes (no further branching), e.g., S 39/1 or B 2/9.]

MiniBooNE: 1520 photomultiplier signals; goal: separation of νe from νμ events.

Leaf nodes classify events as either signal or background.

Decision Trees (II)

Ann. Rev. Nucl. Part. Sci. 61 (2011) 281-309

[Figure: (a) a decision tree cutting on x_1 and x_2 at thresholds θ_1, ..., θ_4, and (b) the corresponding partition of the (x_1, x_2) plane into the rectangular regions R_1, ..., R_5.]

Decision trees are easy to interpret and visualize: the space of feature vectors is split up into rectangular volumes (attributed to either signal or background).

How to build a decision tree in an optimal way?

Finding Optimal Cuts

The separation between signal and background is often measured with the Gini index:

G = p(1 - p)

Here p is the purity:

p = \frac{\sum_{\text{signal}} w_i}{\sum_{\text{signal}} w_i + \sum_{\text{background}} w_i}, \quad w_i = weight of event i

[The usefulness of weights will become apparent soon.]

Improvement in signal/background separation after splitting a set A into two sets B and C:

\Delta = W_A G_A - W_B G_B - W_C G_C, \quad where \quad W_X = \sum_{X} w_i
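
A small numpy sketch of a cut scan on a single variable using this Gini-based improvement Δ (illustrative only; event weights and labels are passed as arrays, and all names are assumptions):

```python
import numpy as np

def gini(weights, labels):
    """Gini index G = p (1 - p) of a weighted event set (labels: 1 = signal, 0 = background)."""
    w_tot = weights.sum()
    if w_tot == 0:
        return 0.0
    p = weights[labels == 1].sum() / w_tot      # purity
    return p * (1.0 - p)

def best_cut(x, weights, labels):
    """Scan cut values on one variable x and return the cut with the largest
    improvement Delta = W_A G_A - W_B G_B - W_C G_C."""
    best_delta, best_c = -np.inf, None
    W_A, G_A = weights.sum(), gini(weights, labels)
    for c in np.unique(x):
        left, right = x < c, x >= c             # split the set A into B and C
        delta = (W_A * G_A
                 - weights[left].sum() * gini(weights[left], labels[left])
                 - weights[right].sum() * gini(weights[right], labels[right]))
        if delta > best_delta:
            best_delta, best_c = delta, c
    return best_c, best_delta
```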

Separation Measures

[Figure: the split criteria below as a function of the signal purity p.]

Cross entropy: -p \ln p - (1 - p) \ln(1 - p)

Gini index: p(1 - p) [after Corrado Gini, who introduced it in 1912 to measure income and wealth inequalities]

Misclassification rate: 1 - \max(p, 1 - p)

Decision Tree Pruning

When to stop growing a tree? When all nodes are essentially pure? Well, that's overfitting!

Pruning: cut back a fully grown tree to avoid overtraining.

Boosted Decision Trees: Idea

Drawback of decision trees: they are very sensitive to statistical fluctuations in the training sample.

Solution: boosting
- One tree -> several trees ("forest")
- The trees are derived from the same training ensemble by reweighting events.
- The individual trees are then combined: weighted average of the individual trees.

Boosting is a general method of combining a set of classifiers (not necessarily decision trees) into a new, more stable classifier with a smaller error. Popular example: AdaBoost (Freund, Schapire, 1997).

AdaBoost (short for Adaptive Boosting)

Initial training sample:

\vec{x}_1, ..., \vec{x}_n: multivariate event data
y_1, ..., y_n: true class labels, +1 or -1
w_1^{(1)}, ..., w_n^{(1)}: event weights, equal and normalized as \sum_{i=1}^{n} w_i^{(1)} = 1

Train the first classifier f_1:

f_1(\vec{x}_i) > 0: classify as signal
f_1(\vec{x}_i) < 0: classify as background

Updating Event Weights

Define training sample k+1 from training sample k by updating the weights:

w_i^{(k+1)} = w_i^{(k)} \, \frac{e^{-\alpha_k f_k(\vec{x}_i) y_i / 2}}{Z_k}

where i is the event index and Z_k is a normalization factor chosen such that \sum_{i=1}^{n} w_i^{(k+1)} = 1.

The weight is increased if the event was misclassified by the previous classifier: "the next classifier should pay more attention to misclassified events".

At each step the classifier f_k minimizes the error rate

\varepsilon_k = \sum_{i=1}^{n} w_i^{(k)} \, I(y_i f_k(\vec{x}_i) \le 0), \quad I(X) = 1 if X is true, 0 otherwise

Assigning the Classifier Score

Assign a score to each classifier according to its error rate:

\alpha_k = \ln \frac{1 - \varepsilon_k}{\varepsilon_k}

Combined classifier (weighted average):

f(\vec{x}) = \sum_{k=1}^{K} \alpha_k f_k(\vec{x})

It can be shown that the error rate of the combined classifier satisfies

\varepsilon \le \prod_{k=1}^{K} 2 \sqrt{\varepsilon_k (1 - \varepsilon_k)}
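
Putting the weight-update, error-rate, and scoring formulas together, here is an illustrative numpy sketch of the AdaBoost loop. The train_weak argument is a hypothetical user-supplied routine that returns a weak classifier with ±1 output; none of the names come from the lecture.

```python
import numpy as np

def adaboost(X, y, train_weak, K):
    """AdaBoost sketch following the weight-update and scoring formulas above.

    X          : training events, shape (n, d)
    y          : true labels, +1 or -1
    train_weak : function (X, y, weights) -> weak classifier f with f(X) in {+1, -1}
    K          : number of boosting iterations
    Returns a combined classifier F(x) = sum_k alpha_k f_k(x).
    """
    n = len(X)
    w = np.full(n, 1.0 / n)                      # equal, normalized initial weights
    classifiers, alphas = [], []
    for k in range(K):
        f_k = train_weak(X, y, w)
        pred = f_k(X)
        eps_k = np.sum(w[y * pred <= 0])         # weighted error rate
        alpha_k = np.log((1.0 - eps_k) / eps_k)  # classifier score
        w = w * np.exp(-alpha_k * pred * y / 2.0)
        w /= w.sum()                             # renormalization factor Z_k
        classifiers.append(f_k)
        alphas.append(alpha_k)
    return lambda X_new: sum(a * f(X_new) for a, f in zip(alphas, classifiers))
```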

General Remarks on Multi-Variate Analyses

MVA methods:
- More effective than classic cut-based analyses
- Take correlations of the input variables into account

Important: find good input variables for the MVA methods
- Good separation power between S and B
- Little correlation among the variables
- No correlation with the parameters you try to measure in your signal sample!

Pre-processing:
- Apply obvious variable transformations and let the MVA method do the rest
- Make use of obvious symmetries: if, e.g., a particle production process is symmetric in polar angle θ, use |cos θ| and not cos θ as input variable
- It is generally useful to bring all input variables to a similar numerical range

H. Voss, Multivariate Data Analysis and Machine Learning in High Energy Physics, http://tmva.sourceforge.net/talks.shtml

Classifiers and Their Properties H. Voss, Multivariate Data Analysis and Machine Learning in High Energy Physics http://tmva.sourceforge.net/talks.shtml Classifiers Criteria Cuts Likelihood PDERS / k-nn H-Matrix Fisher MLP BDT RuleFit SVM Performance no / linear correlations nonlinear correlations Speed Training Response / Robust -ness Overtraining Weak input variables Curse of dimensionality Transparency 41