Multivariate Analysis Techniques in HEP

Jan Therhaag, IKTP Institutsseminar, Dresden, January 31st, 2013

Outline:
- Multivariate analysis in a nutshell
- Neural networks: defeating the black box
- Boosted decision trees: crowd wisdom

Separating signal and background: taking optimal decisions
- Nomenclature: think of an event as an ensemble of measured features (variables) x = (x_1, ..., x_D).
- The best separation of signal and background is based on the likelihood ratio lambda(x) = p(x|S) / p(x|B).
- Problem 1: Usually no analytical expression for p(x|S) and p(x|B) is available, so we resort to MC and histograms.
- Problem 2: We are not dealing with one variable but with many, so the curse of dimensionality kicks in.
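As a minimal sketch (not from the talk) of the histogram-based approach in one dimension: estimate p(x|S) and p(x|B) from MC samples and form the likelihood-ratio discriminant. The toy samples and binning are purely illustrative.

import numpy as np

# Toy MC samples for one discriminating variable (illustrative only)
rng = np.random.default_rng(42)
signal_mc = rng.normal(loc=1.0, scale=1.0, size=10000)
background_mc = rng.normal(loc=-1.0, scale=1.5, size=10000)

# Estimate p(x|S) and p(x|B) with normalized histograms
bins = np.linspace(-6, 6, 61)
p_s, _ = np.histogram(signal_mc, bins=bins, density=True)
p_b, _ = np.histogram(background_mc, bins=bins, density=True)

def likelihood_ratio(x, eps=1e-12):
    """Histogram estimate of lambda(x) = p(x|S) / p(x|B)."""
    idx = np.clip(np.digitize(x, bins) - 1, 0, len(p_s) - 1)
    return p_s[idx] / (p_b[idx] + eps)

# Events with a large ratio are classified as signal-like
print(likelihood_ratio(np.array([2.0, -2.0])))

With many input variables this direct histogramming becomes infeasible, which is exactly the curse of dimensionality mentioned above and the reason to use multivariate methods.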

Separating signal and background: MVA to the rescue
- Goal: examine the MC and condense all relevant variables into one optimal discriminator, i.e. reconstruct the optimal decision boundary.
- The method must be flexible enough to model the underlying distributions.
- It must also be rigid enough to deal with sparsely populated regions.


[Figures adapted from Zamora-López et al. (2010).]

The problem. [Figure: signal and background events in the (x_1, x_2) plane]

The single neuron as a classifier

A simple approach:
- Assign discrete values to the classes (here: blue = -1, orange = +1).
- Perform a linear fit to this discrete target.
- Define the decision boundary as the set of points where the fitted function crosses zero.
[Figure: linear fit and resulting decision boundary in the (x_1, x_2) plane]

Now consider the sigmoid transformation a -> sigma(a) = 1 / (1 + exp(-a)): sigma(a) takes values in [0,1] and can be interpreted as the probability p(orange|x) (then obviously p(blue|x) = 1 - p(orange|x) = sigma(-a)).

We have just invented the neuron! Given inputs x_1, ..., x_N and weights w = {w_0, w_1, ..., w_N}, the neuron computes the activation a = w_0 + w_1 x_1 + ... + w_N x_N and the activity y = sigma(a). y is called the activity of the neuron, while a is called the activation; most of the time we will only consider the activity. The neuron's behavior is entirely controlled by the weights w. [Figure: diagram of a neuron with inputs 1, x_1, ..., x_N and weights w_0, w_1, ..., w_N]
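A minimal sketch of such a neuron in NumPy (not from the talk; names are illustrative):

import numpy as np

def sigmoid(a):
    """Logistic sigmoid, mapping the activation to [0, 1]."""
    return 1.0 / (1.0 + np.exp(-a))

def neuron(x, w):
    """Single neuron: w[0] is the bias weight w_0, w[1:] are the input weights.

    Returns the activity y = sigma(a), interpretable as p(orange | x).
    """
    a = w[0] + np.dot(w[1:], x)   # activation
    return sigmoid(a)             # activity

# Example: two inputs, arbitrary weights
print(neuron(np.array([0.5, -1.2]), np.array([0.1, 2.0, -0.5])))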

Possible realizations of the neuron: the weight space

The training proceeds via minimization of the error function E(w); the neuron learns via gradient descent, w -> w - eta * grad E(w), where eta is the learning rate. [Figure: behavior of gradient descent for a learning rate below the optimal one, from LeCun, Bottou, Orr, Mueller: Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998]
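A minimal sketch of gradient-descent training for the single neuron. It assumes the cross-entropy error (introduced explicitly later in the talk) and labels t in {0, 1}, and uses the convenient fact that the error gradient of a logistic neuron is X^T (y - t).

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_neuron(X, t, eta=0.1, epochs=1000):
    """Batch gradient descent for a single logistic neuron.

    X: (n_events, n_features) inputs, t: labels in {0, 1}.
    Returns weights w with w[0] the bias.
    """
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        y = sigmoid(Xb @ w)            # activities for all events
        grad = Xb.T @ (y - t)          # gradient of the cross-entropy error
        w -= eta * grad / len(t)       # gradient descent step
    return w

# Toy example: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
t = np.hstack([np.zeros(200), np.ones(200)])
print(train_neuron(X, t))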

From neurons to networks

The universal approximation theorem. Let sigma(.) be a non-constant, bounded, and monotone-increasing continuous function, and let C([0,1]^D) denote the space of continuous functions on the D-dimensional hypercube. Then, for any given function f in C([0,1]^D) and any eps > 0, there exist an integer M and sets of real constants alpha_i, b_i and w_ij (i = 1, ..., M; j = 1, ..., D) such that

F(x) = sum_{i=1}^{M} alpha_i sigma( sum_{j=1}^{D} w_ij x_j + b_i )

is an approximation of f, that is |F(x) - f(x)| < eps for all x in the hypercube.

Simple version: you can build any continuous function from neurons!

This architecture is known as a feedforward network:
- Neurons are organized in layers.
- The output of a neuron in one layer becomes the input for the neurons in the next layer.

Any continuous function can be approximated with arbitrary precision. Function complexity is determined by the number of hidden units (neurons) and the characteristic magnitude of the weights. [Figure: network with 1 input, 3 hidden neurons z_1, z_2, z_3 and 1 output, fitted to training data]
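A sketch of the forward pass of such a network (1 input, 3 hidden units, 1 sigmoid output). The choice of tanh hidden activations and the random weights are assumptions for illustration only.

import numpy as np

def forward(x, W1, b1, w2, b2):
    """Forward pass of a 1-hidden-layer feedforward network.

    x: scalar input, W1/b1: hidden weights and biases (shape (3,)),
    w2/b2: output weights and bias. Returns the network output in [0, 1].
    """
    z = np.tanh(W1 * x + b1)          # activities of the 3 hidden neurons
    a = np.dot(w2, z) + b2            # output activation
    return 1.0 / (1.0 + np.exp(-a))   # sigmoid output

# Random weights just to illustrate the shapes involved
rng = np.random.default_rng(1)
print(forward(0.3, rng.normal(size=3), rng.normal(size=3),
              rng.normal(size=3), 0.0))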

Overtraining in neural networks: a network with many neurons may adapt too well to the training data.
- Test data may be used to monitor convergence / stop training, but training data is valuable, especially in high-dimensional problems.
- Undesirable effects may develop locally before a satisfactory global configuration is found.
- The network size may be limited, but the complexity of the problem is often not known beforehand.
[Figure: NN with 10 hidden units]

Regularization by weight decay: penalizing large weights explicitly, e.g. by minimizing E(w) + (alpha/2) sum_i w_i^2, avoids overtraining.
- Complexity is not limited from the start.
- Decision boundaries are smoothed.
[Figure: distribution of the weights connecting inputs to hidden neurons, without weight decay (left) and with weight decay (right); NN with 10 hidden units and alpha = 0.04]
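A sketch of how the weight-decay term changes the gradient-descent update from the earlier single-neuron example. Leaving the bias unregularized is a common convention assumed here, not something stated in the talk.

import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, alpha=0.04):
    """One gradient step on the regularized error E(w) + alpha/2 * |w|^2.

    w[0] is the bias and is left unregularized here.
    """
    penalty_grad = alpha * w
    penalty_grad[0] = 0.0                 # do not shrink the bias
    return w - eta * (grad_E + penalty_grad)

# Example: with a vanishing data gradient, the non-bias weights decay toward zero
w = np.array([0.5, 3.0, -2.0])
for _ in range(100):
    w = weight_decay_step(w, grad_E=np.zeros(3))
print(w)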

Bayesian neural networks

Network training as inference
- Reminder: the output of a neuron may be interpreted as a probability, p(t=1|x,w) = y(x,w). The corresponding negative log-likelihood, E(w) = - sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ], is our old error function, remember?
- Similarly, we can interpret the weight decay term as a log probability distribution for w: p(w|alpha) proportional to exp(-(alpha/2) |w|^2), which looks like a Gaussian.
- Bayes' theorem then gives the posterior for the weights, p(w|D,alpha) = p(D|w) p(w|alpha) / p(D|alpha), i.e. likelihood times prior over a normalization.


Optimal network complexity. Let's take another look at the posterior for the network parameters w:
p(w|D,alpha) = p(D|w) p(w|alpha) / p(D|alpha).
We want to find the regularization strength alpha that best explains the data. The optimal regularization can be found by evaluating the evidence for a given alpha:
p(D|alpha) = integral p(D|w) p(w|alpha) dw.

The evidence framework. The integral involved in the evaluation of the evidence is analytically intractable. Way out: the Laplace approximation. Consider a peaked function f(x) with its maximum at x_0 and Taylor-expand its logarithm around the maximum,
ln f(x) approx ln f(x_0) - (A/2)(x - x_0)^2, with A = -d^2 ln f(x)/dx^2 evaluated at x_0,
so that
integral f(x) dx approx f(x_0) sqrt(2 pi / A).
This is exact if f(x) is Gaussian!
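A quick numerical check of the Laplace approximation on a one-dimensional toy integrand (an illustration, not part of the talk):

import numpy as np

# A peaked but non-Gaussian integrand
def f(x):
    return np.exp(-x**2 / 2 - 0.1 * x**4)

def log_f(x):
    return -x**2 / 2 - 0.1 * x**4

x0 = 0.0                       # position of the maximum
h = 1e-4                       # finite-difference step
A = -(log_f(x0 + h) - 2 * log_f(x0) + log_f(x0 - h)) / h**2

laplace = f(x0) * np.sqrt(2 * np.pi / A)

xs = np.linspace(-6.0, 6.0, 4001)
numeric = np.trapz(f(xs), xs)
print(laplace, numeric)        # close, but not identical: f is not exactly Gaussian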

The evidence framework. Within the Laplace approximation, the log evidence separates into a best-fit term and Occam terms that penalize overly complex models.

The evidence framework. [Figure: evidence as a function of model complexity]

The evidence framework. [Figure: NN with 8 hidden neurons; green: optimal decision boundary, black: result obtained without regularization, red: result after optimizing alpha using the evidence]

Input variable relevance determination
- In a real-life HEP problem, it may not be obvious which variables have predictive power.
- Good variables may be selected automatically if we allow for individual regularization of the input weights: useless variables have their weights reduced, so including more variables does not hurt the overall performance.
- This strategy is known as Automatic Relevance Determination (ARD).

Decision trees and boosting

Decision trees and boosting: a perfect match (?)
- Decision trees are a natural extension of binary cuts: most events are not exactly background- or signal-like but share properties of both classes; instead of discarding events that fail one criterion, look at the remaining criteria to classify them correctly.
- Boosting refers to the process of combining several weak learners into a powerful one: it is a very general approach, not limited to decision trees; it helps to stabilize classifier performance by smoothing out statistical features; and it embodies the principles of crowd wisdom and majority vote.
...but what makes BDTs so popular in HEP?

Growing a decision tree
1. The root node of the tree corresponds to the full sample of events.
2. Sort all events by each variable.
3. Go through the sorted lists and find the best split value for each variable; the figure of merit is the purity of signal/background in the two subsamples generated by the split.
4. Select the variable which provides the best split.
5. Create two branches which contain the events passing/failing the optimal split identified in step 4.
6. Repeat recursively from step 2, using the new nodes created in step 5 as starting points (variables may be used more than once).
7. Stop when a stopping criterion is reached (maximum number of nodes, minimum number of events in a final node, etc.).
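A minimal sketch of this recursive procedure, using the Gini index as the figure of merit; the impurity choice and the stopping parameters are illustrative assumptions, not the exact settings used in the talk.

import numpy as np

def gini(y):
    """Gini impurity of a set of labels y in {0, 1}."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y)
    return p * (1.0 - p)

def best_split(X, y):
    """Find the (variable, value) split that minimizes the weighted impurity."""
    best = (None, None, gini(y))
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            left, right = y[X[:, j] <= cut], y[X[:, j] > cut]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (j, cut, score)
    return best

def grow_tree(X, y, depth=0, max_depth=3, min_events=10):
    """Recursively grow a tree; leaves return the signal purity."""
    j, cut, _ = best_split(X, y)
    if depth >= max_depth or len(y) < min_events or j is None:
        return np.mean(y)                       # leaf: signal purity
    left, right = X[:, j] <= cut, X[:, j] > cut
    return {"var": j, "cut": cut,
            "pass": grow_tree(X[right], y[right], depth + 1, max_depth, min_events),
            "fail": grow_tree(X[left], y[left], depth + 1, max_depth, min_events)}

Prediction then walks the tree: an event passing the cut follows the "pass" branch and a failing event the "fail" branch, until a leaf purity is returned.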

Growing a decision tree: an example (adapted from Y. Coadou's talk at SOS2012)
- Consider a sample of events (signal + background) described by three variables.
- Sort all events by each variable.
- Look for the best split in each variable; in the example the three candidate splits yield separations of 3, 5 and 0.7 (arbitrary units).
- Split the events into those that pass or fail the best cut.
- Repeat recursively on the resulting subsamples.

Growing a decision tree: adjustable parameters
- Variables to consider (see the next slide).
- Measure of separation: maximize purity, i.e. minimize impurity. Common choices for the impurity of a node with signal purity p are the misclassification error 1 - max(p, 1 - p), the cross entropy -p ln p - (1 - p) ln(1 - p), and the Gini index p(1 - p). Note: significance is not a good figure of merit. [Figure: the three split criteria as a function of signal purity]
- Criterion to declare a terminal node ("leaf"): minimum number of events in the leaf, perfect purity already reached in the leaf, no split offering sufficient improvement (careful!), or maximum number of nodes / maximum depth of the tree reached.
- Continuous or binary output for each event: one can return the purity of the terminal leaf or just assign each event to the class which dominates in the terminal leaf. This seems confusing, but we will see that there are preferred choices once boosting comes into play.

Strengths of decision trees
- Non-informative variables do not disturb the tree's performance: if a variable offers no useful split, it is never used.
- Duplicate variables do not change the tree.
- The order of the training events does not matter.
- The order of the variables does not matter.
- Continuous and discrete variables are handled in the same manner.
- Monotonic transformations of variables (rescaling, unit changes, etc.) leave the tree unchanged.
- Good immunity against outliers.

Limitations of decision trees
- The output is discrete: there are only as many values as terminal nodes.
- Instability: small changes in the training sample can lead to a very different tree structure.
- Deep trees are needed to map complex features of the input space; the statistics in the sub-regions created by the tree become small, the classifier picks up fluctuations, and overtraining occurs.
- Identification of powerful variables is not straightforward: one variable may shadow a correlated one, and sometimes a less beneficial split may lead to a very powerful one later.

Boosting: exploiting the wisdom of crowds
- Idea: creating a single powerful classifier is hard, but creating many simple ones is easy, so combine several weak learners to form a powerful one.
- First proposal by Schapire (1990): train classifier T1 on N events; train T2 on N different events, half of which were misclassified by T1; train T3 on the events where T1 and T2 disagree; the final classification is the majority vote of T1, T2 and T3.
- The incorporation of new ideas by Freund and Schapire led to the invention of AdaBoost (1996), which continues to be the boosting algorithm most widely used in HEP.
- BDT = decision trees + AdaBoost.

AdaBoost
1. Train a classifier T_m on the training sample with event weights w_i.
2. Calculate the misclassification rate err_m (the weighted fraction of misclassified events).
3. Derive the tree weight alpha_m = ln((1 - err_m) / err_m).
4. Increase the weights of the misclassified events, w_i -> w_i exp(alpha_m).
5. Train the next classifier on the reweighted sample and repeat steps 1-4.
6. The final classification function after M iterations is the weighted sum of trees, F(x) = sum_m alpha_m T_m(x).
[Figure: signal and background distributions of the current tree output and of the weighted sum of trees after the 1st, 10th and 150th tree]
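A compact sketch of discrete AdaBoost with decision stumps as weak learners. The stump implementation and the label convention y in {-1, +1} are illustrative assumptions.

import numpy as np

def train_stump(X, y, w):
    """Weighted decision stump: (variable, cut, sign) minimizing the weighted error."""
    best, best_err = None, np.inf
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            for sign in (+1, -1):
                pred = sign * np.where(X[:, j] > cut, 1, -1)
                err = np.sum(w[pred != y])
                if err < best_err:
                    best, best_err = (j, cut, sign), err
    return best, best_err

def stump_predict(stump, X):
    j, cut, sign = stump
    return sign * np.where(X[:, j] > cut, 1, -1)

def adaboost(X, y, M=50):
    """Discrete AdaBoost: y must be in {-1, +1}. Returns the ensemble F(x)."""
    n = len(y)
    w = np.ones(n) / n
    stumps, alphas = [], []
    for _ in range(M):
        stump, err = train_stump(X, y, w)
        err = err / np.sum(w)                         # misclassification rate
        alpha = np.log((1 - err) / max(err, 1e-12))   # tree weight
        pred = stump_predict(stump, X)
        w *= np.exp(alpha * (pred != y))              # boost misclassified events
        stumps.append(stump)
        alphas.append(alpha)
    return lambda Xnew: np.sign(sum(a * stump_predict(s, Xnew)
                                    for a, s in zip(alphas, stumps)))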

AdaBoost + decision trees = great off-the-shelf performance
- It can be shown that the misclassification rate of the final classifier is bounded (on the training sample!) by prod_m 2 sqrt(err_m (1 - err_m)), which converges to zero as long as each err_m < 0.5. Prone to overtraining?
- In practice, the best test performance is reached when the base learners are weak: convergence is slower, but there is little overtraining. Many classifiers are then needed, so a fast base learner is desirable.
- Trees are perfectly suited for combination with AdaBoost: the misclassification rate can be adjusted by the tree depth, small trees are very robust and fast to train, almost no tuning is necessary, and the discrete output structure of the trees is smoothed by averaging.

AdaBoost: a closer look
- Boosting: f(x) = sum_m alpha_m T_m(x), with the basis functions being e.g. decision trees.
- Neural net: f(x) = sum_j w_j sigma(a_j(x)), with the basis functions being neurons.
- Both learning approaches can be understood as an additive expansion, but boosting is a greedy algorithm, i.e. it determines one term at a time, leaving the other terms untouched. Boosting is therefore computationally less expensive.

AdaBoost: a closer look
- Generally, an additive expansion takes the form f(x) = sum_{m=1}^{M} beta_m b(x; gamma_m), where gamma_m are the parameters of the base learner (split variables and values for trees, weights for neurons, etc.).
- These models are optimized by minimizing a loss function on the training data, min sum_i L(y_i, f(x_i)), which is computationally expensive and only feasible in particular cases (neural networks, etc.).
- AdaBoost is an example of approximation by forward stagewise additive modeling:
1. Initialize f_0(x) = 0.
2. For m = 1 to M:
   a. Compute (beta_m, gamma_m) = argmin_{beta, gamma} sum_i L(y_i, f_{m-1}(x_i) + beta b(x_i; gamma)).
   b. Set f_m(x) = f_{m-1}(x) + beta_m b(x; gamma_m).

AdaBoost: can we do better?
- It turns out that the loss function minimized by AdaBoost is the exponential loss L(y, f(x)) = exp(-y f(x)).
- The margin y f(x) is positive / negative for correctly classified / misclassified events, so there is a large penalty for misclassified events: the exponential loss is sensitive to outliers and noisy settings.
- Other loss functions may be better suited. Consider the binomial deviance, log(1 + exp(-2 y f(x))): it is asymptotically linear for large negative margins and gives a better balance in spreading the influence among the data.
[Figure: exponential loss and binomial deviance as a function of the margin y f(x)]

Gradient boosting
- No reweighting prescription similar to AdaBoost exists for loss functions other than the exponential loss.
- The best general approach to minimize an arbitrary loss function: calculate the gradient of the loss function with respect to the current model values f(x_i) and grow the next tree to minimize the least-squares error between the tree output and the gradient value at each training data point. This is gradient boosting.
[Figure: test error vs. number of terms for stumps, 10-node trees and 100-node trees compared with AdaBoost on a two-class example with labels y = +/-1]
- To be fair: I've never seen such dramatic differences in HEP classification problems (but gradient boosting seems to perform better in regression tasks).
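A sketch of gradient boosting for binary classification with the binomial deviance (logistic) loss: each iteration fits a simple regression tree to the negative gradient of the loss. The regression stump used as base learner and the learning rate are illustrative assumptions.

import numpy as np

def regression_stump(X, r):
    """Fit a 1-split regression tree (stump) to the residuals r by least squares."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for cut in np.unique(X[:, j]):
            left, right = r[X[:, j] <= cut], r[X[:, j] > cut]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = np.sum((left - left.mean())**2) + np.sum((right - right.mean())**2)
            if sse < best_sse:
                best_sse, best = sse, (j, cut, left.mean(), right.mean())
    return best

def stump_value(stump, X):
    j, cut, vl, vr = stump
    return np.where(X[:, j] <= cut, vl, vr)

def gradient_boost(X, y, M=100, eta=0.1):
    """Gradient boosting with logistic loss; y in {-1, +1}. Returns the score F(x)."""
    F = np.zeros(len(y))
    stumps = []
    for _ in range(M):
        # negative gradient of the logistic loss log(1 + exp(-y F))
        residual = y / (1.0 + np.exp(y * F))
        stump = regression_stump(X, residual)
        F += eta * stump_value(stump, X)
        stumps.append(stump)
    return lambda Xnew: eta * sum(stump_value(s, Xnew) for s in stumps)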

References and further reading. Figures taken from (if not stated otherwise):
- David MacKay: Information Theory, Inference and Learning Algorithms, Cambridge University Press 2003
- Christopher Bishop: Pattern Recognition and Machine Learning, Springer 2006
- Hastie, Tibshirani, Friedman: The Elements of Statistical Learning, 2nd Ed., Springer 2009
These books are also recommended for further reading.

BACKUP

The training proceeds via minimization of the error function E(w); the neuron learns via gradient descent.
- Move in weight space: examples may be learned all at once (batch learning) or one-by-one (online learning).
- Going through the training data once is called an epoch.
[Figure: gradient descent trajectory in weight space, from LeCun, Bottou, Orr, Mueller: Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998]

Overtraining: diverging weights may lead to overfitting, and the probabilities assigned by the neuron become too confident. [Figure: evolution of E(w)]

...and how to avoid it
- Very simple approach: stop after a fixed number of iterations (early stopping).
- Next-to-simple approach: monitor convergence with a test sample.
- Principled approach: introduce regularization via a weight decay term.
[Figure: evolution of the weights w0, w1, w2 and of the resulting decision boundaries for the three approaches]

From neuron training to network training: backpropagation
- Remember: training the network means minimizing the error function E(w).
- Recall the single neuron: the derivative of the error with respect to a weight is dE/dw_i = (y - t) x_i.
- It turns out that the same structure holds for the network: dE/dw_ji = delta_j z_i, with delta_k = y_k - t_k for output neurons and delta_j = sigma'(a_j) sum_k w_kj delta_k for hidden neurons.
- While the input information is always propagated forward, the errors delta are propagated backwards!
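A sketch of one backpropagation step for the 1-hidden-layer network from before (tanh hidden units, sigmoid output, cross-entropy error); the shapes and activation choices are assumptions consistent with the earlier sketches.

import numpy as np

def backprop_step(x, t, W1, b1, w2, b2, eta=0.1):
    """One gradient-descent update for a network with one tanh hidden layer.

    x: input vector, t: target in {0, 1}. Returns the updated parameters.
    """
    # forward pass
    a1 = W1 @ x + b1          # hidden activations
    z = np.tanh(a1)           # hidden activities
    a2 = np.dot(w2, z) + b2   # output activation
    y = 1.0 / (1.0 + np.exp(-a2))

    # backward pass: errors are propagated backwards
    delta_out = y - t                            # delta for the output neuron
    delta_hidden = (1 - z**2) * w2 * delta_out   # deltas for the hidden neurons

    # gradients: dE/dw = delta times the input of that connection
    W1 -= eta * np.outer(delta_hidden, x)
    b1 -= eta * delta_hidden
    w2 -= eta * delta_out * z
    b2 -= eta * delta_out
    return W1, b1, w2, b2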

Network complexity vs. regularization
- In the limit H -> infinity (H = number of hidden units), the function complexity is entirely determined by the typical size of the weights.
- How much regularization do we need?
[Figure: typical output functions with H = 400 and weights randomly sampled from Gaussians with different standard deviations]

Summary: Neurons and Networks
- A feedforward neural net is composed of neurons arranged in layers.
- It can approximate any continuous function to arbitrary precision.
- Training of a neural net can be accomplished efficiently via backpropagation (see Backup).
- The complexity of the function represented by the NN is controlled by the typical size of the weights.

The evidence framework (derivation sketch)
- Start from the evidence p(D|alpha) = integral p(D|w) p(w|alpha) dw.
- Plug in the expressions for the likelihood and the prior.
- Use the Laplace approximation around the most probable weights w_MP.
- Perform the resulting Gaussian integral.

Predictions and confidence
- Goal: predict the class of a new data point x.
- Standard approach: calculate the network's output using the most probable weights w_MP obtained in training.
- Problem: shouldn't we be less confident about points in sparsely populated regions?

Using the posterior to make predictions. Instead of using w_MP alone, we can also exploit the full information in the posterior by averaging the network output over it:
p(t|x, D) = integral p(t|x, w) p(w|D) dw.

Using the posterior to make predictions. [Figure: NN with 8 hidden neurons, alpha optimized using the evidence; left: predictions using w_MP, right: predictions using a Gaussian approximation to the posterior]

Summary: Bayesian NN
- Neural network training may be interpreted as an inference task.
- Evaluating the evidence allows one to determine the optimal amount of regularization, i.e. the model complexity.
- Using independent regularization parameters for different input variables can help to select relevant inputs (ARD).
- Using the full posterior to make predictions naturally incorporates uncertainties (see Backup).

Practical hints for NN users (based on recommendations given in LeCun, Bottou, Orr, Mueller: Efficient BackProp, Neural Networks: Tricks of the Trade, Springer 1998)

Neural networks: good practice
- Prefer online over batch learning: it avoids redundancy (useful in HEP), which means a speed improvement, and it may help to avoid local minima. [Figure: gradient descent trajectory in weight space]
- Shuffle the input samples: the network learns more efficiently if successive samples are not from the same class.
- Subtract the means from the input variables: non-zero means create large eigenvalues in the Hessian matrix H of the error function, and the eigenvalues of H determine the speed of convergence. [Figure: convergence behavior for learning rates below, at and above the optimal value]
- Normalize the variances of the input variables: a large spread in variation between input variables produces a very eccentric error surface, with eigenvalues that vary strongly in size.
- Decorrelate the input variables (at least if your NN features adaptive learning rates): for decorrelated inputs, the eigenvectors of H point in the direction of the coordinates, and individual learning rates based on the different eigenvalues may be assigned.
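A minimal sketch of this preprocessing chain (mean subtraction, variance normalization, decorrelation via the eigenvectors of the covariance matrix); the implementation details are illustrative, not taken from the talk.

import numpy as np

def preprocess(X):
    """Center, normalize and decorrelate the input variables.

    X: (n_events, n_variables). Returns the transformed inputs and the
    parameters needed to apply the same transformation to new data.
    """
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    Xn = (X - mean) / std                   # zero mean, unit variance
    cov = np.cov(Xn, rowvar=False)          # remaining correlations
    eigval, eigvec = np.linalg.eigh(cov)
    Xd = Xn @ eigvec                        # rotate onto the eigenvectors (decorrelate)
    return Xd, (mean, std, eigvec)

def apply_preprocessing(Xnew, params):
    mean, std, eigvec = params
    return ((Xnew - mean) / std) @ eigvec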

Neural network software to play with
- NeuroBayes: proprietary software implementing a Bayesian neural network; efficient automatic preprocessing; very fast.
- TMVA MLP: open source, included in ROOT; regularization can optionally be turned on and uses the approximations for the evidence shown here; the next version can directly interface to NeuroBayes; a complete manual is available from tmva.sourceforge.net.

Summary: Decision Trees and Boosting
- Single decision trees are robust with respect to variable choice and transformations, but unstable with respect to statistical features of the training data.
- Boosting combines several weak learners into a powerful one; it works best with small trees and has good out-of-the-box performance.
- AdaBoost is an example of forward stagewise additive modeling and minimizes the exponential loss L(y, f(x)) = exp(-y f(x)).
- Some shortcomings of AdaBoost can be cured by choosing a different loss function, such as the binomial deviance; this leads to the gradient boosting algorithm.
[Figure: exponential loss and binomial deviance as a function of the margin y f(x)]