Multivariate Data Analysis and Machine Learning in High Energy Physics (III)


Helge Voss (MPI-K, Heidelberg), Graduierten-Kolleg, Freiburg, 11.5.-15.5.2009

Outline: summary of the last lecture; 1-dimensional likelihood: data preprocessing and decorrelation; recap of Fisher's discriminant; neural networks.

What we learned yesterday: the Neyman-Pearson lemma. It actually states the obvious, if you think about it: if you want to know with which probability an event with measured variables x is signal or background, use the ratio of the true probability densities for observing an event with x from signal and from background, $r(\mathbf{x}) = P(\mathbf{x}|S)/P(\mathbf{x}|B)$. Who is surprised that this is the best you can do? Unfortunately, we basically never have access to the true probability densities. Hence we either try to estimate $\mathrm{PDF}_S(\mathbf{x})$ and $\mathrm{PDF}_B(\mathbf{x})$ using a kernel density estimator / multidimensional likelihood (effectively averages of the PDF over regions of the variable space, derived from training events), which runs into the curse of dimensionality; or we neglect correlations and use 1-dimensional PDFs, with no problems with dimensionality; or we try a different approach and directly determine hyperplanes in the feature space that separate signal and background events. There is no magic: you still need to tune parameters (e.g. the size of the region you average over in the PDF estimate) in order to get good results.

Naïve Bayesian classifier, often called (projective or 1D) likelihood. If the correlations between the variables are weak, the joint PDF factorises, $p(\mathbf{x}) \approx \prod_{i=1}^{D} p_i(x_i)$, and the likelihood ratio for event $i_{\text{event}}$ is

$y_{\mathcal{L}}(i_{\text{event}}) = \frac{\prod_{k \in \{\text{variables}\}} p_k^{S}\big(x_k(i_{\text{event}})\big)}{\sum_{C \in \{\text{classes}\}} \prod_{k \in \{\text{variables}\}} p_k^{C}\big(x_k(i_{\text{event}})\big)}$

where the $p_k^C$ are the one-dimensional PDFs of the discriminating variables and the classes $C$ are the signal and background types. If the correlations between the variables are really negligible, this classifier is perfect (simple, robust, well performing). If not, you seriously lose performance. How can we fix this?
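
To make the projective likelihood concrete, here is a minimal NumPy sketch (not the TMVA implementation): the 1D PDFs are approximated by normalised histograms, and `fit_1d_pdfs`, `eval_pdfs` and `likelihood_ratio` are illustrative names chosen here.

```python
import numpy as np

def fit_1d_pdfs(sample, bins=50):
    """Estimate one 1D PDF per variable from an (n_events, n_vars) array
    using normalised histograms (a simple stand-in for splines/KDE)."""
    pdfs = []
    for j in range(sample.shape[1]):
        hist, edges = np.histogram(sample[:, j], bins=bins, density=True)
        pdfs.append((hist, edges))
    return pdfs

def eval_pdfs(pdfs, x):
    """Product of the 1D PDFs evaluated at event x (naive Bayes assumption)."""
    p = 1.0
    for (hist, edges), xj in zip(pdfs, x):
        idx = np.clip(np.searchsorted(edges, xj) - 1, 0, len(hist) - 1)
        p *= max(hist[idx], 1e-12)   # avoid exact zeros
    return p

def likelihood_ratio(pdfs_sig, pdfs_bkg, x):
    ps, pb = eval_pdfs(pdfs_sig, x), eval_pdfs(pdfs_bkg, x)
    return ps / (ps + pb)
```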

What if there are correlations? Typically correlations are present: $C_{ij} = \mathrm{cov}[x_i, x_j] = E[x_i x_j] - E[x_i]E[x_j] \neq 0$ for $i \neq j$. Pre-processing: choose a set of linearly transformed input variables for which $C_{ij} = 0$ ($i \neq j$).

Decorrelation: determine the square root $C'$ of the covariance matrix $C$, i.e. $C = C'C'$. Compute $C'$ by diagonalising $C$: $D = S^{T} C S \Rightarrow C' = S \sqrt{D}\, S^{T}$. The transformation from the original variables ($x$) into the decorrelated variable space ($x'$) is then $x' = C'^{-1} x$. Attention: this is able to eliminate only linear correlations!
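
As an illustration, the square-root decorrelation can be sketched with NumPy's eigendecomposition; `sqrt_decorrelation_matrix` is an illustrative helper, and the sample is assumed to be an (n_events, n_vars) array.

```python
import numpy as np

def sqrt_decorrelation_matrix(sample):
    """Compute the inverse square root C'^-1 of the sample covariance matrix,
    so that x' = C'^-1 x has diagonal covariance (removes linear correlations only)."""
    C = np.cov(sample, rowvar=False)                   # covariance matrix C
    eigval, S = np.linalg.eigh(C)                      # D = S^T C S (S orthogonal)
    C_sqrt = S @ np.diag(np.sqrt(eigval)) @ S.T        # C' = S sqrt(D) S^T
    return np.linalg.inv(C_sqrt)

# usage: x_decorr = sample @ sqrt_decorrelation_matrix(sample).T
```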


Decorrelation: Principal Component Analysis. PCA is typically used to reduce the dimensionality of a problem and to find the most dominant features in the distribution. The eigenvectors of the covariance matrix with the largest eigenvalues correspond to the directions along which the variance of the data is largest. Sort the eigenvectors according to their eigenvalues and transform the data set into the variable space spanned by these eigenvectors: along the first dimension the data show the largest features, the smallest features are found in the last dimension. The principal component $k$ of event $i_{\text{event}}$ is

$x_k^{\mathrm{PC}}(i_{\text{event}}) = \sum_{v \in \{\text{variables}\}} \big(x_v(i_{\text{event}}) - \bar{x}_v\big)\, v_v^{(k)}$

with $\bar{x}_v$ the sample means and $v^{(k)}$ the $k$-th eigenvector. The matrix of eigenvectors $V$ obeys the relation $C\,V = V\,D$, with $C$ the covariance matrix and $D$ diagonal, so PCA eliminates (linear) correlations.
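
A corresponding PCA sketch, again assuming an (n_events, n_vars) NumPy array; `pca_transform` is an illustrative name.

```python
import numpy as np

def pca_transform(sample):
    """Project events onto the eigenvectors of the covariance matrix,
    sorted by decreasing eigenvalue (largest variance first)."""
    mean = sample.mean(axis=0)
    C = np.cov(sample, rowvar=False)
    eigval, V = np.linalg.eigh(C)            # C V = V D
    order = np.argsort(eigval)[::-1]         # sort by decreasing eigenvalue
    V = V[:, order]
    return (sample - mean) @ V               # x_PC(k) = sum_v (x_v - mean_v) v_v^(k)
```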

How to apply the pre-processing transformation? In general, the decorrelation for signal and for background variables is different; however, for a test event we do not know beforehand whether it is signal or background. What do we do? For the likelihood ratio, decorrelate signal and background independently:

$y_{\mathcal{L}}(i_{\text{event}}) = \frac{\prod_k p_k^{S}\big(x_k(i_{\text{event}})\big)}{\prod_k p_k^{S}\big(x_k(i_{\text{event}})\big) + \prod_k p_k^{B}\big(x_k(i_{\text{event}})\big)}$

$y_{\mathcal{L}}^{\text{trans}}(i_{\text{event}}) = \frac{\prod_k p_k^{S}\big(\hat{T}^{S} x_k(i_{\text{event}})\big)}{\prod_k p_k^{S}\big(\hat{T}^{S} x_k(i_{\text{event}})\big) + \prod_k p_k^{B}\big(\hat{T}^{B} x_k(i_{\text{event}})\big)}$

with all products running over $k \in \{\text{variables}\}$. For other estimators, one needs to decide on one of the two transformations.

Decorrelation at work. Example: linearly correlated Gaussians, where the decorrelation works to 100%: the 1-D likelihood on the decorrelated sample gives the best possible performance. Compare also the effect on the MVA output variable! (Plots: the correlated variables, and the same variables after decorrelation; note the different scale on the y-axis.)

Limitations of the decorrelation: in cases with non-Gaussian distributions and/or non-linear correlations, the decorrelation needs to be treated with care. How does linear decorrelation affect cases where the correlations between signal and background differ? (Plots: original correlations and after SQRT decorrelation, for signal and background.)

How does linear decorrelation affect strongly non-linear cases? (Plots: original correlations and after SQRT decorrelation, for signal and background.) Watch out before you use decorrelation blindly!

Gaussianisation: improve the decorrelation by pre-Gaussianisation of the variables. First, a transformation to achieve a uniform (flat) distribution, the rarity transform of variable $k$:

$x_k^{\text{flat}}(i_{\text{event}}) = \int_{-\infty}^{x_k(i_{\text{event}})} p_k(x_k)\, \mathrm{d}x_k, \qquad k \in \{\text{variables}\}$

where $x_k(i_{\text{event}})$ is the measured value and $p_k$ the PDF of variable $k$. The integral can be solved in an unbinned way by event counting, or by creating non-parametric PDFs (see the likelihood section). Second, make it Gaussian via the inverse error function, with $\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, \mathrm{d}t$:

$x_k^{\text{Gauss}}(i_{\text{event}}) = \sqrt{2}\; \mathrm{erf}^{-1}\big(2\, x_k^{\text{flat}}(i_{\text{event}}) - 1\big), \qquad k \in \{\text{variables}\}$

Third: decorrelate (and iterate this procedure).
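
A rough sketch of the per-variable Gaussianisation, assuming SciPy is available; the rank-based empirical CDF stands in for the unbinned "event counting", and `gaussianise` is an illustrative name.

```python
import numpy as np
from scipy.special import erfinv
from scipy.stats import rankdata

def gaussianise(column):
    """Rarity transform (empirical CDF via ranks) followed by the inverse
    error function, mapping a 1D sample to an approximately standard Gaussian."""
    n = len(column)
    flat = rankdata(column) / (n + 1.0)            # x_flat in (0, 1)
    return np.sqrt(2.0) * erfinv(2.0 * flat - 1.0)

# usage: x_gauss = np.column_stack([gaussianise(x[:, j]) for j in range(x.shape[1])])
# then apply the square-root decorrelation sketched above, and iterate if needed
```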

Gaussianisation (plots: original and Gaussianised distributions, for signal and background). We cannot simultaneously Gaussianise both signal and background?!


Linear Discriminant ( {({})wii1})18 If non parametric methods like k-nearest-neighbour or Kernel Density Estimators suffer from lack of training data curse of dimensionality slow response time need to evaluate the whole training data for every test event use of parametric models y(x) to fit to the training data yx=x,.,xd 1M ii=0i out: these lines are NOT the linear function =Watch y(x), wh(x) Linear Discriminant: yx=x,.,x=w+i.e. any linear function of the input variables: 1D0 Dgiving rise to linear decision boundaries i=xh1 x 2 How do we determine the weights w that do best?? H 0 but rather the decision boundary given by y(x)=const. x 1

Fisher's Linear Discriminant. With $y(\mathbf{x} = \{x_1, \ldots, x_D\}) = y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{i=1}^{D} w_i x_i$, how do we determine the weights $\mathbf{w}$ that do best? Maximise the separation between the two classes S and B: minimise the overlap of the distributions $y_S(\mathbf{x})$ and $y_B(\mathbf{x})$, i.e. maximise the distance between the two class means and minimise the variance within each class. Maximise

$J(\mathbf{w}) = \frac{\big(E[y_S] - E[y_B]\big)^2}{\sigma_S^2 + \sigma_B^2} = \frac{\mathbf{w}^{T} B\, \mathbf{w}}{\mathbf{w}^{T} W \mathbf{w}}$

with $B$ the "in-between" (class) variance and $W$ the "within"-class variance; note that these quantities can be calculated from the training data. Setting $\partial J(\mathbf{w})/\partial \mathbf{w} = 0$ gives the Fisher coefficients $\mathbf{w} \propto W^{-1}(\bar{\mathbf{x}}_S - \bar{\mathbf{x}}_B)$.
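
The Fisher coefficients can be computed directly from the two training samples; a minimal sketch, with `fisher_coefficients` as an illustrative name and the within-class matrix taken as the sum of the two class covariances.

```python
import numpy as np

def fisher_coefficients(x_sig, x_bkg):
    """Fisher weights w ~ W^-1 (mean_S - mean_B), with W the within-class
    covariance matrix summed over the two classes."""
    mean_s, mean_b = x_sig.mean(axis=0), x_bkg.mean(axis=0)
    W = np.cov(x_sig, rowvar=False) + np.cov(x_bkg, rowvar=False)
    return np.linalg.solve(W, mean_s - mean_b)

# y(x) = x @ w (plus an arbitrary offset w_0) then gives linear decision boundaries
```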

Linear discriminant and non-linear correlations. Assume the following non-linearly correlated data: the linear discriminant obviously does not do a very good job here. Of course, these can easily be de-correlated, e.g. by transforming to radius-angle variables, $\text{var0}' = \sqrt{\text{var0}^2 + \text{var1}^2}$ and $\text{var1}' = \mathrm{atan}(\text{var1}/\text{var0})$; on the de-correlated data the linear discriminant works perfectly.
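
For illustration, such a variable transformation could look like the following (the exact transformation used in the example plots is an assumption here):

```python
import numpy as np

def to_polar(var0, var1):
    """Map circularly correlated (var0, var1) to radius and angle, in which
    the signal/background separation becomes (approximately) linear."""
    radius = np.sqrt(var0**2 + var1**2)
    angle = np.arctan2(var1, var0)
    return radius, angle
```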

Linear discriminant with quadratic input. A simple extension from a linear to a quadratic decision boundary: use var0, var1, var0*var0, var1*var1, var0*var1 as input variables. Linear decision boundaries in these five variables correspond to quadratic decision boundaries in var0, var1. Performance of the Fisher discriminant (plots: Fisher, Fisher with decorrelated variables, Fisher with quadratic input).
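
The quadratic extension amounts to a simple feature expansion before running the Fisher discriminant; a sketch reusing the `fisher_coefficients` helper from above.

```python
import numpy as np

def quadratic_features(x):
    """Augment (var0, var1) with the quadratic terms, so that a *linear* Fisher
    discriminant in the new space gives quadratic boundaries in (var0, var1)."""
    v0, v1 = x[:, 0], x[:, 1]
    return np.column_stack([v0, v1, v0 * v0, v1 * v1, v0 * v1])

# usage: w = fisher_coefficients(quadratic_features(x_sig), quadratic_features(x_bkg))
```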

Training Classifiers and the Loss Function. The classifiers so far (kNN, likelihood, Fisher) could be calculated directly and did not require parameter fitting. Other classifiers provide a set of basis functions (or a model) whose parameters need to be adjusted optimally to find the appropriate separating hyperplane (surface). Let $\mathbf{x} \in \mathbb{R}^n$ be a random variable (i.e. our observables) and $y$ a real-valued output variable with joint distribution $\Pr(\mathbf{x}, y)$: a continuous value for regression, or $y = 1$ for signal and $y = 0$ for background in classification. We are looking for a function $y(\mathbf{x}): \mathbb{R}^n \to \mathbb{R}$ that predicts $y$ given the input variables. The loss function $L(y, y(\mathbf{x}))$ penalises errors made in the prediction, e.g. the expected prediction error $\mathrm{EPE}(y(\mathbf{x})) = E\big[(y - y(\mathbf{x}))^2\big]$ (squared error loss, the typical loss function used in regression), or $\mathrm{EPE}(y(\mathbf{x})) = E\big[|y - y(\mathbf{x})|\big]$ (misclassification error, a typical loss function for classification problems).
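
Written out for arrays of true labels and classifier outputs, the two losses are a few lines each (illustrative helper names, with a hypothetical cut at 0.5 for the misclassification error):

```python
import numpy as np

def squared_error_loss(y_true, y_pred):
    """Mean squared error, the typical regression loss."""
    return np.mean((y_true - y_pred) ** 2)

def misclassification_error(y_true, y_pred, threshold=0.5):
    """Fraction of events on the wrong side of a cut on the classifier output."""
    return np.mean((y_pred > threshold).astype(int) != y_true)
```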

Neural Networks. Naturally, if we want to allow for arbitrary non-linear decision boundaries, $y(\mathbf{x})$ needs to be constructed in a non-linear fashion: $y(\mathbf{x}) = \sum_{i}^{M} w_i h_i(\mathbf{x})$. Think of the $h_i(\mathbf{x})$ as a set of basis functions. If $h(\mathbf{x})$ is sufficiently general (i.e. non-linear), a linear combination of enough basis functions should allow one to describe any possible discriminating function $y(\mathbf{x})$; the Weierstrass theorem proves just that statement. Imagine you choose the $h_i(\mathbf{x})$ such that

$y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\Big(w_{i0} + \sum_{j=1}^{D} w_{ij} x_j\Big), \qquad A(x) = \frac{1}{1 + e^{-x}} \;\;\text{(the sigmoid function)}$

i.e. $y(\mathbf{x})$ is a linear combination of non-linear functions of linear combinations of the input data. Now we only need to find the appropriate weights $w$.

Neural Networks: the Multilayer Perceptron (MLP). But before talking about the weights, let us try to interpret the formula as a neural network: the $D$ discriminating input variables plus one offset form the input layer; they feed a hidden layer of $M$ nodes through weights $w_{ij}$, and the hidden layer feeds the output layer, with output

$y(\mathbf{x}) = \sum_{i=1}^{M} w_{0i}\, A\Big(w_{i0} + \sum_{j=1}^{D} w_{ij} x_j\Big)$

and an activation function such as the sigmoid $A(x) = (1 + e^{-x})^{-1}$ or $\tanh$. The nodes in the hidden layer represent the activation functions, whose arguments are linear combinations of the input variables (non-linear response to the input). The output is a linear combination of the outputs of the activation functions at the internal nodes. Input to each layer comes from the preceding nodes only: a feed-forward network (no backward loops). It is straightforward to extend this to several hidden layers.
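
The forward pass of this one-hidden-layer MLP is only a few lines; a sketch with illustrative names (`w_hidden` holding the $w_{ij}$, `b_hidden` the offsets $w_{i0}$, `w_out` the output weights $w_{0i}$).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(x, w_hidden, b_hidden, w_out):
    """One-hidden-layer feed-forward pass:
    y(x) = sum_i w_out[i] * A(b_hidden[i] + sum_j w_hidden[i, j] * x[j])."""
    hidden = sigmoid(b_hidden + w_hidden @ x)   # activations of the M hidden nodes
    return w_out @ hidden                       # linear combination at the output
```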

In the biological analogy behind the name, the nodes correspond to neurons and the links (weights) to synapses: a neural network tries to simulate the reaction of a brain to a certain stimulus (the input data).

Neural Network Training. Idea: using the training events, adjust the weights such that $y(\mathbf{x}) \to 0$ for background events and $y(\mathbf{x}) \to 1$ for signal events. How do we adjust? Minimise the loss function, i.e. use the usual sum of squares

$L(\mathbf{w}) = \sum_{\text{events } i} \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2, \qquad \hat{y}(C) = 1 \;\text{for}\; C = \text{signal}, \quad \hat{y}(C) = 0 \;\text{for}\; C = \text{background}$

comparing the predicted output with the true event type. $y(\mathbf{x})$ is a very wiggly function with many local minima; a global overall fit in the many parameters is possible, but not the most efficient method to train neural networks. Instead: back-propagation (learn from experience, gradually adjust your perception to match reality) and online learning (learn event by event, not only at the end of your life from all experience).

Neural Network Training: back-propagation and online learning. Start with random weights and adjust the weights in each step a bit in the direction of the steepest descent of the loss function $L$:

$\mathbf{w}^{n+1} = \mathbf{w}^{n} - \eta\, \nabla_{\mathbf{w}} L, \qquad \eta: \text{learning rate}$

With $L(\mathbf{w}) = \sum_i \big(y(\mathbf{x}_i) - \hat{y}(C_i)\big)^2$ and $y(\mathbf{x}) = \sum_i w_{0i}\, A\big(w_{i0} + \sum_j w_{ij} x_j\big)$, the gradient for the weights connected to the output node is

$\frac{\partial L}{\partial w_{0i}} = 2\big(y(\mathbf{x}) - \hat{y}(C)\big)\, A\Big(w_{i0} + \sum_{j=1}^{D} w_{ij} x_j\Big)$

while for the weights of the hidden nodes the formula is a bit more complicated; note that all these gradients are easily calculated from the training events. Training is repeated n times over the whole training data sample. How often? For online learning, the training events should be mixed randomly, otherwise you first steer in a wrong direction from which it is afterwards hard to get out again!
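
A bare-bones sketch of online learning with these gradients for the one-hidden-layer MLP above (illustrative names and default settings, not TMVA's MLP training); the `grad_pre` line is the "more complicated" hidden-layer gradient written out.

```python
import numpy as np

def train_mlp_online(x_train, targets, n_hidden=5, eta=0.01, n_epochs=100, seed=1):
    """Online (event-by-event) gradient descent with squared-error loss for the
    one-hidden-layer MLP sketched above."""
    rng = np.random.default_rng(seed)
    n_in = x_train.shape[1]
    W_h = rng.normal(scale=0.5, size=(n_hidden, n_in))   # hidden-layer weights w_ij
    b_h = np.zeros(n_hidden)                             # hidden-layer offsets w_i0
    w_out = rng.normal(scale=0.5, size=n_hidden)         # output weights w_0i

    for _ in range(n_epochs):
        for k in rng.permutation(len(x_train)):          # mix the events randomly
            x, t = x_train[k], targets[k]
            hidden = 1.0 / (1.0 + np.exp(-(b_h + W_h @ x)))
            y = w_out @ hidden
            delta = 2.0 * (y - t)                        # dL/dy for squared error
            grad_out = delta * hidden                    # dL/dw_0i
            grad_pre = delta * w_out * hidden * (1 - hidden)  # back-propagated to hidden layer
            w_out -= eta * grad_out
            W_h -= eta * np.outer(grad_pre, x)
            b_h -= eta * grad_pre
    return W_h, b_h, w_out
```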

Overtraining. Training is repeated n times over the whole training data sample; how often? It seems intuitive that a smooth decision boundary will give better results on another, statistically independent data set than a boundary that follows every training event (plots: two decision boundaries separating S and B in the $x_1$-$x_2$ plane). Stop training before you learn the statistical fluctuations in the data, and verify on an independent test sample. Possible overtraining is a concern for every tunable parameter $\alpha$ of a classifier (smoothing parameter, number of nodes, training cycles): the classification error keeps decreasing on the training sample while it turns back up on the test sample (plot: classification error vs. training cycles / $\alpha$ for the test and training samples).

Cross Validation. Many (all) classifiers have tuning parameters $\alpha$ that need to be controlled against overtraining: the number of training cycles, the number of nodes (neural net), the smoothing parameter h (kernel density estimator), etc. The more free parameters a classifier has to adjust internally, the more prone it is to overtraining; more training data give better training results, but dividing the data set into a training and a test sample reduces the training data. Cross validation: divide the data sample into, say, 5 sub-sets (Train/Test assignments rotating through the sub-sets) and train 5 classifiers $y_i(\mathbf{x}, \alpha)$, $i = 1, \ldots, 5$, where classifier $y_i(\mathbf{x}, \alpha)$ is trained without the $i$-th sub-sample. Calculate the test error

$CV(\alpha) = \frac{1}{N_{\text{events}}} \sum_{\text{events } k} L\big(y_{i(k)}(\mathbf{x}_k, \alpha)\big), \qquad L: \text{loss function}$

where each event $k$ is evaluated with the classifier that was not trained on it. Choose the tuning parameter $\alpha$ for which $CV(\alpha)$ is minimal and train the final classifier using all the data.
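
A generic k-fold cross-validation loop along these lines (illustrative function names; `train_fn` and `loss_fn` are assumed to be supplied by the user).

```python
import numpy as np

def cross_validate(x, y, alphas, train_fn, loss_fn, n_folds=5, seed=1):
    """For each tuning parameter alpha: train on k-1 folds, evaluate the loss on
    the held-out fold, average over folds; return the alpha minimising CV(alpha).
    train_fn(x_train, y_train, alpha) -> predict function; loss_fn(y_true, y_pred) -> float."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    cv = {}
    for alpha in alphas:
        losses = []
        for i in range(n_folds):
            test_idx = folds[i]
            train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            predict = train_fn(x[train_idx], y[train_idx], alpha)
            losses.append(loss_fn(y[test_idx], predict(x[test_idx])))
        cv[alpha] = np.mean(losses)
    best_alpha = min(cv, key=cv.get)
    return best_alpha, cv
```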

What is the best network architecture? Theoretically, a single hidden layer is enough for any problem, provided one allows for a sufficient number of nodes (Weierstrass theorem). "Relatively little is known concerning advantages and disadvantages of using a single hidden layer with many nodes over many hidden layers with fewer nodes. The mathematics and approximation theory of the MLP model with more than one hidden layer is not very well understood. Nonetheless, there seems to be reason to conjecture that the two-hidden-layer model may be significantly more promising than the single-hidden-layer model" (Glen Cowan; see A. Pinkus, "Approximation theory of the MLP model in neural networks", Acta Numerica (1999), pp. 143-195). Typically in high-energy physics, non-linearities are reasonably simple, and in many cases one layer with a larger number of nodes should be enough, but it might well be worth trying more layers (and fewer nodes in each layer).