Advanced statistical methods for data analysis Lecture 2

Advanced statistical methods for data analysis, Lecture 2. RHUL Physics, www.pp.rhul.ac.uk/~cowan. Universität Mainz, Klausurtagung des GK Eichtheorien exp. Tests..., Bullay/Mosel, 15–17 September 2008.

Outline
- Multivariate methods in particle physics
  - Some general considerations
  - Brief review of statistical formalism
- Multivariate classifiers:
  - Linear discriminant function
  - Neural networks (Lecture 2 starts here)
  - Naive Bayes classifier
  - k-Nearest Neighbour method
  - Decision trees
  - Support Vector Machines

Linear decision boundaries. A linear decision boundary is only optimal when both classes follow multivariate Gaussians with equal covariances and different means. [Figure: two scatter plots in the $(x_1, x_2)$ plane.] For some other cases a linear boundary is almost useless.

Nonlinear transformation of inputs. We can try to find a transformation of the input variables, $x_1, \ldots, x_n \to \varphi_1(\vec{x}), \ldots, \varphi_m(\vec{x})$, so that the transformed feature-space variables can be separated better by a linear boundary. Here, for example, we guess fixed basis functions (no free parameters): $\varphi_1 = \tan^{-1}(x_2/x_1)$, $\varphi_2 = \sqrt{x_1^2 + x_2^2}$. [Figure: the data before and after the transformation.]
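
To make the idea concrete, here is a minimal Python sketch (not from the slides; the toy data and function names are invented for illustration) of these two fixed basis functions, applied to a sample where a ring-shaped "signal" surrounds a Gaussian "background":

```python
import numpy as np

def transform(x):
    """Map (x1, x2) to fixed basis functions: angle and radius (no free parameters)."""
    x1, x2 = x[:, 0], x[:, 1]
    phi1 = np.arctan2(x2, x1)          # phi_1 = tan^-1(x2 / x1)
    phi2 = np.sqrt(x1**2 + x2**2)      # phi_2 = sqrt(x1^2 + x2^2)
    return np.column_stack([phi1, phi2])

# Toy example: a signal ring around a background blob is not linearly
# separable in (x1, x2), but a simple cut on phi_2 (the radius) separates them.
rng = np.random.default_rng(0)
background = rng.normal(0.0, 0.5, size=(1000, 2))
angles = rng.uniform(0, 2 * np.pi, 1000)
signal = np.column_stack([2 * np.cos(angles), 2 * np.sin(angles)]) + rng.normal(0, 0.2, (1000, 2))
print(transform(signal)[:, 1].mean(), transform(background)[:, 1].mean())
```

In the transformed space a linear boundary (here essentially a cut on $\varphi_2$) does the job, which no linear boundary in $(x_1, x_2)$ could.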

Neural networks. Neural networks originate from attempts to model neural processes (McCulloch and Pitts, 1943; Rosenblatt, 1962). They are widely used in many fields, and for many years they were the only advanced multivariate method popular in HEP. We can view a neural network as a specific way of parametrizing the basis functions used to define the feature-space transformation. The training data are then used to adjust the parameters so that the resulting discriminant function has the best performance.

The single layer perceptron. Define the discriminant using $y(\vec{x}) = h\!\left(w_0 + \sum_{i=1}^{n} w_i x_i\right)$, where $h$ is a nonlinear, monotonic activation function; we can use e.g. the logistic sigmoid, $h(x) = (1 + e^{-x})^{-1}$. If the activation function is monotonic, the resulting $y(\vec{x})$ is equivalent to the original linear discriminant. This is an example of a generalized linear model called the single layer perceptron. [Figure: network diagram with an input layer connected directly to a single output node $y(\vec{x})$.]
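
A minimal numerical sketch of this discriminant (the weights below are arbitrary placeholders, not values from the lecture):

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation h(z) = (1 + exp(-z))^-1."""
    return 1.0 / (1.0 + np.exp(-z))

def single_layer_perceptron(x, w0, w):
    """Discriminant y(x) = h(w0 + sum_i w_i x_i) for one event x (1-d array)."""
    return sigmoid(w0 + np.dot(w, x))

# Example call with placeholder weights:
y = single_layer_perceptron(np.array([0.5, -1.2]), w0=0.1, w=np.array([2.0, 0.3]))
```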

The multilayer perceptron. Now use this idea to define not only the output $y(\vec{x})$, but also a set of transformed inputs $\varphi_1(\vec{x}), \ldots, \varphi_m(\vec{x})$ that form a hidden layer (the superscript on the weights indicates the layer number): $\varphi_i(\vec{x}) = h\!\left(w_{i0}^{(1)} + \sum_{j=1}^{n} w_{ij}^{(1)} x_j\right)$, $y(\vec{x}) = h\!\left(w_{10}^{(2)} + \sum_{j=1}^{m} w_{1j}^{(2)} \varphi_j(\vec{x})\right)$. [Figure: network diagram showing the inputs, the hidden layer $\varphi_i$, and the output.] This is the multilayer perceptron, our basic neural network model; it is straightforward to generalize to multiple hidden layers.
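
The same structure written out as a forward pass, with the offsets $w^{(1)}_{i0}$, $w^{(2)}_{10}$ pulled out as explicit bias terms (a sketch; the variable names and shapes are my own choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W1, b1, w2, b2):
    """One hidden layer:
    phi_i(x) = h(b1_i + sum_j W1_ij x_j),  y(x) = h(b2 + sum_j w2_j phi_j(x))."""
    phi = sigmoid(b1 + W1 @ x)   # hidden-layer outputs (layer-1 weights)
    y = sigmoid(b2 + w2 @ phi)   # single output node (layer-2 weights)
    return y, phi

# n inputs, m hidden nodes; random weights just to exercise the function.
n, m = 3, 5
rng = np.random.default_rng(1)
y, phi = mlp_forward(rng.normal(size=n), rng.normal(size=(m, n)),
                     rng.normal(size=m), rng.normal(size=m), 0.0)
```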

Network architecture: one hidden layer. Theorem: an MLP with a single hidden layer having a sufficiently large number of nodes can approximate arbitrarily well the Bayes optimal decision boundary. This holds for any continuous, non-polynomial activation function (Leshno, Lin, Pinkus and Schocken (1993), Neural Networks 6, 861–867). In practice one often chooses a single hidden layer and tries increasing the number of nodes until no further improvement in performance is found.

More than one hidden layer. "Relatively little is known concerning the advantages and disadvantages of using a single hidden layer with many units (neurons) over many hidden layers with fewer units. The mathematics and approximation theory of the MLP model with more than one hidden layer is not well understood. Nonetheless there seems to be reason to conjecture that the two hidden layer model may be significantly more promising than the single hidden layer model, ..." A. Pinkus, Approximation theory of the MLP model in neural networks, Acta Numerica (1999), pp. 143–195.

Network training. The type of each training event is known, i.e., for event $a$ we have $\vec{x}_a = (x_1, \ldots, x_n)$, the input variables, and $t_a = 0, 1$, a numerical label for the event type (the "target value"). Let $\vec{w}$ denote the set of all of the weights of the network. We can determine their optimal values by minimizing a sum-of-squares error function, $E(\vec{w}) = \frac{1}{2} \sum_{a=1}^{N} \left| y(\vec{x}_a, \vec{w}) - t_a \right|^2 = \sum_{a=1}^{N} E_a(\vec{w})$, where $E_a(\vec{w})$ is the contribution to the error function from each event.
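
As a simple sketch (names are mine), the error function and its per-event contributions can be computed as:

```python
import numpy as np

def sum_of_squares_error(y_pred, t):
    """E(w) = 1/2 sum_a (y(x_a, w) - t_a)^2, with one term E_a(w) per event.
    y_pred: network outputs for all training events; t: target labels (0 or 1).
    Returns the total error and the array of per-event contributions E_a."""
    per_event = 0.5 * (np.asarray(y_pred) - np.asarray(t)) ** 2
    return per_event.sum(), per_event
```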

Numerical minimization of E(w). Consider the gradient descent method: from an initial guess in weight space $\vec{w}^{(1)}$, take a small step in the direction of maximum decrease, i.e., for the step from $\tau$ to $\tau + 1$, $\vec{w}^{(\tau+1)} = \vec{w}^{(\tau)} - \eta \nabla E(\vec{w}^{(\tau)})$, where $\eta > 0$ is the learning rate. If we do this with the full error function $E(\vec{w})$, gradient descent does surprisingly poorly; it is better to use conjugate gradients. But gradient descent turns out to be useful with an online (sequential) method, i.e., where we update $\vec{w}$ for each training event $a$, cycling through all training events: $\vec{w}^{(\tau+1)} = \vec{w}^{(\tau)} - \eta \nabla E_a(\vec{w}^{(\tau)})$.
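
A sketch of the online (sequential) update; `grad_Ea` is a placeholder for whatever routine returns $\nabla E_a(\vec{w})$, e.g. the backpropagation step sketched after the next slide:

```python
def online_gradient_descent(w, grad_Ea, events, targets, eta=0.01, n_epochs=10):
    """Sequential gradient descent: w <- w - eta * grad E_a(w), one update per
    training event a, cycling repeatedly through all training events.
    w should support array arithmetic (e.g. a numpy array of all weights)."""
    for _ in range(n_epochs):
        for x_a, t_a in zip(events, targets):
            w = w - eta * grad_Ea(w, x_a, t_a)   # eta > 0 is the learning rate
    return w
```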

Error backpropagation. Error backpropagation ("backprop") is an algorithm for finding the derivatives required for gradient descent minimization. The network output can be written $y(\vec{x}) = h(u(\vec{x}))$, where $u(\vec{x}) = \sum_{j=0}^{m} w_{1j}^{(2)} \varphi_j(\vec{x})$ and $\varphi_j(\vec{x}) = h\!\left(\sum_{k=0}^{n} w_{jk}^{(1)} x_k\right)$, where we defined $\varphi_0 = x_0 = 1$ and wrote the sums over the nodes in the preceding layers starting from 0 to include the offsets. So, e.g., for event $a$ we have $\partial E_a / \partial w_{1j}^{(2)} = (y_a - t_a)\, h'(u(\vec{x}_a))\, \varphi_j(\vec{x}_a)$, where $h'$ is the derivative of the activation function. The chain rule gives all the needed derivatives.
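
A sketch of these derivatives for the one-hidden-layer network above, using the logistic sigmoid (so $h'(u) = h(u)(1 - h(u))$) and the convention $x_0 = \varphi_0 = 1$ from the slide; the array shapes and names are my own:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_single_event(x, t, W1, w2):
    """Gradients of E_a = 1/2 (y - t)^2 for the one-hidden-layer MLP.
    x includes the offset component x[0] = 1; likewise phi[0] = 1.
    W1: (m+1, n+1) layer-1 weights; w2: (m+1,) layer-2 weights."""
    hprime = lambda a: sigmoid(a) * (1.0 - sigmoid(a))   # h'(a) for the logistic sigmoid

    u1 = W1 @ x                    # hidden-node pre-activations
    phi = sigmoid(u1)
    phi[0] = 1.0                   # phi_0 = 1 carries the output-node offset
    u = w2 @ phi                   # u(x) = sum_j w2_j phi_j(x)
    y = sigmoid(u)                 # network output y(x) = h(u(x))

    delta_out = (y - t) * hprime(u)              # common factor (y_a - t_a) h'(u)
    grad_w2 = delta_out * phi                    # dE_a/dw2_j = (y_a - t_a) h'(u) phi_j
    delta_hidden = delta_out * w2 * hprime(u1)   # chain rule back through the hidden layer
    delta_hidden[0] = 0.0                        # phi_0 is constant, no gradient
    grad_W1 = np.outer(delta_hidden, x)          # dE_a/dW1_jk = delta_j x_k
    return grad_W1, grad_w2
```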

Overtraining. If the network has too many nodes, after training it will tend to conform too closely to the training data ("overtraining"). The classification error rate on the training sample may be very low, but it would be much higher on an independent data sample. Therefore it is important to evaluate the error rate with a statistically independent validation sample.

Monitoring overtraining. If we monitor the value of the error function $E(\vec{w})$ at every cycle of the minimization, for the training sample it will continue to decrease. But for the validation sample it may initially decrease and then at some point increase, indicating overtraining. [Figure: error versus training cycle for the training and validation samples.]
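
A sketch of such monitoring with early stopping; `train_step` and `error` are placeholders for one cycle of the minimizer and for the evaluation of E(w) on a given sample:

```python
def train_with_monitoring(train_step, error, w, train_sample, val_sample,
                          max_cycles=1000, patience=10):
    """Monitor E(w) on the training and validation samples at every cycle and
    stop once the validation error has stopped improving (early stopping).
    train_step(w) -> updated w;  error(w, sample) -> value of E(w) on sample."""
    best_val, best_w, cycles_since_best = float("inf"), w, 0
    for cycle in range(max_cycles):
        w = train_step(w)
        e_train = error(w, train_sample)    # keeps decreasing
        e_val = error(w, val_sample)        # may turn around: overtraining
        if e_val < best_val:
            best_val, best_w, cycles_since_best = e_val, w, 0
        else:
            cycles_since_best += 1
            if cycles_since_best >= patience:
                break
        print(f"cycle {cycle}: E_train={e_train:.4f}  E_val={e_val:.4f}")
    return best_w
```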

Validation and testing. The validation sample can be used to make various choices about the network architecture, e.g., to adjust the number of hidden nodes so as to obtain good generalization performance (the ability to correctly classify unseen data). If the validation stage is iterated many times, the estimated error rate based on the validation sample has a bias, so strictly speaking one should finally estimate the error rate with an independent test sample. Rule of thumb, if data are not too expensive (Narsky): train : validate : test = 50 : 25 : 25. But this depends on the type of classifier. Often the bias in the error rate from the validation sample is small and one can omit the test step.

Bias-variance trade-off. For a finite amount of training data, an increasing number of network parameters (layers, nodes) means that the estimates of these parameters have increasingly large statistical errors (variance, overtraining). Having too few parameters doesn't allow the network to exploit the existing nonlinearities, i.e., it has a bias. [Figure: three fitted decision boundaries illustrating high variance, high bias, and a good trade-off.]

Regularized neural networks. Often one uses the validation sample to optimize the number of hidden nodes. Alternatively one may use a relatively large number of hidden nodes but include in the error function a regularization term that penalizes overfitting, e.g., $\tilde{E}(\vec{w}) = E(\vec{w}) + \frac{\lambda}{2}\, \vec{w}^{T} \vec{w}$, where $\lambda$ is the regularization parameter. Increasing $\lambda$ gives a smoother boundary (higher bias, lower variance). This is known as weight decay, since the weights are driven to zero unless supported by the data (an example of "parameter shrinkage").
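
A sketch of the weight-decay term and of the extra piece it adds to the gradient step (function names are placeholders):

```python
import numpy as np

def regularized_error(E, w, lam):
    """E_tilde(w) = E(w) + (lam / 2) * w^T w.  E is the unregularized error
    already evaluated at the current weights w; lam is the regularization parameter."""
    w = np.asarray(w, dtype=float)
    return E + 0.5 * lam * np.dot(w, w)

def weight_decay_update(w, grad_E, eta, lam):
    """Gradient step on E_tilde: the extra term lam * w shrinks each weight
    toward zero unless the data (through grad_E) resist it."""
    return w - eta * (grad_E + lam * w)
```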

Probability Density Estimation (PDE). Construct non-parametric estimators for the pdfs of the data $\vec{x}$ for the two event classes, $p(\vec{x}|H_0)$ and $p(\vec{x}|H_1)$, and use these to construct the likelihood ratio, which we use for the discriminant function: $y(\vec{x}) = p(\vec{x}|H_0) / p(\vec{x}|H_1)$. An $n$-dimensional histogram is a brute-force example of this; we will see a number of ways that are much better.
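
The brute-force version mentioned on the slide, as a sketch: estimate both pdfs with n-dimensional histograms on the training samples and take their ratio (this degrades rapidly as the dimension grows, which is exactly why better methods are needed; the names here are illustrative):

```python
import numpy as np

def histogram_density_ratio(train0, train1, bins=10):
    """Estimate p(x|H0) and p(x|H1) with n-dimensional histograms and return
    a function computing the likelihood ratio y(x) = p(x|H0) / p(x|H1)."""
    both = np.vstack([train0, train1])
    # Common bin edges over the range of both samples, per dimension
    edges = [np.linspace(both[:, d].min(), both[:, d].max(), bins + 1)
             for d in range(both.shape[1])]
    p0, _ = np.histogramdd(train0, bins=edges, density=True)
    p1, _ = np.histogramdd(train1, bins=edges, density=True)

    def y(x):
        idx = tuple(np.clip(np.searchsorted(e, xi) - 1, 0, bins - 1)
                    for e, xi in zip(edges, x))
        return p0[idx] / p1[idx] if p1[idx] > 0 else np.inf
    return y
```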

Correlation vs. independence. In general a multivariate distribution $p(\vec{x})$ does not factorize into a product of the marginal distributions for the individual variables: $p(\vec{x}) = \prod_{i=1}^{n} p_i(x_i)$ holds only if the components of $\vec{x}$ are independent. Most importantly, the components of $\vec{x}$ will generally have nonzero covariances (i.e. they are correlated): $V_{ij} = \mathrm{cov}[x_i, x_j] = E[x_i x_j] - E[x_i] E[x_j] \neq 0$.

Decorrelation of input variables. But we can define a set of uncorrelated input variables by a linear transformation, i.e., find the matrix $A$ such that for $\vec{y} = A\vec{x}$ the covariances $\mathrm{cov}[y_i, y_j] = 0$. For the following, suppose that the variables are decorrelated in this way for each of $p(\vec{x}|H_0)$ and $p(\vec{x}|H_1)$ separately (since in general their correlations are different).
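
One way to construct such a matrix A is to diagonalize the sample covariance matrix (a sketch, assuming a non-degenerate covariance matrix; this particular choice also scales the variances to one):

```python
import numpy as np

def decorrelation_matrix(sample):
    """Find A such that y = A x has cov[y_i, y_j] = 0 for this sample.
    Diagonalize the sample covariance V = U D U^T and take A = D^{-1/2} U^T
    (assumes all eigenvalues of V are strictly positive)."""
    V = np.cov(sample, rowvar=False)        # sample covariance matrix V_ij
    eigvals, U = np.linalg.eigh(V)          # V = U diag(eigvals) U^T
    A = np.diag(1.0 / np.sqrt(eigvals)) @ U.T
    return A

# Decorrelate each class separately, since their correlations differ in general:
# A0 = decorrelation_matrix(sample_H0);  y0 = sample_H0 @ A0.T
```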

Decorrelation is not enough. But even with zero correlation, a multivariate pdf $p(\vec{x})$ will in general have nonlinearities, and thus the decorrelated variables are still not independent. [Figure: a pdf in the $(x_1, x_2)$ plane with zero covariance but components that are still not independent, since clearly $p(x_2|x_1) = p(x_1, x_2)/p_1(x_1) \neq p_2(x_2)$, and therefore $p(x_1, x_2) \neq p_1(x_1)\, p_2(x_2)$.]

Naive Bayes. But if the nonlinearities are not too great, it is reasonable to first decorrelate the inputs and take as our estimator for each pdf $p(\vec{x}) = \prod_{i=1}^{n} p_i(x_i)$. So this at least reduces the problem to one of finding estimates of one-dimensional pdfs. The resulting estimated likelihood ratio gives the Naive Bayes classifier (in HEP sometimes called the "likelihood method").
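
A sketch of the resulting classifier with one-dimensional histograms as the pdf estimates (in practice one would first apply the per-class decorrelation sketched above and use smoother 1-d estimates than histograms; the names here are illustrative):

```python
import numpy as np

def fit_1d_pdfs(sample, bins=30):
    """Estimate each one-dimensional marginal pdf p_i(x_i) with a histogram."""
    pdfs = []
    for i in range(sample.shape[1]):
        counts, edges = np.histogram(sample[:, i], bins=bins, density=True)
        pdfs.append((counts, edges))
    return pdfs

def eval_product_pdf(x, pdfs):
    """p_hat(x) = prod_i p_i(x_i), the naive Bayes factorization."""
    p = 1.0
    for xi, (counts, edges) in zip(x, pdfs):
        idx = np.clip(np.searchsorted(edges, xi) - 1, 0, len(counts) - 1)
        p *= counts[idx]
    return p

def naive_bayes_discriminant(x, pdfs_H0, pdfs_H1):
    """Estimated likelihood ratio y(x) = p_hat(x|H0) / p_hat(x|H1)."""
    p0, p1 = eval_product_pdf(x, pdfs_H0), eval_product_pdf(x, pdfs_H1)
    return p0 / p1 if p1 > 0 else np.inf
```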

Test example with TMVA.

Test example: x vs. y with cuts on z. [Figure: four scatter plots of x vs. y, with no cut on z, z < 0.75, z < 0.5, and z < 0.25.]

Test example results. [Figure: results for the Fisher discriminant, the multilayer perceptron, Naive Bayes without decorrelation, and Naive Bayes with decorrelation.]

Test example ROC curves. [Figure: ROC curves from the TMVA macro efficiencies.c.]

Efficiencies versus cut value. Select signal by cutting on the output: $y > y_{\mathrm{cut}}$. [Figures: efficiencies versus $y_{\mathrm{cut}}$ for the Fisher discriminant, the multilayer perceptron, Naive Bayes without decorrelation, and Naive Bayes with decorrelation; produced with the TMVA macro mvaeffs.c.]

Lecture 2 summary. We have generalized the classifiers to allow nonlinear decision boundaries. In neural networks, the user chooses a certain number of hidden layers and nodes; having more allows for an increasingly accurate approximation to the optimal decision boundary. But having more parameters means that their estimates, given a finite amount of training data, are increasingly subject to statistical fluctuations, which shows up as overtraining. The naive Bayes method seeks to approximate the joint pdfs of the classes as the product of one-dimensional marginal pdfs (after decorrelation). To pursue this further we should therefore refine our approximations of the 1-d pdfs.