Advanced statistical methods for data analysis Lecture 1

Advanced statistical methods for data analysis, Lecture 1. RHUL Physics, www.pp.rhul.ac.uk/~cowan. Klausurtagung des GK "Eichtheorien - exp. Tests" (retreat of the research training group "Gauge Theories - Experimental Tests", Universität Mainz), Bullay/Mosel, 15-17 September 2008

Outline: Multivariate methods in particle physics. Some general considerations. Brief review of statistical formalism. Multivariate classifiers: linear discriminant function, neural networks, naive Bayes classifier, kernel-based methods, k-nearest neighbour, decision trees, support vector machines.

Resources on multivariate methods. Books: C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006; T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2001; R. Duda, P. Hart, D. Stork, Pattern Classification, 2nd ed., Wiley, 2001; A. Webb, Statistical Pattern Recognition, 2nd ed., Wiley, 2002. Materials from some recent meetings: the PHYSTAT conference series (2002, 2003, 2005, 2007, ...), see www.phystat.org; the Caltech workshop on multivariate analysis, 11 February 2008, indico.cern.ch/conferencedisplay.py?confid=27385; the SLAC lectures on machine learning by Ilya Narsky (2006), www-group.slac.stanford.edu/sluo/lectures/stat2006_lectures.html

Software. TMVA (Höcker, Stelzer, Tegenfeldt, Voss, Voss, physics/0703039): available from tmva.sourceforge.net and also distributed with ROOT; offers a variety of classifiers and a good manual. StatPatternRecognition (I. Narsky, physics/0507143): further info at www.hep.caltech.edu/~narsky/spr.html; also a wide variety of methods, many complementary to TMVA, although the project currently appears to be no longer supported.

Data analysis at the LHC. The LHC experiments are expensive, of order $10^{10}$ dollars for the accelerator and experiments; the competition is intense (ATLAS vs. CMS, and both vs. the Tevatron); and the stakes are high: the difference between a 4-sigma and a 5-sigma effect can be the difference between a hint and a discovery. So there is a strong motivation to extract all possible information from the data.

The Large Hadron Collider. Counter-rotating proton beams in a 27 km circumference ring, with a pp centre-of-mass energy of 14 TeV. Detectors at 4 pp collision points: ATLAS and CMS (general purpose), LHCb (b physics), ALICE (heavy-ion physics).

The ATLAS detector: 2100 physicists, 37 countries, 167 universities/labs; 25 m diameter, 46 m length, 7000 tonnes; $\sim 10^8$ electronic channels.

A simulated SUSY event in ATLAS, showing high-$p_T$ jets of hadrons, high-$p_T$ muons, and missing transverse energy from the pp collision. [event display]

Background events. This event, from Standard Model $t\bar{t}$ production, also has high-$p_T$ jets and muons, and some missing transverse energy; it can easily mimic a SUSY event.

Statement of the problem. Suppose for each event we measure a set of numbers $\mathbf{x} = (x_1, \ldots, x_n)$, e.g., $x_1$ = jet $p_T$, $x_2$ = missing energy, $x_3$ = a particle-i.d. measure, ... The vector $\mathbf{x}$ follows some $n$-dimensional joint probability density $p(\mathbf{x}|H)$, which depends on the type of event produced, i.e., was it $pp \to t\bar{t}$, $pp \to \tilde{g}\tilde{g}$, ... The hypotheses (class labels) are $H_0, H_1, \ldots$, often simply "signal" and "background". [contours of $p(\mathbf{x}|H_0)$ and $p(\mathbf{x}|H_1)$ in the $(x_i, x_j)$ plane] We want to separate (classify) the event types in a way that exploits the information carried in many variables.
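As an illustration of this setup (not part of the original lecture), here is a minimal Python sketch that generates toy "signal" and "background" events whose joint density depends on the event type; the Gaussian densities and all numerical values are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_toy_events(n_events, mean, cov):
    """Draw n_events of x = (x1, x2) from a toy 2-dimensional joint density."""
    return rng.multivariate_normal(mean, cov, size=n_events)

# Hypothetical p(x|H1) (signal) and p(x|H0) (background): same form, different parameters.
x_sig = make_toy_events(10000, mean=[1.0, 1.0], cov=[[1.0, 0.3], [0.3, 1.0]])
x_bkg = make_toy_events(10000, mean=[0.0, 0.0], cov=[[1.0, -0.2], [-0.2, 1.0]])

print(x_sig.shape, x_bkg.shape)  # (10000, 2) each: one row per event, one column per variable
```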

Finding an optimal decision boundary. Maybe select events with cuts: $x_i < c_i$ and $x_j < c_j$. Or maybe use some other type of decision boundary, e.g., a linear or nonlinear surface separating the $H_0$ and $H_1$ regions. [sketches of cut-based, linear, and nonlinear boundaries] The goal of multivariate analysis is to do this in an optimal way.

General considerations. In all multivariate analyses we must consider, e.g.: the choice of variables to use; the functional form of the decision boundary (type of classifier); computational issues; the trade-off between sensitivity and complexity; the trade-off between statistical and systematic uncertainty. Our choices can depend on the goals of the analysis, e.g., event selection for further study, or searches for new event types.

Probability, quick review. Frequentist ($A$ = outcome of a repeatable observation): $P(A)$ = limiting relative frequency of $A$ in repeated observations. Subjective ($A$ = hypothesis): $P(A)$ = degree of belief that $A$ is true. Conditional probability: $P(A|B) = P(A \cap B)/P(B)$. Bayes' theorem: $P(A|B) = P(B|A)P(A)/P(B)$.

Test statistics. The decision boundary is a surface in the $n$-dimensional space of input variables, e.g., $y(\mathbf{x}) = \text{const}$. We can treat $y(\mathbf{x})$ as a scalar test statistic or discriminating function, and try to define this function so that its distributions $f(y|H_0)$ and $f(y|H_1)$ have the maximum possible separation between the event types. The decision boundary is then effectively a single cut $y_{\rm cut}$ on $y(\mathbf{x})$, dividing $\mathbf{x}$-space into two regions: $R_0$ (accept $H_0$) and $R_1$ (reject $H_0$). [plot of $f(y|H_0)$ and $f(y|H_1)$ vs. $y$, with $y_{\rm cut}$ marked]

Classification viewed as a statistical test. Probability to reject $H_0$ if it is true (type I error): $\alpha = \int_{R_1} f(y|H_0)\,dy$ = significance level (size) of the test, i.e., the false positive rate. Probability to accept $H_0$ if $H_1$ is true (type II error): $\beta = \int_{R_0} f(y|H_1)\,dy$; $1 - \beta$ = power of the test with respect to $H_1$. Equivalently, if e.g. $H_0$ = background and $H_1$ = signal, use efficiencies: $\varepsilon_s = \int_{R_1} f(y|H_1)\,dy = 1 - \beta$ = power, and $\varepsilon_b = \int_{R_1} f(y|H_0)\,dy = \alpha$.

Purity / misclassification rate. Consider the probability that an event assigned to a certain category is classified correctly (i.e., the purity). Using Bayes' theorem, with $R_1$ the signal region and $\pi_s$, $\pi_b$ the prior probabilities, the posterior probability is $P(s \mid y \in R_1) = \dfrac{\pi_s \int_{R_1} f(y|s)\,dy}{\pi_s \int_{R_1} f(y|s)\,dy + \pi_b \int_{R_1} f(y|b)\,dy} = \dfrac{\varepsilon_s \pi_s}{\varepsilon_s \pi_s + \varepsilon_b \pi_b}$. N.B. the purity depends on the prior probabilities for an event to be signal or background (i.e., on the s, b cross sections).
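A worked numerical example of this formula (not from the lecture; the efficiencies and prior are invented for illustration) shows how a rare signal can give a low-purity sample even after a tight selection:

```python
# Purity of the selected sample via Bayes' theorem:
# P(s | selected) = eps_s * pi_s / (eps_s * pi_s + eps_b * pi_b)
eps_s = 0.80   # assumed signal efficiency of the selection
eps_b = 0.01   # assumed background efficiency (false positive rate)
pi_s = 1e-3    # assumed prior probability for an event to be signal
pi_b = 1.0 - pi_s

purity = eps_s * pi_s / (eps_s * pi_s + eps_b * pi_b)
print(f"purity = {purity:.3f}")  # ~0.074: most selected events are still background
```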

ROC curve. We can characterize the quality of a classification procedure with the receiver operating characteristic (ROC) curve: background rejection (= 1 - background efficiency) plotted versus signal efficiency, with curves that bow towards the top right being better. [ROC plot] The ROC curve is independent of the prior probabilities. The area under the ROC curve can be used as a measure of quality (but usually one is interested in a specific or narrow range of efficiencies).
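A minimal sketch (not from the lecture) of how a ROC curve and its area could be computed from classifier outputs for simulated signal and background events; the Gaussian toy scores are an assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy classifier outputs y(x) for simulated signal and background events.
y_sig = rng.normal(1.0, 1.0, size=100000)
y_bkg = rng.normal(-1.0, 1.0, size=100000)

# Sweep the cut y > y_cut and record the efficiencies on each sample.
cuts = np.linspace(-5.0, 5.0, 201)
eps_s = np.array([(y_sig > c).mean() for c in cuts])  # signal efficiency
eps_b = np.array([(y_bkg > c).mean() for c in cuts])  # background efficiency
rejection = 1.0 - eps_b   # background rejection: the quantity plotted on the ROC's vertical axis

# Area under the ROC curve (signal efficiency vs. background efficiency).
order = np.argsort(eps_b)
auc = np.trapz(eps_s[order], eps_b[order])
print(f"area under ROC curve = {auc:.3f}")
```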

Constructing a test statistic. The Neyman-Pearson lemma states: to obtain the highest background rejection for a given signal efficiency (highest power for a given significance level), choose the acceptance region for signal such that $\dfrac{p(\mathbf{x}|s)}{p(\mathbf{x}|b)} > c$, where $c$ is a constant that determines the signal efficiency. Equivalently, the optimal discriminating function is given by the likelihood ratio $y(\mathbf{x}) = \dfrac{p(\mathbf{x}|s)}{p(\mathbf{x}|b)}$. N.B. any monotonic function of this is just as good.

Bayes optimal analysis. From Bayes' theorem we can compute the posterior odds: $\dfrac{p(s|\mathbf{x})}{p(b|\mathbf{x})} = \dfrac{p(\mathbf{x}|s)}{p(\mathbf{x}|b)} \cdot \dfrac{p(s)}{p(b)}$, i.e., posterior odds = likelihood ratio × prior odds, so the posterior odds are proportional to the likelihood ratio. Placing a cut on the likelihood ratio is therefore equivalent to requiring a minimum posterior odds ratio for the selected sample.

Purity vs. efficiency trade-off. The actual choice of signal efficiency (and thus purity) will depend on the goal of the analysis, e.g.: trigger selection (high efficiency); an event sample used for a precision measurement (high purity); measurement of a signal cross section: maximize $s/\sqrt{s+b}$; discovery of a signal: maximize the expected significance $\sim s/\sqrt{b}$.
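As an illustration (not from the lecture), a small sketch that scans a cut on a toy discriminant and finds the cut values maximizing the two figures of merit; the score distributions and expected yields are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy discriminant distributions and expected event yields before any cut (assumed values).
y_sig = rng.normal(1.0, 1.0, size=200000)
y_bkg = rng.normal(-1.0, 1.0, size=200000)
S_tot, B_tot = 100.0, 10000.0

cuts = np.linspace(-2.0, 4.0, 121)
best_cut = {}
for name, fom in [("s/sqrt(s+b)", lambda s, b: s / np.sqrt(s + b)),
                  ("s/sqrt(b)",   lambda s, b: s / np.sqrt(b))]:
    values = []
    for c in cuts:
        s = S_tot * (y_sig > c).mean()   # expected signal passing the cut
        b = B_tot * (y_bkg > c).mean()   # expected background passing the cut
        values.append(fom(s, b) if b > 0 else 0.0)
    best_cut[name] = cuts[int(np.argmax(values))]

print(best_cut)  # the two figures of merit generally prefer different working points
```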

Neyman-Pearson doesn't always help. The problem is that we usually don't have explicit formulae for the pdfs $p(\mathbf{x}|s)$ and $p(\mathbf{x}|b)$, so for a given $\mathbf{x}$ we can't evaluate the likelihood ratio. Instead we may have Monte Carlo models for the signal and background processes, from which we can produce simulated training data of known type: generate $\mathbf{x} \sim p(\mathbf{x}|s)$ to obtain $\mathbf{x}_1, \ldots, \mathbf{x}_{N_s}$, and $\mathbf{x} \sim p(\mathbf{x}|b)$ to obtain $\mathbf{x}_1, \ldots, \mathbf{x}_{N_b}$. Naive try: enter each signal and background event into an $n$-dimensional histogram, using e.g. $M$ bins for each of the $n$ dimensions, for a total of $M^n$ cells. Since $n$ is potentially large, this gives a prohibitively large number of cells to populate (e.g., $M = 10$ bins and $n = 20$ variables already give $10^{20}$ cells), and we cannot generate enough training data.

Strategies for event classification. A compromise solution is to make an ansatz for the form of the test statistic $y(\mathbf{x})$ with fewer parameters, and determine them (using MC) to give the best discrimination between signal and background. Alternatively, try to estimate the probability densities $p(\mathbf{x}|s)$ and $p(\mathbf{x}|b)$ and use the estimated pdfs to compute the likelihood ratio at the desired $\mathbf{x}$ values (for the real data).

Linear test statistic. Ansatz: $y(\mathbf{x}) = \sum_{i=1}^{n} w_i x_i = \mathbf{w}^T \mathbf{x}$. Choose the parameters $w_1, \ldots, w_n$ so that the pdfs $f(y|s)$ and $f(y|b)$ have maximum separation: we want a large distance between the mean values $\tau_s$ and $\tau_b$, and small widths $\Sigma_s$ and $\Sigma_b$. [sketch of $f(y|s)$ and $f(y|b)$] Fisher: maximize $J(\mathbf{w}) = \dfrac{(\tau_s - \tau_b)^2}{\Sigma_s^2 + \Sigma_b^2}$.

Coefficients for maximum separation. We have the mean and covariance of $\mathbf{x}$ under each hypothesis, $(\mu_k)_i = \int x_i\, p(\mathbf{x}|H_k)\, d\mathbf{x}$ and $(V_k)_{ij} = \int (x - \mu_k)_i (x - \mu_k)_j\, p(\mathbf{x}|H_k)\, d\mathbf{x}$, where $k = 0, 1$ labels the hypothesis and $i, j = 1, \ldots, n$ label the components of $\mathbf{x}$. For the mean and variance of $y(\mathbf{x})$ we find $\tau_k = \int y(\mathbf{x})\, p(\mathbf{x}|H_k)\, d\mathbf{x} = \mathbf{w}^T \mu_k$ and $\Sigma_k^2 = \int (y(\mathbf{x}) - \tau_k)^2\, p(\mathbf{x}|H_k)\, d\mathbf{x} = \mathbf{w}^T V_k \mathbf{w}$.

Determining the coefficients w. The numerator of $J(\mathbf{w})$ is $(\tau_0 - \tau_1)^2 = \sum_{i,j=1}^{n} w_i w_j (\mu_0 - \mu_1)_i (\mu_0 - \mu_1)_j = \sum_{i,j=1}^{n} w_i w_j B_{ij} = \mathbf{w}^T B\, \mathbf{w}$, where $B$ is the "between classes" matrix, and the denominator is $\Sigma_0^2 + \Sigma_1^2 = \sum_{i,j=1}^{n} w_i w_j (V_0 + V_1)_{ij} = \mathbf{w}^T W \mathbf{w}$, where $W$ is the "within classes" matrix. We therefore maximize $J(\mathbf{w}) = \dfrac{\mathbf{w}^T B\, \mathbf{w}}{\mathbf{w}^T W \mathbf{w}} = \dfrac{\text{separation between classes}}{\text{separation within classes}}$.

Fisher discriminant function. Setting $\partial J / \partial w_i = 0$ gives Fisher's linear discriminant function: $y(\mathbf{x}) = \mathbf{w}^T \mathbf{x}$ with $\mathbf{w} \propto W^{-1}(\mu_0 - \mu_1)$. This gives a linear decision boundary. Projecting the points onto the direction $\mathbf{w}$ (normal to the decision boundary) gives the maximum separation between the $H_0$ and $H_1$ samples. [scatter plot with linear decision boundary]
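A minimal numerical sketch (not from the lecture) of the Fisher coefficients estimated from two Monte Carlo training samples; the toy Gaussian samples are assumptions, and no particular choice of the offset $w_0$ is made:

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_weights(x0, x1):
    """Fisher direction w proportional to W^{-1} (mu0 - mu1), estimated from training samples.

    x0, x1: arrays of shape (N_events, n_variables) for hypotheses H0 and H1.
    """
    mu0, mu1 = x0.mean(axis=0), x1.mean(axis=0)
    W = np.cov(x0, rowvar=False) + np.cov(x1, rowvar=False)  # within-classes matrix V0 + V1
    return np.linalg.solve(W, mu0 - mu1)

# Toy Monte Carlo training samples (assumed Gaussian, for illustration only).
x_bkg = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=5000)  # H0 = background
x_sig = rng.multivariate_normal([1.0, 1.5], [[1.0, 0.3], [0.3, 1.0]], size=5000)  # H1 = signal

w = fisher_weights(x_bkg, x_sig)
y_bkg, y_sig = x_bkg @ w, x_sig @ w   # projected discriminant values
print("w =", w)
print("mean y: bkg %.2f, sig %.2f" % (y_bkg.mean(), y_sig.mean()))
```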

Comment on least squares. We obtain equivalent separation between the classes if we multiply the $w_i$ by a common scale factor and add an offset $w_0$: $y(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i x_i$. Thus we can fix the mean values $\tau_0$ and $\tau_1$ for the two classes to arbitrary values, e.g., 0 and 1. Then maximizing $J(\mathbf{w}) = (\tau_0 - \tau_1)^2 / (\Sigma_0^2 + \Sigma_1^2)$ means minimizing $\Sigma_0^2 + \Sigma_1^2 = E_0[(y - \tau_0)^2] + E_1[(y - \tau_1)^2]$, so maximizing Fisher's $J(\mathbf{w})$ is equivalent to a least-squares minimization. In practice the expectation values are estimated with averages of the training (e.g. MC) data: $E_k[(y - \tau_k)^2] \approx \frac{1}{N_k} \sum_{i=1}^{N_k} (y_i - \tau_k)^2$.

Fisher discriminant for Gaussian data. Suppose $f(\mathbf{x}|H_k)$ is a multivariate Gaussian with mean values $E_0[\mathbf{x}] = \mu_0$ for $H_0$ and $E_1[\mathbf{x}] = \mu_1$ for $H_1$, and with the same covariance matrix $V_0 = V_1 = V$ for both. Then Fisher's discriminant function (with an offset) can be written as $y(\mathbf{x}) = w_0 + (\mu_0 - \mu_1)^T V^{-1} \mathbf{x}$. The likelihood ratio is thus $\dfrac{p(\mathbf{x}|H_0)}{p(\mathbf{x}|H_1)} = \exp\left[ -\tfrac{1}{2}(\mathbf{x} - \mu_0)^T V^{-1} (\mathbf{x} - \mu_0) + \tfrac{1}{2}(\mathbf{x} - \mu_1)^T V^{-1} (\mathbf{x} - \mu_1) \right] = e^{\,y(\mathbf{x})}$, where the last equality holds for the offset $w_0 = \tfrac{1}{2}(\mu_1^T V^{-1} \mu_1 - \mu_0^T V^{-1} \mu_0)$.

Fisher for Gaussian data (2). That is, $y(\mathbf{x})$ is a monotonic function of the likelihood ratio, so in this case the Fisher discriminant is equivalent to using the likelihood ratio and is therefore optimal. For non-Gaussian data this no longer holds, but the linear discriminant function may still be the simplest practical solution. One often tries to transform the data so that they better approximate a Gaussian before constructing the Fisher discriminant.

Fisher and Gaussian data (3). Multivariate Gaussian data with equal covariance matrices also give a simple expression for the posterior probabilities, e.g., $P(H_0|\mathbf{x}) = \dfrac{p(\mathbf{x}|H_0) P(H_0)}{p(\mathbf{x}|H_0) P(H_0) + p(\mathbf{x}|H_1) P(H_1)}$. For Gaussian $\mathbf{x}$ and a particular choice of the offset $w_0$ this becomes $P(H_0|\mathbf{x}) = \dfrac{1}{1 + e^{-y(\mathbf{x})}} = s(y(\mathbf{x}))$, where $s(y)$ is the logistic sigmoid function. (We will use this later in connection with neural networks.)
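A quick numerical check of this statement (not from the lecture): for assumed Gaussian parameters and priors, the sigmoid of the Fisher output with a suitably chosen offset reproduces the exact posterior probability. All parameter values below are invented for illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Two Gaussian classes with equal covariance (assumed values for illustration).
mu0, mu1 = np.array([0.0, 0.0]), np.array([1.0, 1.5])
V = np.array([[1.0, 0.3], [0.3, 1.0]])
P0, P1 = 0.7, 0.3                      # prior probabilities P(H0), P(H1)
Vinv = np.linalg.inv(V)

# Fisher discriminant y(x) = w0 + w.x, with the offset that absorbs the prior odds.
w = Vinv @ (mu0 - mu1)
w0 = 0.5 * (mu1 @ Vinv @ mu1 - mu0 @ Vinv @ mu0) + np.log(P0 / P1)

def posterior_sigmoid(x):
    """P(H0|x) from the logistic sigmoid of the Fisher discriminant."""
    return 1.0 / (1.0 + np.exp(-(w0 + w @ x)))

def posterior_exact(x):
    """P(H0|x) computed directly from the Gaussian densities and the priors."""
    f0 = multivariate_normal(mu0, V).pdf(x) * P0
    f1 = multivariate_normal(mu1, V).pdf(x) * P1
    return f0 / (f0 + f1)

x = np.array([0.4, 0.8])
print(posterior_sigmoid(x), posterior_exact(x))  # the two values agree
```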

Transformation of inputs. If the data are not Gaussian with equal covariance, a linear decision boundary is not optimal. But we can try to subject the data to a transformation $\varphi_1(\mathbf{x}), \ldots, \varphi_m(\mathbf{x})$ and then treat the $\varphi_i$ as the new input variables. This is often called "feature space", and the $\varphi_i$ are "basis functions". The basis functions can be fixed, or they can contain adjustable parameters which we optimize with training data (cf. neural networks). In other cases we will see that the basis functions enter only through dot products $\varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j)$, and thus we will only need the kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$.
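A tiny sketch of the kernel idea (not from the lecture): for an explicitly constructed quadratic feature map, the dot product in feature space equals the kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2$ evaluated directly in the original space. The feature map and test points are made up.

```python
import numpy as np

def phi(x):
    """Explicit quadratic feature map for a 2-dimensional input x = (x1, x2)."""
    x1, x2 = x
    return np.array([x1 * x1, np.sqrt(2.0) * x1 * x2, x2 * x2])

def kernel(x, z):
    """Quadratic kernel K(x, z) = (x . z)^2, evaluated without building phi."""
    return np.dot(x, z) ** 2

x = np.array([0.5, -1.2])
z = np.array([2.0, 0.7])
print(np.dot(phi(x), phi(z)), kernel(x, z))  # identical values
```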

Lecture 1 summary. The optimal solution to our classification problem would be to use the likelihood ratio as the test variable, but this is unfortunately usually not possible. So far we have seen a simple linear ansatz for the test statistic, which is optimal only for Gaussian data (with equal covariance). If the data are not Gaussian, we can transform to a feature space $\varphi_1(\mathbf{x}), \ldots, \varphi_m(\mathbf{x})$; next time we will use this to obtain nonlinear test statistics such as neural networks.