StatPatternRecognition: A C++ Package for Multivariate Classification of HEP Data. Ilya Narsky, Caltech


Motivation

- Introduce advanced classification tools in a convenient C++ package for HEP researchers.
- Description of the math formalism:
  - I. Narsky, physics/0507143, StatPatternRecognition: A C++ Package for Statistical Analysis of High Energy Physics Data
  - I. Narsky, physics/0507157, Optimization of Signal Significance by Bagging Decision Trees
- Practical instructions: README included in the package.

Implemented Methods

Classifiers:
- Binary split
- Linear and quadratic discriminant analysis
- Decision trees
- Bump hunting algorithm (PRIM, Friedman & Fisher)
- AdaBoost
- Bagging and random forest algorithms
- AdaBoost and Bagger are capable of boosting/bagging any classifier implemented in the package.

Methods of general use:
- Bootstrap
- Calculation of data moments (mean, covariance, kurtosis)
- Test of zero correlation for variables with a joint elliptical distribution

Linear and Quadratic Discriminant Analysis

[Figure: separation of a bivariate Gaussian from uniform background by quadratic discriminant analysis.]

Each class density is a multivariate Gaussian:

    f_k(x) = \frac{1}{(2\pi)^{d/2} |\Sigma_k|^{1/2}} \exp\left( -\tfrac{1}{2} (x - \mu_k)^T \Sigma_k^{-1} (x - \mu_k) \right)

Look at the log-ratio of the two class probabilities:

    \log \frac{f_1(x)}{f_2(x)} = C + x^T \left( \Sigma_1^{-1}\mu_1 - \Sigma_2^{-1}\mu_2 \right) - \tfrac{1}{2}\, x^T \left( \Sigma_1^{-1} - \Sigma_2^{-1} \right) x

- Assume Σ_1 = Σ_2  =>  linear Fisher discriminant
- Don't assume Σ_1 = Σ_2  =>  quadratic Fisher discriminant
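To make the formulas concrete, here is a minimal, self-contained C++ sketch that evaluates the quadratic-discriminant log-ratio for two bivariate Gaussian class densities. It is an illustration only, not the StatPatternRecognition implementation; all type and function names (Mat2, Vec2, logGauss) and the toy parameters are hypothetical.

```cpp
// Minimal sketch of the quadratic-discriminant log-ratio log f1(x)/f2(x)
// for two bivariate Gaussian class densities. Illustration only: this is not
// the StatPatternRecognition implementation, and all names are hypothetical.
#include <cmath>
#include <cstdio>

struct Mat2 { double a, b, c, d; };   // 2x2 matrix [[a, b], [c, d]]
struct Vec2 { double x, y; };

static double det(const Mat2& m) { return m.a * m.d - m.b * m.c; }

static Mat2 inverse(const Mat2& m) {
    const double idet = 1.0 / det(m);
    return Mat2{ m.d * idet, -m.b * idet, -m.c * idet, m.a * idet };
}

// v^T M v
static double quadForm(const Vec2& v, const Mat2& m) {
    return v.x * (m.a * v.x + m.b * v.y) + v.y * (m.c * v.x + m.d * v.y);
}

// log of a bivariate Gaussian density with mean mu and covariance sigma
static double logGauss(const Vec2& x, const Vec2& mu, const Mat2& sigma) {
    const Vec2 r{ x.x - mu.x, x.y - mu.y };
    const double d = 2.0;                       // dimensionality
    const double kPi = 3.141592653589793;
    return -0.5 * quadForm(r, inverse(sigma))
           - 0.5 * std::log(det(sigma))
           - 0.5 * d * std::log(2.0 * kPi);
}

int main() {
    // Hypothetical class parameters: a narrow Gaussian vs a broad one.
    const Vec2 mu1{ 0.0, 0.0 }, mu2{ 1.0, 1.0 };
    const Mat2 sigma1{ 1.0, 0.0, 0.0, 1.0 };
    const Mat2 sigma2{ 4.0, 0.0, 0.0, 4.0 };

    const Vec2 x{ 0.5, -0.2 };
    // With sigma1 == sigma2 the quadratic terms cancel (linear Fisher);
    // with different covariances the discriminant is quadratic in x.
    const double logRatio = logGauss(x, mu1, sigma1) - logGauss(x, mu2, sigma2);
    std::printf("log f1(x)/f2(x) = %g\n", logRatio);
    return 0;
}
```

If the two covariance matrices are set equal, the quadratic term cancels and the log-ratio is linear in x, which is the linear Fisher case above.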

Decision trees in StatPatternRecognition

[Node split diagram: a parent node with (S, B) events is split into two daughter nodes (S_1, B_1) and (S_2, B_2), with S = S_1 + S_2 and B = B_1 + B_2.]

StatPatternRecognition allows the user to supply an arbitrary criterion for tree optimization by providing an implementation of the abstract C++ interface. At the moment, the following criteria are implemented:
- Conventional: Gini index, cross-entropy, and correctly classified fraction of events.
- Physics: signal purity, S/√(S+B), 90% Bayesian UL, and 2(√(S+B) − √B).

Conventional decision tree, e.g., CART:
- Each split minimizes the Gini index:

    \mathrm{Gini} = \frac{S_1 B_1}{S_1 + B_1} + \frac{S_2 B_2}{S_2 + B_2}

- The tree spends 50% of its time finding clean signal nodes and 50% of its time finding clean background nodes.

Decision tree in StatPatternRecognition:
- Each split maximizes the signal significance:

    \mathrm{Signif} = \max\left( \frac{S_1}{\sqrt{S_1 + B_1}},\; \frac{S_2}{\sqrt{S_2 + B_2}} \right)
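To illustrate how a user-supplied criterion can drive a split, the sketch below contrasts the Gini index with the signal-significance figure of merit behind a small abstract interface. The interface and class names (SplitCriterion, GiniCriterion, SignificanceCriterion) are hypothetical illustrations, not the actual SPR C++ interface; the numbers are toy values.

```cpp
// Hypothetical illustration of a pluggable split criterion, contrasting the
// Gini index with the signal-significance figure of merit. Names are made up;
// this is not the actual StatPatternRecognition interface.
#include <cmath>
#include <algorithm>
#include <cstdio>

// Abstract criterion: score a split into two daughter nodes, each holding
// (signal, background) weighted event counts. Larger return value = better split.
struct SplitCriterion {
    virtual ~SplitCriterion() {}
    virtual double fom(double s1, double b1, double s2, double b2) const = 0;
};

// Conventional CART-like criterion: minimize the summed Gini index
// Gini = S1*B1/(S1+B1) + S2*B2/(S2+B2)  (returned negated so larger is better).
struct GiniCriterion : SplitCriterion {
    double fom(double s1, double b1, double s2, double b2) const override {
        const double g1 = (s1 + b1 > 0) ? s1 * b1 / (s1 + b1) : 0.0;
        const double g2 = (s2 + b2 > 0) ? s2 * b2 / (s2 + b2) : 0.0;
        return -(g1 + g2);
    }
};

// Physics criterion: maximize the signal significance of the better daughter,
// Signif = max( S1/sqrt(S1+B1), S2/sqrt(S2+B2) ).
struct SignificanceCriterion : SplitCriterion {
    double fom(double s1, double b1, double s2, double b2) const override {
        const double z1 = (s1 + b1 > 0) ? s1 / std::sqrt(s1 + b1) : 0.0;
        const double z2 = (s2 + b2 > 0) ? s2 / std::sqrt(s2 + b2) : 0.0;
        return std::max(z1, z2);
    }
};

int main() {
    // Two candidate splits of a node with S=100, B=100 (toy numbers).
    const double splits[2][4] = { { 60, 20, 40, 80 },    // fairly pure signal daughter
                                  { 90, 60, 10, 40 } };  // keeps more signal together
    GiniCriterion gini;
    SignificanceCriterion signif;
    for (const auto& s : splits)
        std::printf("split (S1=%g,B1=%g | S2=%g,B2=%g): Gini FOM=%.3f  Signif FOM=%.3f\n",
                    s[0], s[1], s[2], s[3],
                    gini.fom(s[0], s[1], s[2], s[3]),
                    signif.fom(s[0], s[1], s[2], s[3]));
    return 0;
}
```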

Decision Trees: An Idealistic Example

[Figure: separation of two bivariate Gaussians from uniform background by decision trees optimizing various figures of merit. Panels: true signal (top left), signal significance (top right), signal purity (bottom left), Gini index (bottom right).]

Bump hunting

[Figure: full signal+background distribution (left); the bump hunter finds two tiny signal peaks on top of a uniform background (right). Panel titles: "true signal" and "purity bump hunter, 2 pts/node, peel=.7".]

The algorithm finds a box in the multidimensional space in two stages:
- shrinkage: reduce the box size; the peel parameter is the maximum fraction of points that can be peeled off in one iteration
- expansion
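Below is a minimal sketch of the shrinkage (peeling) stage, assuming signal purity as the optimization target, a fixed peel fraction, and a minimum node size. Names and parameters are hypothetical illustrations of the PRIM peeling idea; this is not the StatPatternRecognition bump hunter.

```cpp
// Minimal sketch of PRIM-style shrinkage: repeatedly peel a small fraction of
// in-box events off one face of the box, keeping the peel that most improves
// signal purity. Hypothetical names/parameters; not the SPR bump hunter.
#include <vector>
#include <algorithm>
#include <initializer_list>
#include <cstdio>

struct Event { std::vector<double> x; bool isSignal; };
struct Box { std::vector<double> lo, hi; };

static bool inBox(const Event& e, const Box& box) {
    for (size_t d = 0; d < e.x.size(); ++d)
        if (e.x[d] < box.lo[d] || e.x[d] > box.hi[d]) return false;
    return true;
}

static double purity(const std::vector<Event>& evts, const Box& box, int& nInside) {
    int s = 0, n = 0;
    for (const auto& e : evts)
        if (inBox(e, box)) { ++n; if (e.isSignal) ++s; }
    nInside = n;
    return n > 0 ? double(s) / n : 0.0;
}

// One shrinkage step: try peeling a fraction `alpha` of in-box events from the
// low or high face of each dimension; keep the peel that maximizes purity.
// Returns false when no peel improves purity or the box would get too small.
static bool peelOnce(const std::vector<Event>& evts, Box& box,
                     double alpha, int minPoints) {
    int n0 = 0;
    double bestPurity = purity(evts, box, n0);
    Box bestBox = box;
    bool improved = false;
    for (size_t d = 0; d < box.lo.size(); ++d) {
        std::vector<double> v;                      // in-box values along dimension d
        for (const auto& e : evts) if (inBox(e, box)) v.push_back(e.x[d]);
        if (v.size() < size_t(minPoints)) return false;
        std::sort(v.begin(), v.end());
        const size_t k = std::max<size_t>(1, size_t(alpha * v.size()));
        Box lowPeel = box, highPeel = box;
        lowPeel.lo[d]  = v[k];                      // raise the lower edge past ~k events
        highPeel.hi[d] = v[v.size() - 1 - k];       // lower the upper edge past ~k events
        for (const Box& cand : { lowPeel, highPeel }) {
            int n = 0;
            const double p = purity(evts, cand, n);
            if (n >= minPoints && p > bestPurity) {
                bestPurity = p; bestBox = cand; improved = true;
            }
        }
    }
    if (improved) box = bestBox;
    return improved;
}

int main() {
    // Toy 1-D data: signal concentrated in (0.7, 0.9) on a uniform grid of points.
    std::vector<Event> evts;
    for (int i = 0; i < 200; ++i)
        evts.push_back({ { i / 200.0 }, i / 200.0 > 0.7 && i / 200.0 < 0.9 });
    Box box{ { 0.0 }, { 1.0 } };
    while (peelOnce(evts, box, /*alpha=*/0.1, /*minPoints=*/20)) {}
    int n = 0;
    const double p = purity(evts, box, n);
    std::printf("final box: [%.2f, %.2f], purity = %.2f, events = %d\n",
                box.lo[0], box.hi[0], p, n);
    return 0;
}
```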

AdaBoost

A method for combining classifiers. Applies classifiers sequentially:
- enhances the weights of misclassified events
- computes the relative weight of each classifier
- iterates as long as ε < 0.5
- Final decision: weighted vote of all classifiers.

Iteration 1:

    w_i^{(1)} = 1/N, \quad i = 1, \dots, N

Iteration K: train classifier f^{(K)}(x) on the weighted events; its weighted misclassification rate is

    \epsilon_K = \sum_{i\ \mathrm{misclassified}} w_i^{(K)} < 0.5

- correctly classified events:  w_i^{(K+1)} = w_i^{(K)} / \left( 2(1 - \epsilon_K) \right)
- misclassified events:  w_i^{(K+1)} = w_i^{(K)} / \left( 2\epsilon_K \right)
- classifier weight:  \beta_K = \log \frac{1 - \epsilon_K}{\epsilon_K}

Final classifier:

    f(x) = \sum_K \beta_K f^{(K)}(x)
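A compact C++ sketch of the discrete AdaBoost cycle written above, using a one-dimensional threshold cut (a "binary split") as the weak classifier. It follows the weight-update and classifier-weight formulas on this slide, but it is an illustration with made-up names and toy data, not the StatPatternRecognition implementation.

```cpp
// Minimal sketch of discrete AdaBoost with 1-D threshold cuts ("binary splits")
// as weak classifiers, following the weight-update rules above.
// Illustration only; not the StatPatternRecognition implementation.
#include <vector>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <initializer_list>

struct Stump { double cut; int sign; };   // classify +1 if sign*(x - cut) > 0

static int classify(const Stump& s, double x) {
    return s.sign * (x - s.cut) > 0 ? 1 : -1;
}

// Train the weak classifier: pick the cut/orientation with the smallest
// weighted misclassification rate eps.
static Stump trainStump(const std::vector<double>& x, const std::vector<int>& y,
                        const std::vector<double>& w, double& eps) {
    Stump best{ 0.0, 1 };
    eps = 1.0;
    for (double cut : x) {
        for (int sign : { -1, 1 }) {
            double e = 0.0;
            for (size_t i = 0; i < x.size(); ++i)
                if (classify(Stump{ cut, sign }, x[i]) != y[i]) e += w[i];
            if (e < eps) { eps = e; best = Stump{ cut, sign }; }
        }
    }
    return best;
}

int main() {
    // Toy 1-D sample: signal (+1) near x ~ +0.8, background (-1) near x ~ -0.8.
    std::vector<double> x;
    std::vector<int> y;
    for (int i = 0; i < 200; ++i) {
        const double u = 2.0 * std::rand() / RAND_MAX - 1.0;   // uniform in (-1, 1)
        x.push_back(u + (i % 2 ? 0.8 : -0.8));
        y.push_back(i % 2 ? 1 : -1);
    }

    const size_t N = x.size();
    std::vector<double> w(N, 1.0 / N);                // iteration 1: w_i = 1/N
    std::vector<Stump> stumps;
    std::vector<double> beta;
    for (int k = 0; k < 20; ++k) {
        double eps = 0.0;
        const Stump s = trainStump(x, y, w, eps);
        if (eps <= 0.0 || eps >= 0.5) break;          // iterate only while eps < 0.5
        stumps.push_back(s);
        beta.push_back(std::log((1.0 - eps) / eps));  // classifier weight beta_K
        for (size_t i = 0; i < N; ++i)                // reweight events
            w[i] /= (classify(s, x[i]) == y[i]) ? 2.0 * (1.0 - eps) : 2.0 * eps;
    }

    // Final decision f(x) = sum_K beta_K f^(K)(x): weighted vote of all classifiers.
    int nGood = 0;
    for (size_t i = 0; i < N; ++i) {
        double f = 0.0;
        for (size_t k = 0; k < stumps.size(); ++k) f += beta[k] * classify(stumps[k], x[i]);
        if ((f > 0 ? 1 : -1) == y[i]) ++nGood;
    }
    std::printf("boosted %zu splits, training accuracy = %.2f\n",
                stumps.size(), double(nGood) / N);
    return 0;
}
```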

Bagging (Bootstrap AGGregatING)

Algorithm:
- draw a bootstrap replica of the training sample
- train a new classifier on this replica
- classify new data by the majority vote of the built classifiers

Parallel algorithm: uses classifiers built on independently drawn bootstrap replicas.

Random forest: a more generic version of bagging that samples not only from the training events but also from the input dimensions. For now SPR bootstraps all input dimensions (you can't choose how many to include).
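A minimal C++ sketch of the bagging loop described above: draw bootstrap replicas, train one classifier per replica, and classify new data by majority vote. The base classifier, a cut at the midpoint between the class means, is a toy choice made only to keep the example self-contained; the names are hypothetical, and this is not StatPatternRecognition code.

```cpp
// Minimal sketch of bagging: bootstrap replicas of the training sample, one
// classifier per replica, majority vote at prediction time. The base
// "classifier" (a cut at the midpoint of the class means) is a toy choice to
// keep the example self-contained; this is not StatPatternRecognition code.
#include <vector>
#include <random>
#include <cstdio>

struct Sample { std::vector<double> x; std::vector<int> y; };   // y = +1 signal, -1 background
struct CutClassifier { double cut; };                           // classify +1 if x > cut

static CutClassifier train(const Sample& s) {
    double meanSig = 0, meanBkg = 0;
    int nSig = 0, nBkg = 0;
    for (size_t i = 0; i < s.x.size(); ++i) {
        if (s.y[i] > 0) { meanSig += s.x[i]; ++nSig; }
        else            { meanBkg += s.x[i]; ++nBkg; }
    }
    meanSig /= (nSig ? nSig : 1);
    meanBkg /= (nBkg ? nBkg : 1);
    return CutClassifier{ 0.5 * (meanSig + meanBkg) };
}

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> sig(1.0, 1.0), bkg(-1.0, 1.0);

    // Toy training sample: two overlapping 1-D Gaussians.
    Sample trainData;
    for (int i = 0; i < 500; ++i) {
        trainData.x.push_back(i % 2 ? sig(rng) : bkg(rng));
        trainData.y.push_back(i % 2 ? 1 : -1);
    }

    // Bagging: each classifier is built on an independently drawn bootstrap replica.
    const int nClassifiers = 25;
    std::uniform_int_distribution<size_t> pick(0, trainData.x.size() - 1);
    std::vector<CutClassifier> ensemble;
    for (int k = 0; k < nClassifiers; ++k) {
        Sample replica;
        for (size_t i = 0; i < trainData.x.size(); ++i) {       // sample with replacement
            const size_t j = pick(rng);
            replica.x.push_back(trainData.x[j]);
            replica.y.push_back(trainData.y[j]);
        }
        ensemble.push_back(train(replica));
    }

    // Classify a new point by the majority vote of the built classifiers.
    const double xNew = 0.3;
    int vote = 0;
    for (const auto& c : ensemble) vote += (xNew > c.cut) ? 1 : -1;
    std::printf("x = %.2f classified as %s by %d classifiers\n",
                xNew, vote > 0 ? "signal" : "background", nClassifiers);
    return 0;
}
```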

AdaBoost and Bagging: An Idealistic Example

[Figures: separation of two bivariate Gaussians from uniform background by boosted binary splits (distributions of classifier output for true signal, accepted signal, and accepted background) and by bagged decision trees. The background is uniform on -1 < x < 1, -1 < y < 1.]

Why are boosted and bagged decision trees good?

- The best off-the-shelf method for classification in many dimensions.
- Unlike kernel-based methods, methods based on local averaging, and nearest-neighbor methods, trees do not suffer from the curse of dimensionality: CPU time generally scales linearly with dimensionality.
- The training mechanism is robust.
- Can deal with strongly correlated inputs and mixed input types (continuous and discrete).

Physics Examples - 1

All results shown for test data.

- Separation of B → γ l ν from combined background at BaBar by bagged decision trees optimizing the signal significance. Time needed to train the bagged trees on 5k events: several hours. Gives a 25% improvement in the signal significance compared to the method originally used by the analysts.
- B/Bbar tagging at BaBar by boosted decision trees with 35 input variables. Time needed to train the boosted trees on 5k events: 1+ day. At the moment, this gives a somewhat worse B/Bbar separation than the clever algorithm that builds 9 neural nets on low-dimensional subtaggers and then combines them, but the first results showed promise.

[Figure: classifier output distributions for signal and background (B vs. Bbar).]

Physics Examples - 2 (D0 data from Harrison Prosper)

All results shown for test data.

- Separation of ppbar → tqb from ppbar → Wbb by bagged decision trees. Time needed to train the decision trees on the sample: several minutes.
- Separation of ppbar → tqb from ppbar → Wbb by boosted binary splits. Time needed to train the binary splits on the sample: seconds.

[Figures: classifier output distributions for ppbar → Wbb and ppbar → tqb for both methods.]

Installation and use outside BaBar

- Send email to narsky@hep.caltech.edu
- Get a copy of the package
- Edit SprExperiment.hh to remove the reference to BaBar.hh (one line)
- Provide a Makefile that resolves references to CLHEP headers and CERN libraries
- Either provide your own implementation of SprTupleWriter (depends on BaBar-specific histogramming utils) or replace it with the SprAsciiWriter provided in the package
- You are ready to go!

Feedback is very much appreciated.