NON-TRIVIAL APPLICATIONS OF BOOSTING: REWEIGHTING DISTRIBUTIONS WITH BDT, BOOSTING TO UNIFORMITY

Alex Rogozhnikov, YSDA, NRU HSE
Heavy Flavour Data Mining workshop, Zurich, 2016

BOOSTING

- a family of machine learning algorithms which convert weak learners into strong ones
- usually built over decision trees
- state-of-the-art results in many areas
- usually general-purpose implementations are used, for classification and regression

How can we get more by adapting boosting to our applications?

ADABOOST (ADAPTIVE BOOSTING)

- the first famous example of boosting
- weak learners are built one by one, and their predictions are summed up: $D(x) = \sum_j \alpha_j d_j(x)$
- after each tree $d(x)$, the weights of incorrectly classified events are increased: $w_i \leftarrow w_i \exp(-\alpha y_i d(x_i))$, $y_i = \pm 1$
- main idea: provide the base estimator (weak learner) with information about which samples have higher importance
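A minimal numpy/scikit-learn sketch of this weight-update loop, using decision stumps as weak learners (function names and hyperparameters are illustrative, not the reference AdaBoost implementation):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, n_estimators=100):
    """Discrete AdaBoost sketch; labels y must be +-1."""
    w = np.ones(len(y)) / len(y)
    trees, alphas = [], []
    for _ in range(n_estimators):
        stump = DecisionTreeClassifier(max_depth=1)      # weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)                          # predictions in {-1, +1}
        err = np.sum(w * (pred != y)) / np.sum(w)
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        w = w * np.exp(-alpha * y * pred)                # boost misclassified events
        w /= w.sum()
        trees.append(stump)
        alphas.append(alpha)
    return trees, alphas

def adaboost_decision(trees, alphas, X):
    """D(x) = sum_j alpha_j d_j(x)."""
    return sum(a * t.predict(X) for a, t in zip(alphas, trees))
```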

REWEIGHTING OF DISTRIBUTIONS

Reweighting in HEP is used to minimize the difference between real data (RD) and Monte Carlo (MC) simulation. A known process is used, for which real data can be obtained. Goal of reweighting: assign weights to MC such that the MC and RD distributions coincide.

APPLICATIONS BEYOND PHYSICS

Introducing corrections to fight non-response bias: assigning higher weights to answers from groups with a low response rate. See e.g. R. Kizilcec, "Reducing non-response bias with survey reweighting: Applications for online learning researchers", 2014.

I'll talk in physics terms of original (MC) and target (RD) distributions, since this is applicable to any reweighting problem.

TYPICAL APPROACH

Usually histogram reweighting is used: the variable(s) are split into bins, and in each bin the weight of the original distribution is multiplied by
$$\text{multiplier}_{\text{bin}} = \frac{w_{\text{target, bin}}}{w_{\text{original, bin}}}$$
where $w_{\text{target, bin}}$, $w_{\text{original, bin}}$ are the total weights of events in the bin for the target and original distributions.

1. Simple and fast!
2. Very few (typically one or two) variables
3. Reweighting one variable may bring disagreement in others
4. Which variable to use in reweighting?
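For reference, a one-dimensional sketch of this histogram reweighting (unit event weights assumed for brevity; names are illustrative, the actual bins-based reweighter used later works in several variables):

```python
import numpy as np

def bins_reweight(original, target, n_bins=50):
    """1D histogram reweighting sketch: multiplier_bin = w_target,bin / w_original,bin."""
    edges = np.histogram_bin_edges(np.concatenate([original, target]), bins=n_bins)
    w_orig, _ = np.histogram(original, bins=edges)
    w_targ, _ = np.histogram(target, bins=edges)
    # empty original bins get multiplier 0 to avoid division by zero
    multiplier = np.divide(w_targ, w_orig, out=np.zeros(n_bins), where=w_orig > 0)
    bin_index = np.clip(np.digitize(original, edges) - 1, 0, n_bins - 1)
    return multiplier[bin_index]          # per-event weights for the original sample
```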

COMPARING DISTRIBUTIONS

How do we check the quality of reweighting?
- one dimension: KS test, CvM, Mann-Whitney (U test)
- two or more dimensions? Comparing 1D projections is not the way.
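For the one-dimensional case, the standard tests are available in scipy (toy data shown for illustration; note these scipy implementations do not take sample weights into account):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
a = rng.normal(size=1000)              # e.g. reweighted MC in one variable
b = rng.normal(0.1, 1.0, size=1000)    # real data in the same variable

ks_stat, ks_p = stats.ks_2samp(a, b)                               # Kolmogorov-Smirnov test
u_stat, u_p = stats.mannwhitneyu(a, b, alternative='two-sided')    # Mann-Whitney U test
```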

COMPARING DISTRIBUTIONS USING ML

Final goal: determine whether a classifier can discriminate RD from MC. The comparison is done using ML: the output of the classifier is one-dimensional, so we can look at the ROC curve on a holdout. See also: J. Friedman, "On Multivariate Goodness-of-Fit and Two-Sample Testing", 2003.
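A sketch of this classifier-based two-sample comparison (any classifier works; GradientBoostingClassifier is just an example choice). A holdout ROC AUC close to 0.5 means the (reweighted) samples are indistinguishable:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def two_sample_auc(original, target, original_weight=None):
    """Train a classifier to separate original (MC, label 0) from target (RD, label 1)
    and report ROC AUC on a holdout."""
    if original_weight is None:
        original_weight = np.ones(len(original))
    X = np.vstack([original, target])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    w = np.concatenate([original_weight, np.ones(len(target))])
    X_tr, X_te, y_tr, y_te, w_tr, w_te = train_test_split(X, y, w, test_size=0.5, random_state=0)
    clf = GradientBoostingClassifier().fit(X_tr, y_tr, sample_weight=w_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1], sample_weight=w_te)
```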

REWEIGHTING WITH HISTOGRAMS: EXAMPLE

Problems arise when there are too few events in a bin; these can be detected on a holdout. Issues:
1. few bins: the rule is rough
2. many bins: the rule is not reliable

REUSING ML CLASSIFIERS FOR REWEIGHTING

We need to estimate the density ratio $\frac{f_1(x)}{f_0(x)}$. A classifier trained to discriminate MC and RD should reconstruct the probabilities $p_0(x)$ and $p_1(x)$, so for reweighting we can use
$$\frac{p_1(x)}{p_0(x)} \approx \frac{f_1(x)}{f_0(x)}$$
- able to reweight in many variables
- successfully tried in HEP, see D. Martschei et al., "Advanced event reweighting using multivariate analysis", 2012
- but: poor reconstruction when the ratio is too small / too high
- but: slow
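A sketch of this density-ratio trick (the classifier and the clipping threshold are illustrative choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def classifier_reweight(original, target, max_ratio=50.0):
    """Train a classifier to separate original (label 0) from target (label 1),
    then weight original events by p1(x) / p0(x) ~ f_target(x) / f_original(x)."""
    X = np.vstack([original, target])
    y = np.concatenate([np.zeros(len(original)), np.ones(len(target))])
    clf = GradientBoostingClassifier().fit(X, y)
    proba = clf.predict_proba(original)
    ratio = proba[:, 1] / np.clip(proba[:, 0], 1e-6, None)
    # clip extreme ratios: reconstruction is poor when the ratio is very small / large
    return np.clip(ratio, 0.0, max_ratio)
```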

BETTER IDEA

Write an ML algorithm that solves the reweighting problem directly:
1. Split the space of variables into several large regions
2. Find these regions 'intelligently'!

USING A DECISION TREE TO FIND REGIONS

1. A tree splits the space of variables with orthogonal splits
2. Regions with a large difference are found by maximizing the symmetrized $\chi^2$:
$$\chi^2 = \sum_{\text{leaf}} \frac{(w_{\text{leaf, original}} - w_{\text{leaf, target}})^2}{w_{\text{leaf, original}} + w_{\text{leaf, target}}}$$

SYMMETRIZED BINNED χ²

Finding the optimal threshold to split a variable into two bins:
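A brute-force sketch of such a threshold search over one variable, assuming the original and target weights have been normalized to the same total (names are illustrative):

```python
import numpy as np

def best_split_symmetrized_chi2(x_original, x_target, w_original, w_target):
    """Scan candidate thresholds and keep the one maximizing the symmetrized chi2
    between original and target weights in the two resulting bins."""
    thresholds = np.unique(np.concatenate([x_original, x_target]))
    best_threshold, best_chi2 = None, -np.inf
    for t in thresholds[1:]:
        chi2 = 0.0
        for mask_o, mask_t in [(x_original < t, x_target < t),
                               (x_original >= t, x_target >= t)]:
            w_o, w_t = w_original[mask_o].sum(), w_target[mask_t].sum()
            if w_o + w_t > 0:
                chi2 += (w_o - w_t) ** 2 / (w_o + w_t)
        if chi2 > best_chi2:
            best_threshold, best_chi2 = t, chi2
    return best_threshold, best_chi2
```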

BDT REWEIGHTER

To train the reweighter, repeat the following steps many times:
1. build a shallow tree to maximize the symmetrized $\chi^2$
2. compute predictions in the leaves: $\text{leaf\_pred} = \log \frac{w_{\text{leaf, target}}}{w_{\text{leaf, original}}}$
3. reweight distributions (compare with AdaBoost):
$$w = \begin{cases} w, & \text{if the event is from the target (RD) distribution} \\ w \, e^{\text{pred}}, & \text{if the event is from the original (MC) distribution} \end{cases}$$

CHANGES COMPARED TO GBDT
- tree splitting criterion (MSE → $\chi^2$)
- different boosting procedure
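In practice this reweighter is available as GBReweighter in the author's hep_ml package; a minimal usage sketch with toy data follows. Hyperparameter values are illustrative, and parameter names should be checked against the package documentation:

```python
import numpy as np
from hep_ml.reweight import GBReweighter  # pip install hep_ml

# toy 2D "MC" and "RD" samples -- purely illustrative
rng = np.random.RandomState(0)
original = rng.normal(loc=0.0, scale=1.0, size=(10000, 2))   # MC
target = rng.normal(loc=0.3, scale=1.2, size=(10000, 2))     # RD

# shallow trees maximizing the symmetrized chi2; leaf log-ratios act as weight multipliers
reweighter = GBReweighter(n_estimators=50, learning_rate=0.1,
                          max_depth=3, min_samples_leaf=200)
reweighter.fit(original, target)
mc_weights = reweighter.predict_weights(original)   # new weights for the MC events
```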

DEMONSTRATION

[figure: original distribution vs. result of the BDT reweighter]

KS DISTANCES AFTER REWEIGHTING

The bins reweighter uses only the 2 last variables (60 × 60 bins); the ML and BDT reweighters use all variables.

Feature                  original   bins     reuse ML   BDT reweighter
Bplus_IPCHI2_OWNPV       0.0796     0.0642   0.0463     0.0028
Bplus_ENDVERTEX_CHI2     0.0094     0.0175   0.0490     0.0021
Bplus_PT                 0.0586     0.0679   0.0126     0.0053
Bplus_P                  0.1093     0.1126   0.0044     0.0047
Bplus_TAU                0.0037     0.0060   0.0324     0.0044
mu_min_pt                0.0623     0.0604   0.0017     0.0036
mu_max_pt                0.0483     0.0561   0.0053     0.0035
mu_max_p                 0.0906     0.0941   0.0084     0.0036
mu_min_p                 0.0845     0.0858   0.0058     0.0043
mu_max_track_chi2ndof    0.0956     0.0042   0.0128     0.0043
nspdhits                 0.2478     0.0098   0.0180     0.0075

COMPARING RESULTS WITH ML

Using the approach to distribution comparison described earlier: reweighting two variables wasn't enough.

FEATURE IMPORTANCES

Being a variation of GBDT, the BDT reweighter is able to calculate feature importances. The two features used in reweighting with bins are indeed the most important.

BOOSTED REWEIGHTING

- uses only a few large bins at each step
- is able to handle many variables
- requires less data (for the same performance)
- ... but is slow

SUMMARY ABOUT REWEIGHTING
1. Comparison of multidimensional distributions is an ML problem
2. Reweighting of distributions is an ML problem

RECALL: GRADIENT BOOSTING

Gradient boosting greedily builds an ensemble of estimators (regressors) $D(x) = \sum_j \alpha_j d_j(x)$ by optimizing some loss function $\mathcal{L}$, e.g.:
- MSE: $\mathcal{L} = \sum_i (y_i - D(x_i))^2$
- AdaLoss: $\mathcal{L} = \sum_i e^{-y_i D(x_i)}$, $y_i = \pm 1$
- LogLoss: $\mathcal{L} = \sum_i \log(1 + e^{-y_i D(x_i)})$, $y_i = \pm 1$

The next estimator in the series approximates the gradient of the loss in the space of functions.
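Concretely, the next regressor is fitted to the pseudo-residuals, i.e. the negative gradient of the loss with respect to the current predictions; a short sketch for the three losses above:

```python
import numpy as np

def negative_gradient(loss, y, D):
    """Pseudo-residuals -dL/dD(x_i) that the next regressor is fitted to
    (y in {-1, +1} for AdaLoss / LogLoss)."""
    if loss == "mse":        # L = sum (y - D)^2
        return 2.0 * (y - D)
    if loss == "adaloss":    # L = sum exp(-y D)
        return y * np.exp(-y * D)
    if loss == "logloss":    # L = sum log(1 + exp(-y D))
        return y / (1.0 + np.exp(y * D))
    raise ValueError(loss)
```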

BOOSTING TO UNIFORMITY

The point of uniform boosting is to have constant efficiency (FPR/TPR) against some variable. Examples:
- flat background efficiency along mass
- flat signal efficiency for different flight times
- flat signal efficiency along Dalitz variables

EXAMPLE: NON-FLAT BACKGROUND EFFICIENCY (FPR) ALONG MASS

A high correlation with mass will create a false peaking signal out of pure background (especially if we use mass sidebands for training).
Goal: FPR = const for different regions in mass. Reminder: FPR = background efficiency.

BASIC APPROACH

- reduce the number of features used in training
- leave only the set of features that do not give enough information to reconstruct the mass of the particle
- simple and works
- sometimes we have to lose much information

Can we modify ML to use all features, but provide uniform background efficiency (FPR) along mass?

uBoostBDT

Aims to get FPR_region = const:
- fix a target efficiency (say FPR_target = 30%) and find the corresponding threshold
- train a tree with decision function $d(x)$
- increase weights for misclassification: $w_i \leftarrow w_i \exp(-\alpha y_i d(x_i))$
- increase weights of signal events in regions with high FPR: $w_i \leftarrow w_i \exp\left(\beta\,(\text{FPR}_{\text{region}} - \text{FPR}_{\text{target}})\right)$

This way we achieve FPR_region = 30% in all regions, but only for one particular threshold, on the training dataset.
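A sketch of the extra, uniformity-driven weight update described in the last bullet (region assignment and β are illustrative; this is not the full uBoostBDT training loop):

```python
import numpy as np

def uniformity_weight_update(weights, is_signal, fpr_region, fpr_target=0.3, beta=1.0):
    """Boost signal events sitting in mass regions whose measured background
    efficiency (FPR) exceeds the target."""
    # fpr_region: per-event FPR of the mass region the event belongs to
    boost = np.exp(beta * (fpr_region - fpr_target))
    return np.where(is_signal, weights * boost, weights)
```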

uBoost

uBoost is an ensemble of uBoostBDTs; each uBoostBDT uses its own FPR_target (all possible FPRs with a step of 1%). A uBoostBDT returns 0 or 1 (whether the event passed the threshold corresponding to FPR_target), and simple averaging is used to obtain predictions.

- drives towards a uniform selection
- very complex training, many trees
- the estimation of the threshold in uBoostBDT may be biased

MEASURING NON-UNIFORMITY

A difference in efficiency can be detected by analyzing distributions: uniformity means no dependence between mass and predictions.

[figure: uniform predictions vs. non-uniform predictions (peak in highlighted region)]

MEASURING NON-UNIFORMITY

Sum up contributions from different regions in mass:
$$\text{CvM} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 \, dF_{\text{global}}(s)$$
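A simplified sketch of this CvM non-uniformity measure, using equal-population mass regions and empirical CDFs (the binning scheme is an illustrative choice):

```python
import numpy as np

def cvm_nonuniformity(predictions, mass, n_regions=10):
    """Compare the prediction CDF in each mass region with the global CDF,
    integrating (by averaging) with respect to dF_global."""
    preds_sorted = np.sort(predictions)
    f_global = np.arange(1, len(predictions) + 1) / len(predictions)
    edges = np.percentile(mass, np.linspace(0, 100, n_regions + 1))
    cvm = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_region = (mass >= lo) & (mass < hi)
        if not in_region.any():
            continue
        region_sorted = np.sort(predictions[in_region])
        # F_region evaluated at the globally sorted prediction values
        f_region = np.searchsorted(region_sorted, preds_sorted, side='right') / len(region_sorted)
        cvm += np.mean((f_region - f_global) ** 2)
    return cvm
```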

MINIMIZING NON-UNIFORMITY

Why not minimize CvM directly with gradient boosting? Because we can't compute its gradient (ROC AUC and classification accuracy are not differentiable either). Also, minimizing CvM does not account for the classification problem: the minimum of CvM is achieved, for instance, by a classifier with random predictions.

FLATNESS LOSS (FL)

Put an additional term in the loss function which penalizes non-uniformity:
$$\mathcal{L} = \mathcal{L}_{\text{adaloss}} + c \, \text{FL}$$
Flatness loss approximates the (non-differentiable) CvM metric:
$$\text{FL} = \sum_{\text{region}} \int \left| F_{\text{region}}(s) - F_{\text{global}}(s) \right|^2 ds, \qquad
\frac{\partial\, \text{FL}}{\partial D(x_i)} \approx 2 \left( F_{\text{region}}(s) - F_{\text{global}}(s) \right) \Big|_{s = D(x_i)}$$
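A sketch of how this per-event gradient could be evaluated from empirical CDFs (a simplified illustration, not the hep_ml implementation):

```python
import numpy as np

def flatness_loss_gradient(predictions, mass, n_regions=10):
    """dFL/dD(x_i) ~= 2 * (F_region(D(x_i)) - F_global(D(x_i)));
    gradient boosting then fits a regressor to minus this gradient."""
    n = len(predictions)
    ranks = predictions.argsort().argsort()
    f_global = (ranks + 1) / n                          # global empirical CDF at each event
    edges = np.percentile(mass, np.linspace(0, 100, n_regions + 1))
    region = np.clip(np.digitize(mass, edges) - 1, 0, n_regions - 1)
    grads = np.zeros(n)
    for r in range(n_regions):
        idx = np.where(region == r)[0]
        if len(idx) == 0:
            continue
        sorted_preds = np.sort(predictions[idx])
        f_region = np.searchsorted(sorted_preds, predictions[idx], side='right') / len(idx)
        grads[idx] = 2.0 * (f_region - f_global[idx])
    return grads
```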

EXAMPLE (EFFICIENCY OVER BACKGROUND)

When we train on a sideband using many features, we can easily run into problems: all models use the same features for discrimination, but AdaBoost acquired a serious correlation with mass.

BOOSTING SUMMARY

- powerful general-purpose algorithm
- best-known applications: classification, regression and ranking
- widely used, considered well-studied
- can be adapted to specific scientific problems

EXAMPLES DEMONSTRATED TODAY:
1. reweighting of multidimensional distributions
2. boosting to uniformity

These algorithms are available in the hep_ml package (based on sklearn); a usage sketch follows.
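For the uniformity part, a minimal sketch assuming the hep_ml API (BinFlatnessLossFunction combined with UGradientBoostingClassifier); toy data, parameter names and values are illustrative and may differ between package versions:

```python
import numpy as np
import pandas as pd
from hep_ml.losses import BinFlatnessLossFunction
from hep_ml.gradientboosting import UGradientBoostingClassifier

# toy data: 'mass' is the variable along which the background (label 0)
# efficiency should stay flat; features and labels are placeholders
rng = np.random.RandomState(0)
data = pd.DataFrame(rng.normal(size=(20000, 4)), columns=['f1', 'f2', 'f3', 'mass'])
labels = rng.randint(0, 2, size=len(data))

# AdaLoss + c * flatness loss, uniform in 'mass' for the background class
loss = BinFlatnessLossFunction(uniform_features=['mass'], uniform_label=0,
                               n_bins=15, fl_coefficient=3.0)
clf = UGradientBoostingClassifier(loss=loss, n_estimators=100, max_depth=3,
                                  learning_rate=0.1,
                                  train_features=['f1', 'f2', 'f3'])  # mass used only by the loss
clf.fit(data, labels)
proba = clf.predict_proba(data)
```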

REFERENCES

- D. Martschei, M. Feindt, S. Honc, J. Wagner-Kuhr, "Advanced event reweighting using multivariate analysis"
- "Reweighting with Boosted Decision Trees"
- J. Stevens, M. Williams, "uBoost: A boosting method for producing uniform selection efficiencies from multivariate classifiers"
- A. Rogozhnikov, A. Bukva, V. Gligorov, A. Ustyuzhanin, M. Williams, "New approaches for boosting to uniformity"
- Implementations of algorithms: github repository
