Weighted Classification Cascades for Optimizing Discovery Significance

Weighted Classification Cascades for Optimizing Discovery Significance
Lester Mackey
Collaborators: Jordan Bryan and Man Yue Mo
Stanford University
December 13, 2014

Background: Hypothesis Testing in High-Energy Physics

Goal: Given a collection of events (high-energy particle collisions) and a definition of interesting (e.g., a Higgs boson was produced), detect whether any interesting events occurred.
- Interesting events = signal events
- Other events (e.g., no Higgs produced) = background events

Why? To test the predictions of physical models.
- The Standard Model of physics predicts the existence of elementary particles and various modes of particle decay.
- Claim: Higgs bosons exist and often decay into tau particles.
- To substantiate this claim experimentally, we must distinguish Higgs-to-tau-tau decay events (signal events) from other events with similar characteristics (background events).

Background: Hypothesis Testing in High-Energy Physics

Goal: Given a collection of events (high-energy particle collisions), test whether any signal events occurred.

How?
- Each event is represented by features (momenta and energies) of the particles produced in the collision.
- Ideally: base the test on the distributions of signal and background events.
- However, the signal and background event distributions are complex and difficult to characterize explicitly, which hinders the development of an analytical test.
- Instead: identify a relatively signal-rich selection region by training a classifier on labeled training data.
- Test a new dataset for signal by counting the events that fall in the selection region and computing an (approximate) significance value or p-value under a Poisson likelihood ratio test.

Background: Approximate Median Significance (AMS)

How do we estimate the significance of new event data?
- Dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ with event feature vectors $x_i \in \mathcal{X}$ and labels $y_i \in \{-1, 1\} = \{\text{background}, \text{signal}\}$
- Classifier $g : \mathcal{X} \to \{-1, 1\}$ assigning labels to events $x \in \mathcal{X}$
- True positive count $s_D(g) = \sum_{i=1}^n \mathbb{I}[g(x_i) = 1, y_i = 1]$
- False positive count $b_D(g) = \sum_{i=1}^n \mathbb{I}[g(x_i) = 1, y_i = -1]$

Approximate Median Significance (Cowan et al., 2011):
$$\mathrm{AMS}_2(g, D) = \sqrt{2\left((s_D(g) + b_D(g)) \log\left(\frac{s_D(g) + b_D(g)}{b_D(g)}\right) - s_D(g)\right)}$$
- Approximates the $(1 - p\text{-value})$ quantile of the Poisson model test statistic
- Measures significance in units of standard deviations ($\sigma$'s)
- Typically $> 5\sigma$ is needed to declare a signal discovery significant
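As a concreteness check, here is a minimal Python sketch (mine, not from the talk) that evaluates AMS_2 directly from the true and false positive counts of a selection region; the function name is my own.

```python
import numpy as np

def ams2(s, b):
    """Approximate Median Significance, AMS_2 (Cowan et al., 2011).

    s: signal (true positive) count in the selection region
    b: background (false positive) count in the selection region
    """
    if b <= 0:
        raise ValueError("background count must be positive")
    return np.sqrt(2.0 * ((s + b) * np.log((s + b) / b) - s))

# Example: 100 signal events on top of 10,000 background events is roughly a 1-sigma excess
print(ams2(100.0, 10000.0))  # ~0.998
```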

Background: Approximate Median Significance (AMS)

Training goal: Select a classifier $g$ to maximize $\mathrm{AMS}_2$ on future data.

Standard two-stage approach:
- Withhold a fraction of the training events.
- Stage 1: Train any standard classifier on the remaining events.
- Stage 2: Order the held-out events by classifier score and select a new classification threshold to maximize $\mathrm{AMS}_2$ on the held-out data.

Pro: Requires only standard classification tools; works with any classifier.
Con: Stage 2 is prone to overfitting and may require hand tuning.
Con: Stage 1 ignores the $\mathrm{AMS}_2$ objective and optimizes classification error instead.

This talk: A more direct approach to optimizing training $\mathrm{AMS}_2$ that requires only standard classification tools and works with any classifier supporting class weights.
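To make Stage 2 concrete, here is a hedged Python sketch (my own illustration, not the speaker's code) of threshold selection on held-out data: sort held-out events by score, scan every candidate cut, and evaluate AMS_2 from cumulative counts.

```python
import numpy as np

def select_threshold(scores_val, y_val):
    """Stage 2 sketch: choose the score cutoff that maximizes AMS_2 on held-out data.

    scores_val: classifier scores for held-out events (higher = more signal-like)
    y_val: labels in {-1, +1}
    """
    order = np.argsort(-scores_val)            # most signal-like events first
    y_sorted = y_val[order]
    s_cum = np.cumsum(y_sorted == 1)           # true positive count at each candidate cut
    b_cum = np.cumsum(y_sorted == -1)          # false positive count at each candidate cut
    b_safe = np.maximum(b_cum, 1e-12)          # avoid division by zero before masking
    ams = np.sqrt(2.0 * ((s_cum + b_cum) * np.log((s_cum + b_cum) / b_safe) - s_cum))
    ams[b_cum == 0] = -np.inf                  # AMS_2 is undefined without background in the region
    best = int(np.argmax(ams))
    return scores_val[order[best]], ams[best]  # chosen threshold and its held-out AMS_2
```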

Weighted Classification Cascades

Algorithm (Weighted Classification Cascade for Maximizing AMS_2)
- Initialize signal class weight: $u^{\mathrm{sig}}_0 > 0$
- For $t = 1$ to $T$:
  - Compute background class weight: $u^{\mathrm{bac}}_{t-1} = e^{u^{\mathrm{sig}}_{t-1}} - u^{\mathrm{sig}}_{t-1} - 1$
  - Train any weighted classifier: $g_t \approx$ minimizer of the weighted classification error $b_D(g)\, u^{\mathrm{bac}}_{t-1} + \bar{s}_D(g)\, u^{\mathrm{sig}}_{t-1}$ (where $\bar{s}_D(g) = \sum_{i=1}^n \mathbb{I}[y_i = 1] - s_D(g)$ = false negative count)
  - Update signal class weight: $u^{\mathrm{sig}}_t = \log(s_D(g_t)/b_D(g_t) + 1)$
- Return $g_T$

Advantages:
- Reduces optimizing AMS_2 to a series of classification problems
- Can use any weighted classification procedure
- AMS_2 improves whenever $g_t$ decreases the weighted classification error

Questions: Where does this come from? Why should this work?
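A minimal Python sketch of this cascade (my own illustration, not the authors' code), using scikit-learn's sample_weight mechanism to encode the per-class weights; any classifier that accepts sample weights could be substituted for the gradient boosting model used here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def ams2_cascade(X, y, T=5, u_sig=1.0):
    """Weighted classification cascade sketch for maximizing training AMS_2.

    X: event features; y: labels in {-1, +1}; u_sig: initial signal class weight (> 0).
    """
    g = None
    for t in range(T):
        u_bac = np.exp(u_sig) - u_sig - 1.0       # background class weight e^u - u - 1
        w = np.where(y == 1, u_sig, u_bac)        # per-event weights from the class weights
        g = GradientBoostingClassifier().fit(X, y, sample_weight=w)
        pred = g.predict(X)
        s = np.sum((pred == 1) & (y == 1))        # training true positives
        b = np.sum((pred == 1) & (y == -1))       # training false positives
        if s == 0 or b == 0:                      # guard against a degenerate round
            break
        u_sig = np.log(s / b + 1.0)               # closed-form signal class weight update
    return g
```

Misclassified signal events contribute u_sig and misclassified background events contribute u_bac to the weighted 0-1 loss, matching the weighted error $b_D(g)\, u^{\mathrm{bac}} + \bar{s}_D(g)\, u^{\mathrm{sig}}$ above.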

Weighted Classification Cascades: The Difficulty of Optimizing AMS

Approximate Median Significance (squared and halved):
$$\tfrac{1}{2}\mathrm{AMS}_2^2(g, D) = (s_D(g) + b_D(g)) \log\left(\frac{s_D(g) + b_D(g)}{b_D(g)}\right) - s_D(g)$$
- True positive count $s_D(g) = \sum_{i=1}^n \mathbb{I}[g(x_i) = 1, y_i = 1]$
- False positive count $b_D(g) = \sum_{i=1}^n \mathbb{I}[g(x_i) = 1, y_i = -1]$

$\tfrac{1}{2}\mathrm{AMS}_2^2$ is:
- Combinatorial, as a function of indicator functions
- Non-decomposable across events, due to the logarithm
- Convex in $(s_D(g), b_D(g))$: bad for maximization

Weighted Classification Cascades: Linearizing AMS with Convex Duality

Observation:
$$\tfrac{1}{2}\mathrm{AMS}_2^2(g, D) = b_D(g)\, f_2\!\left(\frac{s_D(g)}{b_D(g)}\right) = b_D(g) \sup_u \left[ u\,\frac{s_D(g)}{b_D(g)} - f_2^*(u) \right] = \sup_u \left[ u\, s_D(g) - f_2^*(u)\, b_D(g) \right] = -\inf_u \left[ u\, \bar{s}_D(g) + f_2^*(u)\, b_D(g) - u \sum_{i=1}^n \mathbb{I}[y_i = 1] \right]$$
where
- $f_2(t) = (1 + t)\log(1 + t) - t$ is convex
- $f_2$ admits the variational representation $f_2(t) = \sup_u\, [ut - f_2^*(u)]$ in terms of its convex conjugate $f_2^*(u) \triangleq \sup_t\, [tu - f_2(t)] = e^u - u - 1$
- the false negative count satisfies $\bar{s}_D(g) = \sum_{i=1}^n \mathbb{I}[y_i = 1] - s_D(g)$
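For completeness, the conjugate value $f_2^*(u) = e^u - u - 1$ can be verified directly; this short derivation is my own and is not shown on the slide.

```latex
% Convex conjugate of f_2(t) = (1+t)\log(1+t) - t (derivation not on the original slide)
\begin{align*}
f_2^*(u) &= \sup_t \big[\, tu - (1+t)\log(1+t) + t \,\big] \\
  % stationarity: u - \log(1+t) = 0, i.e., t^* = e^u - 1
  &= (e^u - 1)\,u - e^u u + (e^u - 1) \;=\; e^u - u - 1.
\end{align*}
```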

Weighted Classification Cascades: Optimizing AMS with Coordinate Descent

Take-away:
$$\tfrac{1}{2}\mathrm{AMS}_2^2(g, D) = -\inf_u \left[ u\, \bar{s}_D(g) + (e^u - u - 1)\, b_D(g) - u \sum_{i=1}^n \mathbb{I}[y_i = 1] \right]$$
- Maximizing AMS_2 is equivalent to minimizing the weighted error
$$R_2(g, u, D) \triangleq u\, \bar{s}_D(g) + (e^u - u - 1)\, b_D(g) - u \sum_{i=1}^n \mathbb{I}[y_i = 1]$$
over classifiers $g$ and the signal class weight $u$ jointly.

Optimize $R_2(g, u, D)$ with coordinate descent:
- Update $g_t$ for fixed $u_{t-1}$: train a weighted classifier
- Update $u_t$ for fixed $g_t$: closed form, $u_t = \log(s_D(g_t)/b_D(g_t) + 1)$

AMS_2 increases whenever a new $g_{t+1}$ achieves smaller weighted classification error with respect to $u_t$ than its predecessor $g_t$:
$$\tfrac{1}{2}\mathrm{AMS}_2(g_{t+1})^2 \geq -R_2(g_{t+1}, u_t) > -R_2(g_t, u_t) = \tfrac{1}{2}\mathrm{AMS}_2(g_t)^2$$
- A minorization-maximization algorithm (like EM)
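The closed-form weight update can be checked by differentiating $R_2$ in $u$; this step is stated without proof on the slide, so the derivation below is my reconstruction.

```latex
% Closed-form update for u given g_t, using \bar{s}_D(g_t) - \sum_i \mathbb{I}[y_i = 1] = -s_D(g_t)
\frac{\partial}{\partial u} R_2(g_t, u, D)
  = -\,s_D(g_t) + (e^u - 1)\, b_D(g_t) = 0
  \;\Longleftrightarrow\;
  u = \log\!\left(\frac{s_D(g_t)}{b_D(g_t)} + 1\right).
```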

Weighted Classification Cascades: Optimizing Alternative Significance Measures

Simpler form of AMS: $\mathrm{AMS}_3(g, D) = s_D(g)/\sqrt{b_D(g)}$
- Approximates AMS_2: $\mathrm{AMS}_2 = \mathrm{AMS}_3\,(1 + O(s/b))$ when $s \ll b$
- Amenable to weighted classification cascading:
$$\tfrac{1}{2}\mathrm{AMS}_3^2(g, D) = b_D(g)\, f_3\!\left(\frac{s_D(g)}{b_D(g)}\right) \quad \text{for convex } f_3(t) = \tfrac{1}{2}t^2$$
- (Can also support uncertainty in $b$: replace $b_D(g)$ with $b_D(g) + \sigma_b$)

Algorithm (Weighted Classification Cascade for Maximizing AMS_3)
- For $t = 1$ to $T$:
  - Compute background class weight: $u^{\mathrm{bac}}_{t-1} = (u^{\mathrm{sig}}_{t-1})^2 / 2$
  - Train any weighted classifier: $g_t \approx$ minimizer of the weighted classification error $b_D(g)\, u^{\mathrm{bac}}_{t-1} + \bar{s}_D(g)\, u^{\mathrm{sig}}_{t-1}$
  - Update signal class weight: $u^{\mathrm{sig}}_t = s_D(g_t)/b_D(g_t)$
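Relative to the AMS_2 cascade sketched earlier, only the two weight formulas change. The Python fragment below is my own illustration of those two lines, with a hypothetical sigma_b argument for the background-uncertainty variant mentioned on the slide.

```python
def ams3_weights(u_sig):
    """Class weights for one round of the AMS_3 cascade (sketch)."""
    u_bac = 0.5 * u_sig ** 2          # background class weight (u^sig)^2 / 2
    return u_sig, u_bac

def ams3_weight_update(s, b, sigma_b=0.0):
    """Closed-form signal weight update for AMS_3; sigma_b > 0 adds a background uncertainty term."""
    return s / (b + sigma_b)
```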

HiggsML Challenge Case Study: Cascading in the Wild

So far: a recipe for turning a classifier into a training-AMS maximizer. It must be coupled with effective regularization strategies to ensure adequate test-set generalization.

Team mymo incorporated two practical variants of cascading into its HiggsML challenge solution, placing 31st out of 1,800 teams.

Cascading Variant 1
- Fit each classifier $g_t$ using the XGBoost implementation of gradient tree boosting (https://github.com/tqchen/xgboost)
- To curb overfitting, computed true and false positive counts on a held-out dataset $D_{\mathrm{val}}$ and updated the class weight parameter $u^{\mathrm{sig}}_t$ using $s_{D_{\mathrm{val}}}(g_t)$ and $b_{D_{\mathrm{val}}}(g_t)$ in lieu of $s_D(g_t)$ and $b_D(g_t)$
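A sketch of the Variant 1 weight update (my illustration; the function and variable names are hypothetical), computing the counts on a held-out split instead of the training set:

```python
import numpy as np

def validation_weight_update(g, X_val, y_val):
    """Variant 1 sketch: update u^sig from held-out counts to curb overfitting."""
    pred = g.predict(X_val)
    s_val = np.sum((pred == 1) & (y_val == 1))    # held-out true positive count
    b_val = np.sum((pred == 1) & (y_val == -1))   # held-out false positive count
    return np.log(s_val / b_val + 1.0)            # AMS_2 cascade update with validation counts
```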

HiggsML Challenge Case Study: Cascading in the Wild (continued)

Cascading Variant 2
- Maintained a single persistent classifier whose complexity grew on each cascade round
- Developed a customized XGBoost classifier that, on cascade round $t$, introduced a single new decision tree based on the gradient of the round-$t$ weighted classification error
- In effect, each classifier $g_t$ was warm-started from the prior round's classifier $g_{t-1}$
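The team's customized classifier itself is not shown in the talk; the sketch below is only a rough approximation of the warm-starting idea using the public xgboost Python API (the xgb_model argument continues training from an existing booster), with all parameter choices my own.

```python
import numpy as np
import xgboost as xgb

def warm_started_cascade(X, y, T=10, u_sig=1.0, params=None):
    """Rough Variant 2 sketch: add one boosted tree per cascade round,
    re-weighting events between rounds and warm-starting from the previous booster."""
    params = params or {"objective": "binary:logistic", "max_depth": 6, "eta": 0.1}
    labels01 = (y == 1).astype(int)                   # xgboost expects {0, 1} labels
    booster = None
    for t in range(T):
        u_bac = np.exp(u_sig) - u_sig - 1.0           # AMS_2 background class weight
        w = np.where(y == 1, u_sig, u_bac)
        dtrain = xgb.DMatrix(X, label=labels01, weight=w)
        booster = xgb.train(params, dtrain, num_boost_round=1, xgb_model=booster)
        pred = (booster.predict(dtrain) > 0.5).astype(int)
        s = np.sum((pred == 1) & (labels01 == 1))     # training true positives
        b = np.sum((pred == 1) & (labels01 == 0))     # training false positives
        if s == 0 or b == 0:
            break
        u_sig = np.log(s / b + 1.0)                   # closed-form signal weight update
    return booster
```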

HiggsML Challenge Case Study: Cascading in the Wild (continued)

Final Solution
- An ensemble of cascade procedures of each variant and several non-cascaded (standard two-stage / hand-tuned) XGBoost, random forest, and neural network models
- The ensemble of all non-cascade models yielded a private leaderboard score of 3.67 (roughly 198th place)
- Each cascade variant alone yielded 3.65
- Incorporating the cascade models into the ensemble yielded 3.72594

The Future: Beyond the HiggsML Challenge

Next Steps
- A more comprehensive, controlled empirical evaluation of cascading
- A more extensive exploration of strategies for ensuring good generalization

Thanks!

References

Cowan, Glen, Cranmer, Kyle, Gross, Eilam, and Vitells, Ofer. Asymptotic formulae for likelihood-based tests of new physics. The European Physical Journal C - Particles and Fields, 71(2):1-19, 2011.