Harrison B. Prosper. Bari Lectures


Lectures on Multivariate Methods. Harrison B. Prosper, Florida State University. Bari Lectures, 30 and 31 May and 1 June 2016.

Lecture 1: Introduction; Classification; Grid Searches; Decision Trees
Lecture 2: Boosted Decision Trees
Lecture 3: Neural Networks; Bayesian Neural Networks

The goal of a typical multivariate method is to approximate the mapping from inputs, or features, x = (x1, x2, ..., xn), to outputs, or responses, y, where y = f(x), assuming some specific class {f} of functions together with some constraints on that class, e.g., that the functions should be smooth.

Today, we shall explore a method called boosted decision trees, using a wine-tasting example based on data by Cortez et al.* (http://www.vinhoverde.pt/en/history-of-vinho-verde)
* P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis, "Modeling wine preferences by data mining from physicochemical properties", Decision Support Systems 47(4), 547-553 (2009). ISSN: 0167-9236.

Decision tree: a sequence of if-then-else statements, arranged as a tree of nodes: a root node, child nodes, and leaf nodes. Basic idea: recursively partition the space {x} into regions of increasing purity. Geometrically, a decision tree is a d-dimensional histogram in which the bins are built using recursive binary partitioning. [Figure: example decision tree from MiniBooNE, Byron Roe.]
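As an illustration (not the MiniBooNE tree itself), a small two-variable tree can be written as nested if-else statements; the cut values and leaf labels below are hypothetical. Each leaf returns the piecewise-constant value f(x) assigned to that bin.

```python
def tree(energy, pmt_hits):
    """A hypothetical two-variable decision tree written as nested if/else.

    Each branch tests one variable against a cut; each leaf returns the
    piecewise-constant value f(x) assigned to that bin
    (1 = signal-like, 0 = background-like).
    """
    if energy < 0.2:            # root cut (hypothetical value)
        if pmt_hits < 100:      # second cut on the left child
            return 0            # leaf: background-like bin
        return 1                # leaf: signal-like bin
    return 0                    # leaf: background-like bin

print(tree(energy=0.15, pmt_hits=150))   # -> 1
```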

To each bin, we associate the value of the function f(x) to be approximated. In this way, we arrive at a piecewise constant approximation of f(x). [Figure: MiniBooNE example (Byron Roe), PMT Hits (0-200) vs. Energy (GeV, 0-0.4), with bins labelled f(x) = 0 or f(x) = 1 and counts B = 10, S = 9; B = 37, S = 4; B = 1, S = 39.]

For each variable, find the best partition ("cut"), defined as the one that yields the greatest

decrease in impurity = Impurity(parent bin) − Impurity("left" bin) − Impurity("right" bin).

Then choose the best partition among all partitions, and repeat with each child bin. [Figure: same MiniBooNE example as on the previous slide.]

The most common impurity measure is the Gini index (Corrado Gini, 1884-1965):

Gini index = p (1 − p),

where p is the purity, p = S / (S + B): p = 0 or 1 corresponds to maximal purity, and p = 0.5 to maximal impurity. [Figure: same MiniBooNE example as on the previous slides.]
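A minimal sketch of the greedy cut search described on the previous two slides, using the Gini index; the toy data are hypothetical, and each child's impurity is weighted by its population fraction (the usual CART convention).

```python
import numpy as np

def gini(y):
    """Gini index p(1 - p) of a bin, where p = S / (S + B) is the purity."""
    if len(y) == 0:
        return 0.0
    p = np.mean(y == 1)                      # fraction of signal events in the bin
    return p * (1.0 - p)

def best_cut(x, y):
    """Scan one variable for the cut giving the largest decrease in impurity,

        Gini(parent) - f_left * Gini(left) - f_right * Gini(right),

    where f_left, f_right are the fractions of events sent to each child bin.
    x holds one feature; y holds labels +1 (signal) / -1 (background).
    """
    parent = gini(y)
    best_value, best_gain = None, 0.0
    for cut in np.unique(x)[:-1]:            # candidate cut values
        left, right = y[x <= cut], y[x > cut]
        f_left = len(left) / len(y)
        gain = parent - f_left * gini(left) - (1.0 - f_left) * gini(right)
        if gain > best_gain:
            best_value, best_gain = cut, gain
    return best_value, best_gain

# hypothetical toy data: signal peaks at larger x
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.1, 0.05, 500), rng.normal(0.3, 0.05, 500)])
y = np.concatenate([-np.ones(500), np.ones(500)])
print(best_cut(x, y))
```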

Until relatively recently, the goal of researchers who worked on classification methods was to construct directly a single high-performance classifier. However, in 1997, the AT&T researchers Y. Freund and R.E. Schapire [Journal of Computer and Sys. Sci. 55(1), 119 (1997)] showed that it was possible to build highly effective classifiers by combining many weak ones! This was the first successful method to boost (i.e., enhance) the performance of poorly performing classifiers by averaging them.

Suppose you have a collection of classifiers f(x, w_k) which, individually, perform only marginally better than random guessing. Such classifiers are called weak learners. It is possible to build highly effective classifiers by averaging many weak learners:

f(x) = a_0 + Σ_{k=1}^{K} a_k f(x, w_k)

Jerome Friedman & Bogdan Popescu (2008)
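A minimal sketch of such an average: a constant plus a weighted sum of weak learners evaluated at a point x. The decision stumps and coefficients below are hypothetical placeholders, not fitted values.

```python
def ensemble(x, a0, terms):
    """Evaluate f(x) = a0 + sum_k a_k * f_k(x) for (coefficient, learner) pairs."""
    return a0 + sum(a_k * f_k(x) for a_k, f_k in terms)

# hypothetical weak learners: decision stumps returning +/-1
stumps = [
    (0.4, lambda x: 1 if x[0] > 0.2 else -1),
    (0.3, lambda x: 1 if x[1] > 10.5 else -1),
    (0.2, lambda x: 1 if x[0] > 0.1 else -1),
]
print(ensemble((0.25, 11.0), a0=0.0, terms=stumps))   # -> 0.9
```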

The most popular methods (used mostly with decision trees) are:
Bagging: each tree is trained on a bootstrap* sample drawn from the training set.
Random Forest: bagging with randomized trees.
Boosting: each tree is trained on a different reweighting of the training set.
*A bootstrap sample is a sample of size N drawn, with replacement, from another sample of the same size; duplicates can occur and are allowed.
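A minimal sketch of bagging: each tree is fitted to its own bootstrap sample and the trees vote. The toy data are hypothetical, and scikit-learn's DecisionTreeClassifier stands in for the base tree.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_trees=51, max_depth=3, seed=0):
    """Train n_trees trees, each on a bootstrap sample of size N drawn
    with replacement (duplicates allowed); odd n_trees avoids tied votes."""
    rng = np.random.default_rng(seed)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(y), size=len(y))   # bootstrap indices
        trees.append(DecisionTreeClassifier(max_depth=max_depth).fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Average the trees' +/-1 votes and take the sign."""
    return np.sign(np.mean([t.predict(X) for t in trees], axis=0))

# hypothetical toy data: two features, labels +/-1
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
trees = bagged_trees(X, y)
print((bagged_predict(trees, X) == y).mean())        # training accuracy
```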

The AdaBoost algorithm of Freund and Schapire uses decision trees f(x, w) as the weak learners, where w are weights assigned to the objects to be classified, each object being associated with a label y = ±1, e.g., +1 for good wine, −1 for bad. The value assigned to each leaf of f(x, w) is also ±1. Consequently, for object n, associated with values (y_n, x_n), the product

f(x_n, w) y_n > 0 for a correct classification,
f(x_n, w) y_n < 0 for an incorrect classification.

Next, we consider the actual boosting algorithm. Y. Freund and R.E. Schapire, Journal of Computer and Sys. Sci. 55(1), 119 (1997).

[The AdaBoost algorithm, from Y. Freund and R.E. Schapire, Journal of Computer and Sys. Sci. 55(1), 119 (1997).]
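Only the citation survives in this transcript, so here is a minimal sketch of the standard discrete AdaBoost loop (as published by Freund and Schapire), not a transcription of the slide; scikit-learn decision stumps stand in for the weak learners f(x, w).

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, n_rounds=100):
    """Discrete AdaBoost with decision stumps; y must hold labels +1 / -1.

    Returns the stumps and coefficients alpha_k, so that
    F(x) = sum_k alpha_k f_k(x).
    """
    n = len(y)
    w = np.full(n, 1.0 / n)                     # event weights, start uniform
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y]) / np.sum(w)  # weighted error rate
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w = w * np.exp(-alpha * y * pred)       # up-weight misclassified events
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, np.array(alphas)

def score(stumps, alphas, X):
    """F(x) = sum_k alpha_k f_k(x); sign(F) is the predicted label."""
    return sum(a * s.predict(X) for a, s in zip(alphas, stumps))
```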

AdaBoost is a very non-intuitive algorithm. However, soon after its invention, Friedman, Hastie and Tibshirani showed that the algorithm is mathematically equivalent to minimizing the following risk (or cost) function,

R(F) = ∫ p(x, y) exp(−y F(x)) dx dy, where F(x) = Σ_{k=1}^{K} α_k f(x, w_k),

which implies that

D(x) = 1 / (1 + exp(−2 F(x)))

can be interpreted as a probability, even though F cannot! J. Friedman, T. Hastie and R. Tibshirani, "Additive logistic regression: a statistical view of boosting", The Annals of Statistics 28(2), 337-407 (2000).
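Continuing the sketch above, the boosted score F(x) can be mapped to the probability-like quantity D(x) from this slide; the adaboost and score helpers referenced in the comment are the hypothetical ones defined earlier, not the lecture's own code.

```python
import numpy as np

def D(F):
    """Map the boosted score F(x) to D(x) = 1 / (1 + exp(-2 F(x))),
    which approximates p(y = +1 | x) under the training-sample class mix."""
    return 1.0 / (1.0 + np.exp(-2.0 * F))

# e.g., with the hypothetical adaboost/score sketch above:
# stumps, alphas = adaboost(X_train, y_train)
# prob_good = D(score(stumps, alphas, X_test))
```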

Let's use the AdaBoost algorithm to build a classifier that can distinguish between good wines and bad wines from the Vinho Verde area of Portugal, using the data from Cortez et al. We'll define a good wine as one with a rating of at least 0.7 on a scale from 0 to 1, where 1 is a wine from Heaven and 0 is a wine from Hell! First, let's look at the training data.

Data: [Cortez et al., 2009]. Variables and descriptions:
acetic: acetic acid
citric: citric acid
sugar: residual sugar
salt: NaCl
SO2free: free sulfur dioxide
SO2tota: total sulfur dioxide
ph: pH
sulfate: potassium sulfate
alcohol: alcohol content
quality: quality (between 0 and 1)


To make visualization easier, we'll use only two variables:
SO2tota: total sulfur dioxide content (mg/dm3)
alcohol: alcohol content (% by volume)

With x = SO2tota and y = alcohol, the boosted decision tree is the sum of 100 trees,

BDT(x, y) = Σ_{k=0}^{99} a_k f(x, y, w_k).

[Figures: distribution of the BDT response, and the fraction of bad wine rejected for a given fraction of good wine accepted.]
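A minimal sketch of how such a two-variable BDT might be trained and its response and rejection-vs-efficiency curve computed with scikit-learn; the file name "wine.csv" is an assumption, and the column names follow the variable table above rather than the actual files used in the lecture.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# hypothetical file layout: one row per wine, columns as in the table above
df = pd.read_csv("wine.csv")                        # assumed file name
X = df[["SO2tota", "alcohol"]].values               # the two chosen variables
y = np.where(df["quality"] >= 0.7, 1, -1)           # good (+1) vs bad (-1) wine

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 boosted trees, analogous to the sum over k = 0..99 on the slide
# (scikit-learn's default base learner is a depth-1 decision stump)
bdt = AdaBoostClassifier(n_estimators=100, random_state=0)
bdt.fit(X_train, y_train)

# BDT response, and bad-wine rejection vs good-wine acceptance
response = bdt.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, response)           # fpr: bad accepted, tpr: good accepted
bad_rejected = 1.0 - fpr
```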

The upper figures are density plots of the training data. The lower plots are approximations of

D = p(x, y | good) / [ p(x, y | good) + p(x, y | bad) ].

The left plot uses 2-D histograms; the right uses the BDT.
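A minimal sketch of the histogram-based approximation of D, built from 2-D histograms of the good-wine and bad-wine training samples on a common binning; the input arrays are hypothetical.

```python
import numpy as np

def histogram_D(X_good, X_bad, bins=20):
    """Approximate D = p(x,y|good) / (p(x,y|good) + p(x,y|bad)) on a 2-D grid.

    X_good, X_bad : arrays of shape (n, 2) holding (SO2tota, alcohol) for the
    good-wine and bad-wine training samples (hypothetical here).
    """
    both = np.vstack([X_good, X_bad])
    edges = [np.linspace(both[:, i].min(), both[:, i].max(), bins + 1)
             for i in range(2)]                       # common binning
    h_good, _, _ = np.histogram2d(X_good[:, 0], X_good[:, 1], bins=edges, density=True)
    h_bad, _, _ = np.histogram2d(X_bad[:, 0], X_bad[:, 1], bins=edges, density=True)
    with np.errstate(invalid="ignore", divide="ignore"):
        D = h_good / (h_good + h_bad)                 # NaN where both densities vanish
    return D, edges
```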

Let's dig more deeply...


It is possible to average many relatively crude decision trees to obtain a better approximation to the function

D(x, y) = p(x, y | good) / [ p(x, y | good) + p(x, y | bad) ].

Tomorrow, we shall consider examples of classification and regression using neural networks.