Classification using stochastic ensembles

July 31, 2014

Topics: Introduction; Classification and application; Classification and Regression Trees; Stochastic ensemble methods; Our application: USAID Poverty Assessment Tools

Classification. Classification, or predictive discriminant analysis, involves the assignment of observations to classes. Predictions are based on a model trained on a dataset in which class membership is known (Huberty 1994, Rencher 2002, Hastie et al. 2009). We are predicting a qualitative response; with more than two classes, linear regression methods are generally not appropriate. Methods are available in statistics, machine learning, and predictive analytics.

Our classification problem: identifying poor from nonpoor. To fulfill the terms of a social safety net intervention in a developing country, we wish to classify households as poor or nonpoor based on a set of observable household characteristics. Households classified as poor will receive a cash transfer. Our objective is accurate out-of-sample prediction: we want to make predictions about a household's poverty status (otherwise unknown) using a model trained on other households (poverty status known) in that population. Assume that we are indifferent between types of misclassification. This is a stylized example of the problem faced by USAID, the World Bank, and other institutions attempting to target the poor in developing countries, where income status is difficult to assess.

Discriminant methods available in Stata. Many discrimination methods are available, including linear, quadratic, logistic, and nonparametric methods: [MV] discrim lda and [MV] candisc provide linear discriminant analysis; [MV] discrim qda provides quadratic discriminant analysis; [MV] discrim logistic provides logistic discriminant analysis; [MV] discrim knn provides kth-nearest-neighbor discrimination.

Discriminant methods available in Stata
Linear discriminant analysis (LDA): assumes observations within each class are normally distributed with class-specific means but common variance; assigns each observation to the class for which the estimated posterior probability is greatest.
Quadratic discriminant analysis (QDA): assumes observations within each class are normally distributed with class-specific means and variances; assigns each observation to the class for which the estimated posterior probability is greatest.

Discriminant methods available in Stata
Logistic discriminant analysis (LD): places a distributional assumption on the likelihood ratio; assigns each observation to the class for which the estimated posterior probability is greatest.
kth-nearest-neighbor discrimination (KNN): nonparametric; assigns each observation to a class based on its distance from neighbors belonging to that class.
Methods not available (or only available in limited form) in Stata include SVM and various ensemble methods. A sketch of analogous methods in R follows.
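As a point of reference outside Stata, here is a minimal sketch of analogous discriminant methods in R using the MASS and class packages; the data frames train and test, the factor outcome poor, and k = 5 are hypothetical choices, not part of the original presentation.

# LDA, QDA, and KNN in R; predictors are assumed numeric for knn().
library(MASS)
library(class)

lda_fit  <- lda(poor ~ ., data = train)         # linear discriminant analysis
lda_pred <- predict(lda_fit, test)$class        # class with greatest posterior probability

qda_fit  <- qda(poor ~ ., data = train)         # quadratic discriminant analysis
qda_pred <- predict(qda_fit, test)$class

x_train  <- subset(train, select = -poor)       # kth-nearest-neighbor discrimination
x_test   <- subset(test,  select = -poor)
knn_pred <- knn(x_train, x_test, cl = train$poor, k = 5)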

Introduction. Classification and regression trees recursively partition a feature space to meet some criterion (entropy reduction, minimized prediction error, etc.). Predictions for a given set of features are based on the relative proportion of classes found in a terminal node (classification) or on the mean response for the data in that partition (regression).

[Figure: example classification tree splitting households on head age (>= 32 vs. < 32), roof material (straw vs. tin), and household size (< 5 vs. >= 5)]
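Such a tree can be grown directly in R; the following is a minimal sketch, where the data frames households and new_households and the variable names poor, hh_head_age, roof, and hh_size are hypothetical.

# Fit a single classification tree and predict class membership for new households.
library(rpart)

tree <- rpart(poor ~ hh_head_age + roof + hh_size,
              data = households, method = "class")   # class proportions in each terminal node
print(tree)                                          # displays the recursive binary splits
pred <- predict(tree, newdata = new_households, type = "class")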

Stochastic ensemble methods Ensemble methods construct many models on subsets of data (e.g. via resampling with replacement); they then average across these models (or allow them to vote) to obtain a less noisy prediction. One version of this is known as bootstrap aggregation, or bagging (Breiman 1996a). A stochastic ensemble method adds randomness to the construction of the models. This has the advantage of de-correlating models across subsets, which can reduce total variance (Breiman 2001). Out-of-sample error is estimated by training the model in each randomly selected subset, and using the balance of the data to test the model. This estimated out-of-sample error is an unbiased estimator of the true out-of-sample prediction error (Breiman 1996b).
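As an illustration of bootstrap aggregation, the following is a minimal sketch in R of bagging classification trees by hand; households, new_households, poor, and B = 100 are hypothetical choices, and in practice a packaged implementation would be used.

# Grow B trees on bootstrap resamples and combine them by majority vote.
library(rpart)

set.seed(1)
B <- 100
n <- nrow(households)
votes <- matrix(NA_character_, nrow = nrow(new_households), ncol = B)

for (b in 1:B) {
  idx <- sample(n, n, replace = TRUE)                 # resample with replacement
  fit <- rpart(poor ~ ., data = households[idx, ], method = "class")
  votes[, b] <- as.character(predict(fit, new_households, type = "class"))
}

bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote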

Stochastic ensemble methods for trees. Stochastic ensemble methods can address the weaknesses of tree models (Breiman 2001). For example, if we allow trees to grow large, we trade bias for variance: large trees will have high variance but low bias. Bagging produces a large number of approximately unbiased and identically distributed trees, and averaging or voting across these trees significantly reduces the variance of the classifier. The variance of the classifier can be further reduced by de-correlating the trees via the introduction of randomness in the tree-building process. This combination of bagging and de-correlation in ensembles of trees produces classification and regression forests.

In R, classification and regression forests can be generated with randomForest (Breiman and Cutler 2001, Liaw and Wiener 2002). Extensions such as quantile regression forests, via quantregForest (Meinshausen 2006), are also available; a brief sketch follows this list.
Classification Forest (CF) and Regression Forest (RF): minimizes classification error (CF) or MSE (RF) in each binary partition; uses bagging to reduce variance; randomizes over the variables available at any given split; estimates out-of-sample prediction error in the out-of-bag sample.
Quantile Regression Forest (QRF): similar to RF, but estimates the entire conditional distribution of the response variable through a weighting function; the regression-forest analog of quantile regression.
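A minimal sketch of these estimators in R; the data frames and variable names (poor as a factor, log_consumption as a continuous welfare measure) are hypothetical, and in older versions of quantregForest the prediction argument is named quantiles rather than what.

library(randomForest)
library(quantregForest)

# Classification forest: majority vote across trees; print() reports the out-of-bag error.
cf <- randomForest(poor ~ ., data = households, ntree = 500)
print(cf)
pred_class <- predict(cf, new_households)

# Regression forest: averaged predictions; mtry is the number of variables
# sampled at each split (the de-correlating randomness).
rf <- randomForest(log_consumption ~ ., data = households, ntree = 500, mtry = 3)
pred_mean <- predict(rf, new_households)

# Quantile regression forest: conditional quantiles of the response.
x   <- households[, setdiff(names(households), c("log_consumption", "poor"))]
qrf <- quantregForest(x = x, y = households$log_consumption)
pred_q <- predict(qrf, new_households[, names(x)], what = c(0.25, 0.5, 0.75))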

In Stata, we will use the user-written command stens (Nichols 2014) to classify households based on an ensemble of perfect random trees (Cutler and Zhao 2001). The Ensemble of Perfect Random Trees grows trees randomly and averages over the most influential voters.

Poverty Assessment Tools. The Poverty Assessment Tools (PATs) were developed by the University of Maryland IRIS Center for USAID. The IRIS tool is typically developed via quantile regression in a randomly selected subset of the data. Accuracy (out-of-sample prediction error) is assessed on the data not used for model development.

Our methods. We replicate the IRIS tool development process using the same publicly available, nationally representative Living Standards Measurement Survey datasets; we then attempt to improve on their estimates. We randomly divide the data into training and testing sets, estimate the model in the training data, and then assess accuracy in the testing data. We iterate this process 1,000 times and report the means and the 2.5th and 97.5th percentile confidence intervals. We use the 2005 Bolivia Household Survey, the 2004/5 Malawi Integrated Household Survey, and the 2001 East Timor Living Standards Survey. A sketch of this repeated holdout procedure follows.
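A minimal sketch in R of the repeated train/test evaluation described above; the 50/50 split fraction, the use of a classification forest, and the variable names are assumptions for illustration.

library(randomForest)

set.seed(1)
R <- 1000
acc <- numeric(R)

for (r in 1:R) {
  idx   <- sample(nrow(households), floor(0.5 * nrow(households)))   # random split
  train <- households[idx, ]
  test  <- households[-idx, ]
  fit   <- randomForest(poor ~ ., data = train)
  acc[r] <- mean(predict(fit, test) == test$poor)                    # out-of-sample total accuracy
}

mean(acc)                         # mean accuracy across the 1,000 iterations
quantile(acc, c(0.025, 0.975))    # 2.5th and 97.5th percentile interval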

Classification error. The confusion matrix for the binary poverty indicator P:
                    Actual P = 1           Actual P = 0
Predicted P = 1     True Positive (TP)     False Positive (FP)
Predicted P = 0     False Negative (FN)    True Negative (TN)

Classification error. For our application, we're interested in five accuracy measures:
Total Accuracy (TA) = (TP + TN)/N = 1 - (FN + FP)/N = 1 - MSE
Poverty Accuracy (PA) = TP/(TP + FP)
Undercoverage (UC) = FN/(TP + FN)
Leakage (LE) = FP/(TP + FN)
Balanced Poverty Accuracy Criterion (BPAC) = TP/(TP + FP) - |FN/(TP + FP) - FP/(TP + FP)|
The next five slides present the comparative out-of-sample accuracy of discriminant analysis and stochastic ensemble methods in these datasets; a sketch of the calculations follows.
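A minimal sketch in R computing the five measures from 0/1 vectors of predicted and actual poverty status; the BPAC line follows the definition as reconstructed on this slide.

accuracy_measures <- function(pred, actual) {
  TP <- sum(pred == 1 & actual == 1)
  TN <- sum(pred == 0 & actual == 0)
  FP <- sum(pred == 1 & actual == 0)
  FN <- sum(pred == 0 & actual == 1)
  N  <- length(actual)
  c(TA   = (TP + TN) / N,
    PA   = TP / (TP + FP),
    UC   = FN / (TP + FN),
    LE   = FP / (TP + FN),
    BPAC = TP / (TP + FP) - abs(FN / (TP + FP) - FP / (TP + FP)))
}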

[Figure: Total Accuracy, out-of-sample comparison of discriminant analysis and stochastic ensemble methods across the three datasets]

[Figure: Poverty Accuracy, out-of-sample comparison across methods and datasets]

[Figure: Leakage, out-of-sample comparison across methods and datasets]

[Figure: Undercoverage, out-of-sample comparison across methods and datasets]

[Figure: Balanced Poverty Accuracy, out-of-sample comparison across methods and datasets]

Bioinformatics and predictive analytics. Stochastic ensemble methods in general, and random forests in particular, have become essential tools in a variety of applications. Bioinformatics: comparing against DL, KNN, and SVM, Diaz-Uriarte and Alvarez de Andres (2006) conclude that "because of its performance and features, random forest and gene selection using random forest should probably become part of the standard tool-box of methods for class prediction and gene selection with microarray data." Kaggle and predictive analytics: the following Kaggle competitions were won using random forests: semi-supervised feature learning (computer science); air quality prediction (environmental science); RTA freeway travel time prediction (urban development/economics). Other applications include remote sensing, diagnostics, and spam filters.

Stochastic ensemble methods have broad applicability to classification and prediction problems; we find their use promising in poverty assessment tool development. Such methods would be additional assets in the Stata classification tool kit.

References
Breiman, L. 1996a. Bagging predictors. Machine Learning, 26(2): 123-140.
Breiman, L. 1996b. Out-of-bag estimation. ftp.stat.berkeley.edu/pub/users/breiman/oobestimation.ps
Breiman, L. 2001. Random forests. Machine Learning, 45: 5-32.
Breiman, L. and A. Cutler. 2007. Random Forests. www.stat.berkeley.edu/~breiman/randomforests/cc_home.htm
Cutler, A. and G. Zhao. 2001. PERT: Perfect random tree ensembles. Computing Science and Statistics, 33.
Diaz-Uriarte, R. and S. Alvarez de Andres. 2006. Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7: 3.
Huberty, C. 1994. Applied Discriminant Analysis. New York: Wiley.
Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. New York: Springer.
Liaw, A. and M. Wiener. 2002. Classification and regression by randomForest. R News, 2: 18-22.
Meinshausen, N. 2006. Quantile regression forests. Journal of Machine Learning Research, 7: 983-999.
Rencher, A. 2002. Methods of Multivariate Analysis. 2nd ed. New York: Wiley.

Classification error. For a two-group classification problem, when misclassification costs are equal,
MSE = (1/N) sum_{i=1}^{N} (P̂_i - P_i)^2 = (FN + FP)/N,
where P̂_i is the predicted and P_i the actual poverty status of household i.
                    Actual P = 1           Actual P = 0
Predicted P = 1     True Positive (TP)     False Positive (FP)
Predicted P = 0     False Negative (FN)    True Negative (TN)