Resampling Methods
CAPT David Ruth, USN
Mathematics Department, United States Naval Academy
Science of Test Workshop, 05 April 2017

Outline
- Overview of resampling methods
- Bootstrapping
- Cross-validation
- Permutation tests

Overview of resampling methods
Resampling methods are methods applied to an existing sample. The gist: sample from the existing sample to obtain new sample(s).
- Bootstrapping: sample WITH replacement, treating the original sample as a proxy for the population of interest.
- Cross-validation and permutation tests: sample WITHOUT replacement, using exchangeability assumptions for inference.

Bootstrapping

Bootstrapping
The bootstrap is a tool for assessing statistical accuracy.
Goal: estimate any aspect of the distribution of $S(\mathbf{Z})$, where $S$ is a statistic of interest and $\mathbf{Z} = (Z_1, \ldots, Z_n)$.
Idea: approximate the (unknown) distribution function $F$ of the $Z_i$'s with the empirical distribution function $\hat{F}$. Draw bootstrap samples from $\hat{F}$ to estimate the quantity of interest.

Bootstrapping
For example, we can estimate the variance of $S(\mathbf{Z})$ by
$$\widehat{\mathrm{Var}}[S(\mathbf{Z})] = \frac{1}{B-1} \sum_{i=1}^{B} \left( S(\mathbf{Z}^{*i}) - \bar{S}^{*} \right)^2,$$
where $\bar{S}^{*} = \frac{1}{B} \sum_{i=1}^{B} S(\mathbf{Z}^{*i})$ is the mean of the bootstrap replications.

Bootstrapping
Algorithm: given the original sample $\mathbf{Z} = (Z_1, \ldots, Z_N)$, choose some large $B$ as the number of bootstrap samples.

for (i in 1:B) {
  # Sample Z WITH REPLACEMENT to obtain bootstrap sample Z*i = (Z*i1, ..., Z*iN)
  # Compute statistic S(Z*i)
}

Use the $B$ bootstrap replications $S(\mathbf{Z}^{*1}), \ldots, S(\mathbf{Z}^{*B})$ to estimate the quantity of interest.


Bootstrapping R
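The live R demonstration did not survive transcription; below is a minimal base-R sketch of the algorithm above, estimating the bootstrap variance of a sample median. The data vector z, the choice of statistic, and the value of B are illustrative, not from the original demo.

set.seed(1)
z <- rnorm(100)   # illustrative original sample Z
B <- 2000         # number of bootstrap samples
S <- median       # statistic of interest

# Draw B bootstrap samples WITH replacement and compute S on each
s.boot <- replicate(B, S(sample(z, length(z), replace = TRUE)))

# Bootstrap estimate of Var[S(Z)], matching the formula above (equals var(s.boot))
sum((s.boot - mean(s.boot))^2) / (B - 1)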

Cross-validation

Cross-validation
One of the simplest and most widely used methods for estimating prediction error. If we had enough data, we might set aside a test set or validation set to assess a fitted prediction model's performance. Why not just use the training data as test data? OVERFITTING!

Cross-validation
Goal: estimate the expected extra-sample error $\mathrm{E}[L(Y, \hat{f}(X))]$ when $(X, Y)$ are drawn from an independent test sample (e.g., $\mathrm{E}[(Y - X\hat{\beta})^2]$).
Idea: partition the original data set into training and test sets. Fit the model to the training set; validate the analysis on the test set. Perform multiple rounds of partitioning and average the prediction error over all rounds.

Cross-validation
K-fold cross-validation is widely used, with $K = 5$ or $K = 10$ recommended as a good compromise.
Algorithm: split the data into $K$ roughly equal-sized parts.

for (k in 1:K) {
  # Leave out part k, and fit the model to the remaining parts.
  # Compute the prediction error of the fitted model on part k.
}

Average the $K$ prediction errors.


Real-world example: Heart data
Data: heart disease diagnosis for 303 patients at the Cleveland Clinic Foundation, plus 75 other attributes.
We consider 5 quantitative and 8 categorical explanatory variables with a separate binary response (presence of heart disease, evident in 139 of 303 patients):

    age sex cp trestbps chol fbs restecg thalach exang oldpeak slope  ca thal
  1  63   1  1      145  233   1       2     150     0     2.3     3 0.0  6.0
  2  67   1  4      160  286   0       2     108     1     1.5     2 3.0  3.0
  3  67   1  4      120  229   0       2     129     1     2.6     2 2.0  7.0
  4  37   1  3      130  250   0       0     187     0     3.5     3 0.0  3.0

Data available from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/heart+disease
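A sketch of one way to pull these data into R. The file path within the UCI repository, the use of "?" for missing values, and the derived disease indicator are assumptions to verify against the repository documentation.

# Processed Cleveland data (path assumed from the UCI repository layout)
u <- "http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart <- read.csv(u, header = FALSE, na.strings = "?")
names(heart) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
                  "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
heart$disease <- as.integer(heart$num > 0)  # binary response: any heart disease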

Cross-validation R
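The cross-validation demo is likewise missing from the transcript; here is a minimal base-R sketch of the K-fold algorithm above, applied to a logistic regression on simulated data. The model and data are placeholders, not the presenter's example.

set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(dat$x1 - dat$x2))

K <- 10
fold <- sample(rep(1:K, length.out = n))  # random assignment of rows to folds
err <- numeric(K)
for (k in 1:K) {
  fit  <- glm(y ~ x1 + x2, family = binomial, data = dat[fold != k, ])
  phat <- predict(fit, newdata = dat[fold == k, ], type = "response")
  err[k] <- mean((phat > 0.5) != dat$y[fold == k])  # misclassification on fold k
}
mean(err)  # cross-validated estimate of prediction error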

Permutation tests

Permutation tests
Permutation tests may be used to test for a distributional difference among groups. Under the assumption that observations are exchangeable, the distribution of a test statistic may be approximated by its empirical distribution under all possible label permutations. "All possible" may be rather large, so RESAMPLE!

Permutation tests
- Simple example: difference in resting blood pressure from the heart data (univariate)
- Noisy example: difference in resting blood pressure from the heart data with tainted labels (univariate)
- MCC example: difference in resting blood pressure from the heart data with tainted labels (multivariate)

Permutation tests R
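The permutation-test demo is also absent; below is a minimal base-R sketch of the simple example described above, permuting disease labels to test for a difference in mean resting blood pressure. It assumes the heart data frame and disease indicator constructed earlier; the statistic and B are illustrative.

B <- 9999
obs <- with(heart, mean(trestbps[disease == 1]) - mean(trestbps[disease == 0]))

perm <- replicate(B, {
  lab <- sample(heart$disease)  # permute labels: sampling WITHOUT replacement
  mean(heart$trestbps[lab == 1]) - mean(heart$trestbps[lab == 0])
})

(1 + sum(abs(perm) >= abs(obs))) / (B + 1)  # two-sided permutation p-value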

Mean Cross Count Test

Problem statement
Given: two sets of independent multivariate observations (with categorical or quantitative attributes, or both).
Formally: $N = m + n$ independent $p$-variate observations
$$A = \{X_1, \ldots, X_m\}, \quad B = \{X_{m+1}, \ldots, X_N\},$$
$$X_i \sim F \text{ for } 1 \le i \le m, \quad X_j \sim G \text{ for } m+1 \le j \le N.$$
Goal: develop a robust two-sample test of $H_0: F = G$.

Gower dissimilarity for mixed data
Given $p$-variate observations $X_1, X_2, \ldots, X_N$, the dissimilarity $d_{ij,k}$ between observations $X_i$ and $X_j$ on covariate $k$ is given by
$$d_{ij,k} = \begin{cases} 0 & \text{if covariate } k \text{ is categorical and } x_{ik} = x_{jk}, \\ 1 & \text{if covariate } k \text{ is categorical and } x_{ik} \ne x_{jk}, \\ |x_{ik} - x_{jk}| / R_k & \text{if covariate } k \text{ is quantitative}, \end{cases}$$
where $x_{ik}$ and $x_{jk}$ are the $i$th and $j$th entries, respectively, in the column associated with covariate $k$, and $R_k$ is the range of covariate $k$.
Gower's dissimilarity measure is a weighted average:
$$d_{\mathrm{Gower}}(X_i, X_j) = \frac{\sum_{k=1}^{p} w_{ij,k}\, d_{ij,k}}{\sum_{k=1}^{p} w_{ij,k}}.$$
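In R, one readily available implementation is daisy() in the cluster package, which computes Gower dissimilarities for mixed data; a minimal sketch follows (the toy data frame is illustrative).

library(cluster)
df <- data.frame(x = c(1.0, 2.5, 4.0),          # quantitative covariate
                 g = factor(c("a", "b", "a")))  # categorical covariate
d.gower <- daisy(df, metric = "gower")          # pairwise Gower dissimilarities
as.matrix(d.gower)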

Tree-based dissimilarity
In the CART setting, consider two observations to be alike if they fall into the same leaf of a tree.
Create $p$ trees, using each predictor in turn as a response variable (trees may be pruned to avoid overfitting). Let
$$I_k(i, j) = I\{X_i \text{ and } X_j \text{ are in different leaves of tree } k\}.$$
Define the treeClust dissimilarity as
$$d_{\mathrm{treeClust}}(X_i, X_j) = \frac{1}{K} \sum_{k=1}^{K} w_k I_k(i, j)$$
(or a variation on this theme). The R package treeClust finds these pairwise dissimilarities.
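A sketch of computing these dissimilarities with the treeClust package. The exact call, in particular the d.num argument selecting the dissimilarity variant, is stated from memory of the package and should be checked against its documentation.

library(treeClust)
# Pairwise tree-based dissimilarities for a mixed-type data frame df
# (d.num = 1 is assumed to correspond to the simple (1/K) sum above)
d.tc <- treeClust.dist(df, d.num = 1)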

Mean Cross Count test
Given some choice of informative distance measure for mixed data, we use a graph-theoretic approach to determine whether a difference exists between the groups.
Let each observation be a vertex in a graph $G$, and let each pair of observations define an (undirected) edge. Assign interpoint dissimilarities as edge weights (from treeClust, for example).

Mean Cross Count test
With respect to the weighted graph $G$, find a minimum-weight $r$-regular spanning subgraph, $G_r$. Note: $G_r$ does not depend on group labels.
Let $A_r$ be the number of edges in $G_r$ that have one vertex in the first group and the other in the second group. Call
$$T_r = \frac{A_r}{r}$$
the Mean Cross-Count (MCC), and use this statistic for our two-sample test.
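To make the statistic concrete, here is a small base-R illustration of computing $A_r$ and $T_r$ from a given $r$-regular subgraph. The edge list and labels are toy values; in practice the subgraph comes from the minimum-weight optimization (see the AcrossTic package below).

# Edge list of an r-regular spanning subgraph: one row per edge (toy example)
edges <- cbind(c(1, 1, 2, 3),
               c(2, 3, 4, 4))
grp <- c("A", "A", "B", "B")  # group label of each vertex
r   <- 2                      # every vertex has degree r in G_r

A.r <- sum(grp[edges[, 1]] != grp[edges[, 2]])  # edges crossing the two groups
T.r <- A.r / r                                  # Mean Cross-Count statistic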

[Example from the slides: in a minimum-weight 3-regular spanning subgraph, $A_r = 12$ edges cross between the two groups, giving $T_r = A_r / r = 12/3 = 4$, with permutation p-value $\approx .11$.]

Supporting R packages
- treeClust (tree clustering)
  - produces tree-based dissimilarities
  - allows a number of user options
- AcrossTic (A cost-minimal regular spanning subgraph with Tree clustering)
  - finds minimum-weight regular spanning subgraphs and the associated test statistic for the two-sample problem
  - ptest performs a permutation test for p-values
  - plot allows a 2D view of the spanning subgraph and MCC

Mean Cross Count test R

References
Anglemyer, A., Miller, M., Buttrey, S., and Whitaker, L. (2015), "Suicide Rates and Methods in Active Duty Military Personnel, 2005 to 2011: A Cohort Study," Annals of Internal Medicine, 165(3).
Blake, C. and Merz, C. (1998), UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Sciences.
Buttrey, S. and Whitaker, L. (2015), "treeClust: An R Package for Tree-Based Clustering Dissimilarities," The R Journal, 7(2).
Friedman, J., Hastie, T., and Tibshirani, R. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics.
Ruth, D. (2014), "A New Multivariate Two-Sample Test Using Regular Minimum-Weight Spanning Subgraphs," Journal of Statistical Distributions and Applications, 1(1).