Diagnostics. Gad Kimmel

Outline Introduction. Bootstrap method. Cross validation. ROC plot.

Introduction

Motivation Estimating properties of an estimator: given data samples x_1, x_2, ..., x_N, evaluate some estimator, say the average. How can we estimate its properties (e.g., its variance)? Model selection: how many parameters should we use?

Bootstrap Method

Evaluating Accuracy A simple approach for accuracy estimation is to provide the bias or variance of the estimator. Example: suppose the samples are independently identically distributed (i.i.d.), with finite variance. We know, by the central limit theorem, that $\sqrt{n}\,(\bar{x}_n - \mu)/\sigma \rightarrow Z \sim N(0,1)$. Roughly speaking, $\bar{x}_n$ is normally distributed with expectation $\mu$ and variance $\sigma^2/n$.
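
A minimal simulation (not from the slides, assuming NumPy is available; mu, sigma and n are arbitrary placeholder values) can make this CLT statement concrete:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, n, reps = 5.0, 2.0, 100, 10_000

    # Draw `reps` independent i.i.d. samples of size n and compute each sample mean.
    means = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

    print(means.mean())  # close to mu = 5.0
    print(means.var())   # close to sigma^2 / n = 0.04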

Assumptions Do Not Hold What if the r.v. are not i.i.d.? What if we want to evaluate another estimator (and not $\bar{x}_n$)? It would be nice to have many different samples of samples. In that case, one could calculate the estimator for each sample of samples and infer its distribution. But... we don't have them.

Solution - Bootstrap Estimating the sampling distribution of an estimator by resampling with replacement from the original sample. Efron, The Annals of Statistics, '79.

Bootstrap - Illustration Goal: sampling from P, obtaining x_1, x_2, x_3, x_4, ..., x_n, in order to estimate the variance of an estimator.

Bootstrap - Illustration If we could draw many samples from P, we could compute the estimator on each one:

Samples (each of size n, drawn from P)          Estimator
x_{1,1}, x_{1,2}, x_{1,3}, ..., x_{1,n}         e_1
x_{2,1}, x_{2,2}, x_{2,3}, ..., x_{2,n}         e_2
x_{3,1}, x_{3,2}, x_{3,3}, ..., x_{3,n}         e_3
x_{4,1}, x_{4,2}, x_{4,3}, ..., x_{4,n}         e_4
...
x_{m,1}, x_{m,2}, x_{m,3}, ..., x_{m,n}         e_m

What is the variance of e? Estimate it by $\widehat{\mathrm{var}}(e) = \frac{1}{m}\sum_{i=1}^{m}(e_i - \bar{e})^2$.

Bootstrap - Illustration We only have one sample: x_1, x_2, x_3, x_4, ..., x_n.

Bootstrap - Illustration Sampling is done from the empirical distribution of x_1, x_2, x_3, x_4, ..., x_n:

Bootstrap samples (each of size n)              Estimator
z_{1,1}, z_{1,2}, z_{1,3}, ..., z_{1,n}         e_1
z_{2,1}, z_{2,2}, z_{2,3}, ..., z_{2,n}         e_2
z_{3,1}, z_{3,2}, z_{3,3}, ..., z_{3,n}         e_3
z_{4,1}, z_{4,2}, z_{4,3}, ..., z_{4,n}         e_4
...
z_{m,1}, z_{m,2}, z_{m,3}, ..., z_{m,n}         e_m

Formalization The data is x_1, x_2, ..., x_n ~ P. Note that the distribution function P is unknown. We sample m samples Y_1, Y_2, ..., Y_m. Each Y_i = (z_{i,1}, z_{i,2}, ..., z_{i,n}) contains n samples drawn from the empirical distribution of the data: $\Pr[z_{j,k} = x_i] = \frac{\#x_i}{n}$, where $\#x_i$ is the number of times x_i appears in the original data.

The Main Idea $Y_i \sim \hat{P}$, the empirical distribution. We wish that $\hat{P} = P$. Is it (always) true? NO. Rather, $\hat{P}$ is an approximation of P.
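
A minimal sketch of the procedure formalized above (assuming NumPy; not from the slides): resample the data with replacement m times, recompute the estimator on each bootstrap sample, and take the spread of the resulting e_1, ..., e_m. The sample mean is used as the estimator only for illustration; any function of the data could be plugged in.

    import numpy as np

    def bootstrap_variance(data, estimator, m=1000, seed=0):
        """Estimate the variance of `estimator` by resampling with replacement."""
        rng = np.random.default_rng(seed)
        n = len(data)
        # Each row is one bootstrap sample z_{i,1}, ..., z_{i,n} drawn from the
        # empirical distribution, i.e. uniformly with replacement from the data.
        samples = rng.choice(data, size=(m, n), replace=True)
        e = np.array([estimator(s) for s in samples])  # e_1, ..., e_m
        return e.var(ddof=1)

    data = np.random.default_rng(1).exponential(scale=2.0, size=50)
    print(bootstrap_variance(data, np.mean))  # compare with the CLT value sigma^2 / n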

Example 1 The yield of the Dow Jones Index over the past two years is ~12%. You are considering a broker that had a yield of 25%, by picking specific stocks from the Dow Jones. Let x be a r.v. that represents the yield of randomly selected stocks. Do we know the distribution of x?

Example 1 (cont.) Prepare a sample x_1, x_2, ..., x_{10,000}, where each x_i is the yield of randomly selected stocks. Approximate the distribution of x using this sample.

Evaluation of Estimators Using the approximate distribution, we can evaluate estimators. E.g.: Variance of the mean. Confidence intervals.

Example 1 (cont.) What is the probability of obtaining a yield larger than 25% (the p-value)? Answer: 30%.
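
A sketch of how such a number could be computed, assuming NumPy; the yields below are synthetic placeholders since the slides do not give the raw data. The plug-in estimate of Pr(yield > 25%) is read directly off the empirical distribution, and a bootstrap percentile interval quantifies its uncertainty (tying back to the confidence intervals mentioned above).

    import numpy as np

    rng = np.random.default_rng(2)
    x = rng.normal(0.12, 0.30, size=10_000)  # hypothetical yields of random stock picks

    p_hat = (x > 0.25).mean()  # plug-in estimate of Pr(yield > 25%)

    # Bootstrap percentile interval: resample the data, recompute the estimate each time.
    boot = np.array([(rng.choice(x, size=x.size, replace=True) > 0.25).mean()
                     for _ in range(1000)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(p_hat, (lo, hi))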

Example 2 - Decision tree Decision tree - short introduction.

Example 2 Building a decision tree.

Example 2 Many other trees can be built, using different algorithms. For a specific tree one can calculate the prediction accuracy: (# of elements classified correctly) / (total # of elements). To put error bars on this value, we would need to sample more data, apply the algorithm many times, and each time evaluate the prediction accuracy.

Example 2 - Applying Bootstrap Build a decision tree T_1, T_2, ..., T_n for each sample. Calculate the prediction accuracy p_1, p_2, ..., p_n for each tree. Evaluate error bars from these predictions, e.g. ±1.96 · STD(p_1, p_2, ..., p_n). But we have only one data set! So use the bootstrap to prepare many samples and then apply the same steps, as in the sketch below.
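
A sketch of these steps, assuming scikit-learn and NumPy are available; the dataset and the tree settings (max_depth=3, 200 bootstrap replicates) are placeholders, not from the slides.

    import numpy as np
    from sklearn.datasets import load_breast_cancer            # placeholder dataset
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    rng = np.random.default_rng(0)
    accuracies = []
    for _ in range(200):                                        # 200 bootstrap samples
        idx = rng.integers(0, len(X_train), size=len(X_train))  # resample with replacement
        tree = DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx])
        accuracies.append(tree.score(X_test, y_test))           # prediction accuracy p_i

    p = np.array(accuracies)
    print(p.mean(), 1.96 * p.std())                             # point estimate and error bar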

Cross Validation

Objective Model selection.

Formalization Let (x, y) be drawn from distribution P, where $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$. Let $f_\theta : \mathbb{R}^n \rightarrow \mathbb{R}$ be a learning algorithm, with parameter(s) $\theta$.

Example Regression model.

[Figures: the same data fitted with polynomial regression of order 1 (linear), order 2, order 3, order 4, order 5, and a "join the dots" fit; axes roughly x from 4 to 16 and y from 0 to 20.]

What Do We Want? We want the method that is going to predict future data most accurately, assuming they are drawn from the distribution P. Niels Bohr: "It is very difficult to make an accurate prediction, especially about the future."

Choosing the Best Model For a sample (x, y) which is drawn from the distribution function P, the error is $(f_\theta(x) - y)^2$ or $|f_\theta(x) - y|$. Since (x, y) is a r.v., we are usually interested in $E[(f_\theta(x) - y)^2]$.

Choosing the Best Model (cont.) Choose the parameter(s) $\theta$: $\arg\min_\theta E[(f_\theta(x) - y)^2]$. The problem is that we don't know how to sample from P.

Solution - Cross Validation Partition the data into 2 sets: a training set T and a test set S. Calculate $\theta$ using only the training set T. Given $\theta$, calculate $\frac{1}{|S|}\sum_{(x_i, y_i) \in S} (f_\theta(x_i) - y_i)^2$.

Back to the Example In our case, we should try different orders for the regression (i.e., different # of params). Each time, fit the regression only on the training set and calculate the estimation error on the test set. The # of parameters chosen is the one minimizing the test error.
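
A minimal sketch of this procedure using plain NumPy (the data are made up and roughly linear; the 70/30 split is arbitrary): fit each polynomial order on the training set only and pick the order with the smallest test-set squared error.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(4, 16, 40)
    y = 1.5 * x + rng.normal(0, 2.0, size=x.size)            # synthetic, roughly linear data

    train = rng.random(x.size) < 0.7                         # random train/test partition
    test = ~train

    errors = {}
    for order in range(1, 6):
        coeffs = np.polyfit(x[train], y[train], deg=order)   # theta fitted on T only
        pred = np.polyval(coeffs, x[test])
        errors[order] = np.mean((pred - y[test]) ** 2)       # test-set error on S

    print(errors, min(errors, key=errors.get))               # chosen order minimizes test error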

Variants of Cross Validation Test set. Leave-one-out. k-fold cross-validation.

ROC Plot (Receiver Operating Characteristic)

Definitions Let $f : \mathbb{R}^n \rightarrow \{-1, +1\}$ be a classifier function.

                      Positive examples    Negative examples
Predicted positive    True positives       False positives
Predicted negative    False negatives      True negatives

Example - Blood Pressure and Cardiovascular Disease (CVD) Classifier: if a person has a mean blood pressure above t, predict that they will have some CV event within 10 years. We have 100 samples. How do we choose t?

t = 0 (everyone is predicted positive):
                      Positive examples    Negative examples
Predicted positive    70                   30
Predicted negative    0                    0

t = 300 (no one is predicted positive):
                      Positive examples    Negative examples
Predicted positive    0                    0
Predicted negative    70                   30

t = 150:
                      Positive examples    Negative examples
Predicted positive    30                   10
Predicted negative    40                   20

More Definitions
                      Positive examples    Negative examples
Predicted positive    TP                   FP
Predicted negative    FN                   TN
True positive rate = TP / (TP + FN). False positive rate = FP / (FP + TN).
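
A small helper mirroring these definitions (plain Python, not from the slides), applied to the three thresholds of the blood-pressure example above:

    def rates(tp, fp, fn, tn):
        tpr = tp / (tp + fn)  # true positive rate (sensitivity)
        fpr = fp / (fp + tn)  # false positive rate
        return tpr, fpr

    print(rates(tp=70, fp=30, fn=0, tn=0))    # t = 0   -> (1.0, 1.0)
    print(rates(tp=0,  fp=0,  fn=70, tn=30))  # t = 300 -> (0.0, 0.0)
    print(rates(tp=30, fp=10, fn=40, tn=20))  # t = 150 -> (~0.43, ~0.33)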

ROC - Receiver Operating Characteristic Curve

[Figures: a sequence of ROC plots, with the TP rate on the y-axis (0 to 1) and the FP rate on the x-axis (0 to 1). The slides annotate the plot: "You are healthy!" (the classifier that predicts everyone negative, at TP rate = 0, FP rate = 0), "You are sick!" (the classifier that predicts everyone positive, at TP rate = 1, FP rate = 1), "Heaven" (the perfect classifier, at TP rate = 1, FP rate = 0), and "Worse than random classifier" (curves below the diagonal).]

Choosing a Point on the Curve Depends on the application: Medical screening tests (e.g., mammography) - want a high TP rate. Spam filtering - want a low FP rate.

The AUC Metric The Area Under the Curve (AUC) metric assesses the accuracy of the ranking in terms of separation of the classes. For a random classifier (bad): AUC = 0.5. For a perfect classifier (good): AUC = 1.
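
A small check of these two extremes, assuming scikit-learn is available; the labels and scores below are synthetic.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)              # true classes (0/1)

    random_scores = rng.random(1000)                    # scores unrelated to the labels
    perfect_scores = labels + 0.01 * rng.random(1000)   # positives always score above negatives

    print(roc_auc_score(labels, random_scores))   # close to 0.5
    print(roc_auc_score(labels, perfect_scores))  # exactly 1.0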

Summary Methods for: Estimating properties of an estimator. Model selection.