Smart Home Health Analytics Information Systems University of Maryland Baltimore County

Smart Home Health Analytics Information Systems University of Maryland Baltimore County 1

IEEE Expert, October 1996 2

Given a sample S from the space of all possible examples D, learner L learns hypothesis h based on S. Sample error: error_S(h). True error: error_D(h). Example: hypothesis h misclassifies 12 of 40 examples in S, so error_S(h) = 0.3. What is error_D(h)? 3

Learner A learns hypothesis h_A on sample S. Learner B learns hypothesis h_B on sample S. We observe error_S(h_A) < error_S(h_B). Is error_D(h_A) < error_D(h_B)? Is learner A better than learner B? 4

How can we estimate the true error of a classifier? How can we determine if one learner is better than another? Using the sample error is too optimistic. Using the error on a separate test set is better, but might still be misleading. Repeating the above for multiple iterations, each with different training/testing sets, yields a better estimate of the true error.

David Wolpert, 1995: for any learning algorithm there are datasets for which it does well, and datasets for which it does poorly. Performance estimates are based on specific datasets; they are not an estimate of the learner on all datasets. There is no one best learning algorithm. 6

Multiple iterations of learning on a training set and testing on a separate validation set are only for evaluation and parameter tuning; final learning should be done on all available data. If the validation set is used to choose/tune a learning method, then it cannot also be used to compare performance against another learning algorithm. We need yet another test set that is unseen during tuning/learning. 7

Error costs (false positives vs. false negatives)
Training time and space complexity
Testing time and space complexity
Interpretability
Ease of implementation 8

Given dataset X. For each of K trials: randomly divide X into a training set (2/3) and a testing set (1/3), learn the classifier on the training set, and test the classifier on the testing set (compute the error). Compute the average error over the K trials. Problem: training and testing sets overlap between trials, which biases the results. 9

Given dataset X. Partition X into K disjoint sets X_1, ..., X_K. For i = 1 to K: learn the classifier on training set X \ X_i and test the classifier on testing set X_i (compute the error). Compute the average error over the K trials. Testing sets no longer overlap; training sets still overlap. 10
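
A minimal sketch of this K-fold procedure in Python, assuming scikit-learn; the dataset (load_breast_cancer) and the decision-tree classifier are placeholders, not part of the original slides:

```python
# K-fold cross-validation (K = 10): train on X \ X_i, test on X_i, average the errors.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errors = []
for train_idx, test_idx in kf.split(X):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                       # learn classifier on X \ X_i
    errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))  # error on X_i

print("average error over K folds:", np.mean(errors))
```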

Stratification: the distribution of classes in the training and testing sets should be the same as in the original dataset; this is called stratified cross-validation. Leave-one-out cross-validation: K = N = |X|; used when classified data is scarce (e.g., medical diagnosis). 11

Tom Dietterich, 1998 (5x2 cross-validation). For each of 5 trials (shuffling X each time): divide X randomly into two halves X_1 and X_2, compute the error using X_1 for training and X_2 for testing, then compute the error using X_2 for training and X_1 for testing. Compute the average error over all 10 results. Five trials is the best number to minimize overlap among training and testing sets. 12
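
A sketch of the 5x2 procedure under the same assumptions (placeholder data and classifier); each of the 5 shuffles yields two half/half train-test errors:

```python
# 5x2 cross-validation: 5 shuffles, each half used once for training and once for testing.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
errors = []
for trial in range(5):                                       # shuffle X each trial
    X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=trial)
    for Xtr, ytr, Xte, yte in [(X1, y1, X2, y2), (X2, y2, X1, y1)]:
        clf = DecisionTreeClassifier(random_state=0).fit(Xtr, ytr)
        errors.append(1.0 - clf.score(Xte, yte))

print("average error over the 10 results:", np.mean(errors))
```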

If there is not enough data for k-fold cross-validation, generate multiple bootstrap samples of size N from X by sampling with replacement. Each sample contains approximately 63% of the examples in X. Compute the average error over all samples. 13

Draw instances from a dataset with replacement. The probability that we do not pick a given instance after N draws is (1 − 1/N)^N ≈ e^(−1) ≈ 0.368; that is, only 36.8% of the data is new (not in the bootstrap sample)!
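
The 0.368 figure is easy to check numerically; a small simulation (sample size and number of bootstrap samples chosen arbitrarily):

```python
# Fraction of instances left out of a bootstrap sample vs. the e^-1 ~ 0.368 limit.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
left_out = []
for _ in range(200):                              # 200 bootstrap samples
    sample = rng.integers(0, N, size=N)           # N draws with replacement
    left_out.append(1.0 - len(np.unique(sample)) / N)

print("simulated fraction not picked:", np.mean(left_out))
print("(1 - 1/N)^N =", (1 - 1/N) ** N, "  e^-1 =", np.exp(-1))
```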

Confusion matrix:

                     Predicted class
True class      Positive              Negative              Total
Positive        tp: true positive     fn: false negative    p
Negative        fp: false positive    tn: true negative     n
Total           p'                    n'                    N
15

Name           Formula
error          (fp + fn) / N
accuracy       (tp + tn) / N
tp-rate        tp / p
fp-rate        fp / n
precision      tp / p'
recall         tp / p  = tp-rate
sensitivity    tp / p  = tp-rate
specificity    tn / n  = 1 - fp-rate

F-measure:  F = 2 * (precision * recall) / (precision + recall)
16

Error rate = # of errors / # of instances = (FN + FP) / N
Precision = # of found positives / # of instances predicted positive = TP / (TP + FP)
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - specificity
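
These measures follow directly from the confusion-matrix counts; as an illustration, the sketch below plugs in the labor-data counts that appear later in these slides (class "bad" treated as positive):

```python
# Performance measures from confusion-matrix counts (TP, FN, FP, TN).
TP, FN, FP, TN = 9, 11, 3, 34          # labor example, "bad" as the positive class
N = TP + FN + FP + TN

error       = (FN + FP) / N
precision   = TP / (TP + FP)
recall      = TP / (TP + FN)           # = sensitivity = tp-rate
specificity = TN / (TN + FP)
false_alarm = FP / (FP + TN)           # = 1 - specificity = fp-rate
f_measure   = 2 * precision * recall / (precision + recall)

print(error, precision, recall, specificity, false_alarm, f_measure)
```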

Sensitivity is the same as tp-rate and recall. Specificity is how well we detect the negatives: # of true negatives / total # of negatives = 1 − false alarm rate. A sensitivity vs. specificity curve can be plotted for different thresholds. 19

=== Run information ===
Scheme:       weka.classifiers.rules.OneR -B 6
Relation:     labor-neg-data
Instances:    57
Attributes:   17
              duration
              wage-increase-first-year
              wage-increase-second-year
              wage-increase-third-year
              cost-of-living-adjustment
              working-hours
              pension
              standby-pay
              shift-differential
              education-allowance
              statutory-holidays
              vacation
              longterm-disability-assistance
              contribution-to-dental-plan
              bereavement-assistance
              contribution-to-health-plan
              class
20

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

wage-increase-first-year:
    < 2.9   -> bad
    >= 2.9  -> good
    ?       -> good
(43/57 instances correct)

Time taken to build model: 0 seconds 21

=== Stratified cross-validation ===
=== Summary ===

Correctly Classified Instances          43               75.4386 %
Incorrectly Classified Instances        14               24.5614 %
Kappa statistic                          0.4063
Mean absolute error                      0.2456
Root mean squared error                  0.4956
Relative absolute error                 53.6925 %
Root relative squared error            103.7961 %
Coverage of cases (0.95 level)          75.4386 %
Mean rel. region size (0.95 level)      50      %
Total Number of Instances               57
22

=== Detailed Accuracy By Class ===

                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.45      0.081     0.75        0.45     0.563       0.684      bad
                 0.919     0.55      0.756       0.919    0.829       0.684      good
Weighted Avg.    0.754     0.385     0.754       0.754    0.736       0.684

=== Confusion Matrix ===

  a  b   <-- classified as
  9 11 |  a = bad
  3 34 |  b = good

                    Predicted class
True class      Class a   Class b   Total
Class a            9        11        20
Class b            3        34        37
Total             12        45        57
23

Most comparisons of machine learning algorithms use classification error. Problems with this approach: there may be different costs associated with false positive and false negative errors, and the training data may not reflect the true class distribution. 24

Receiver Operating Characteristic (ROC): originated from signal detection theory; common in medical diagnosis; becoming common in ML evaluations. ROC curves assess predictive behavior independent of error costs or class distributions. Area Under the ROC Curve (AUC): a single measure of learning algorithm performance independent of error costs and class distributions. 25

[ROC plot: true positive rate vs. false positive rate for learners L1, L2, L3, and a random classifier] 26

Learner L1 dominates L2 if L1's ROC curve is always above L2's curve. If L1 dominates L2, then L1 is better than L2 for all possible error costs and class distributions. If neither dominates (L2 and L3), then different classifiers are better under different conditions. 27

Assume the classifier outputs P(C|x) instead of just C (the predicted class for instance x). Let θ be a threshold such that if P(C|x) > θ, then x is classified as C, else not C. Compute the fp-rate and tp-rate for different values of θ from 0 to 1. Plot each (fp-rate, tp-rate) point and interpolate (or take the convex hull). If there are multiple points with the same fp-rate, then average the tp-rates (k-fold cross-validation). 28
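
One way to carry out this threshold sweep in Python, assuming scikit-learn (the dataset and the naive Bayes model are placeholders; roc_curve performs the sweep over θ):

```python
# Build an ROC curve from posterior scores P(C|x) by varying the threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import roc_curve, roc_auc_score

X, y = load_breast_cancer(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.33, random_state=0)

scores = GaussianNB().fit(Xtr, ytr).predict_proba(Xte)[:, 1]   # P(C|x) on the test set
fpr, tpr, thresholds = roc_curve(yte, scores)                  # one (fp-rate, tp-rate) per theta
print("AUC =", roc_auc_score(yte, scores))
```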

What if the classifier does not provide P(C|x), but just C? E.g., a decision tree or a rule. Generally, even these discrete classifiers maintain statistics for classification: e.g., decision tree leaf nodes use the proportion of examples of each class; rules have the number of examples covered by the rule. These statistics can be compared against a varying threshold θ. 29

[ROC curves for J48 vs. NN on the labor dataset: true positive rate vs. false positive rate] 31

We have seen several ways to estimate learning performance: train/test split, cross-validation, ROC, AUC. But how good are these at estimating the true performance? E.g., is error_S(h) ≈ error_D(h)? 33

Estimate the mean μ of a normal distribution N(μ, σ²). Given a sample X = {x^t} of size N, estimate m = Σ_t x^t / N, where m ~ N(μ, σ²/N). Define a statistic with a unit normal distribution N(0,1): √N (m − μ) / σ ~ Z. 34

95% of Z lies in (−1.96, 1.96); 99% of Z lies in (−2.58, 2.58). P(−1.96 < Z < 1.96) = 0.95. Two-sided confidence interval. 35

P(−1.96 < √N (m − μ)/σ < 1.96) = 0.95
P(m − 1.96 σ/√N < μ < m + 1.96 σ/√N) = 0.95
In general: P(m − z_{α/2} σ/√N < μ < m + z_{α/2} σ/√N) = 1 − α

1 − α    z_{α/2}
0.99     2.58
0.98     2.33
0.95     1.96
0.90     1.64
0.80     1.28
0.68     1.00
0.50     0.67
36
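
The z_{α/2} values in this table come from the standard normal quantile function; a quick check with scipy:

```python
# z_{alpha/2} for several two-sided confidence levels (e.g., 0.95 -> 1.96, 0.99 -> 2.58).
from scipy import stats

for conf in (0.99, 0.98, 0.95, 0.90, 0.80, 0.68, 0.50):
    alpha = 1 - conf
    print(conf, round(stats.norm.ppf(1 - alpha / 2), 2))
```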

One-sided confidence interval:
P(√N (m − μ)/σ < 1.64) = 0.95
P(m − 1.64 σ/√N < μ) = 0.95
In general: P(m − z_α σ/√N < μ) = 1 − α

1 − α    z_α
0.99     2.33
0.95     1.64
0.90     1.28
37

The previous analysis requires that we know σ². We can use the sample variance S² = Σ_t (x^t − m)² / (N − 1). When x^t ~ N(μ, σ²), (N − 1) S²/σ² is chi-squared with N − 1 degrees of freedom. Since m and S² are independent, √N (m − μ)/S is t-distributed with N − 1 degrees of freedom. 38

The t distribution is similar to the normal, but with a larger spread (longer tails); this corresponds to the additional uncertainty from using the sample variance. 39

When σ² is not known:
S² = Σ_t (x^t − m)² / (N − 1)
√N (m − μ)/S ~ t_{N−1}
P(m − t_{α/2,N−1} S/√N < μ < m + t_{α/2,N−1} S/√N) = 1 − α
E.g., t_{0.025,9} = 2.262 and t_{0.025,29} = 2.045 (two-tailed). 40

Example:
x^t: 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (N = 10, Σ_t x^t = 30)
m = 30/10 = 3.0
S² = 0.2/9 = 0.022, S = 0.149
α = 0.05, df = N − 1 = 9, t_{0.025,9} = 2.262
P(m − t_{α/2,N−1} S/√N < μ < m + t_{α/2,N−1} S/√N) = 0.95
P(3.0 − 0.107 < μ < 3.0 + 0.107) = P(2.893 < μ < 3.107) = 0.95
41
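
A re-computation of this interval with scipy, as a sanity check (scipy supplies the t-table value):

```python
# t-based 95% confidence interval for the mean (sigma unknown), using the data above.
import numpy as np
from scipy import stats

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
N, m = len(x), x.mean()                        # m = 3.0
S = x.std(ddof=1)                              # S = 0.149  (S^2 = 0.022)

alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=N - 1)  # t_{0.025,9}
half = t_crit * S / np.sqrt(N)
print(f"P({m - half:.3f} < mu < {m + half:.3f}) = {1 - alpha}")
```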

We want to claim a hypothesis H_1, e.g., H_1: error_D(h) < 0.10. Define the opposite of H_1 to be the null hypothesis H_0, e.g., H_0: error_D(h) ≥ 0.10. Perform an experiment collecting data about error_D(h). With what probability can we reject H_0? 42

Example: sample X = {x^t} of size N from N(μ, σ²); estimate the mean m = Σ_t x^t / N. We want to test whether μ equals some constant μ_0. Null hypothesis H_0: μ = μ_0; alternative hypothesis H_1: μ ≠ μ_0. Reject H_0 if m is too far from μ_0. 43

Example (cont.): we fail to reject H_0 with level of significance α if √N (m − μ_0)/σ lies in (−z_{α/2}, z_{α/2}), i.e., if μ_0 lies in the (1 − α) confidence interval. We reject H_0 if it falls outside this interval on either side (two-sided test). 44

Example (cont.): one-sided test H_0: μ ≤ μ_0 vs. H_1: μ > μ_0. Fail to reject H_0 with level of significance α if √N (m − μ_0)/σ lies in (−∞, z_α); reject H_0 if it falls outside this interval. 45

Example (cont.): if the variance σ² is unknown, use the sample variance S². The statistic is now described by the Student t distribution: √N (m − μ_0)/S ~ t_{N−1}. Fail to reject H_0 with level of significance α if this value lies in (−∞, t_{α,N−1}); reject H_0 if it falls outside this interval. 46

Example (cont.): H_0: μ ≤ μ_0 vs. H_1: μ > μ_0 (one-sided)
x^t: 3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0 (Σ_t x^t = 30)
μ_0 = 2.9, m = 30/10 = 3.0
S² = 0.2/9 = 0.022, S = 0.149
α = 0.05, df = N − 1 = 9, t_{0.05,9} = 1.833
√N (m − μ_0)/S = √10 (3.0 − 2.9)/0.149 = 2.121 ∉ (−∞, 1.833), so reject H_0
Note that t_{0.03145,9} = 2.121
47
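
The same one-sided test in scipy, reproducing the 2.121 statistic and its tail probability:

```python
# One-sided t test of H0: mu <= 2.9 vs. H1: mu > 2.9 on the data above.
import numpy as np
from scipy import stats

x = np.array([3.0, 3.1, 3.2, 2.8, 2.9, 3.1, 3.2, 2.8, 2.9, 3.0])
mu0, alpha = 2.9, 0.05
N, m, S = len(x), x.mean(), x.std(ddof=1)

t_stat = np.sqrt(N) * (m - mu0) / S            # ~2.121
t_crit = stats.t.ppf(1 - alpha, df=N - 1)      # t_{0.05,9} = 1.833
p_val  = 1 - stats.t.cdf(t_stat, df=N - 1)     # ~0.031
print(t_stat, t_crit, p_val,
      "reject H0" if t_stat > t_crit else "fail to reject H0")
```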

Learn a classifier on the training set and test it on a test set V of size N. Assume the classifier has probability p of error, and let X be the number of errors made by the classifier on V. X is described by the binomial distribution: P(X = j) = (N choose j) p^j (1 − p)^(N−j). 48

Test the hypothesis H_0: p ≤ p_0 vs. H_1: p > p_0. Reject H_0 with significance α if P(X ≥ e) = Σ_{j=e..N} (N choose j) p_0^j (1 − p_0)^(N−j) < α, where e is the number of errors observed on V. 49

Single training/validation set: binomial test. If the error probability is p_0, the probability that there are e errors or fewer in N validation trials is P(X ≤ e) = Σ_{j=0..e} (N choose j) p_0^j (1 − p_0)^(N−j). Accept if this probability is less than 1 − α. (Illustration: N = 100, e = 20.) 50
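
A sketch of the exact binomial test with scipy, following the "reject if the upper-tail probability is below α" formulation above; the numbers reuse the later example (N = 40, e = 12, p_0 = 0.2):

```python
# Exact one-sided binomial test of H0: p <= p0 vs. H1: p > p0.
from scipy.stats import binom

N, e, p0, alpha = 40, 12, 0.2, 0.05
p_tail = binom.sf(e - 1, N, p0)        # P(X >= e) under H0
print(p_tail, "reject H0" if p_tail < alpha else "fail to reject H0")
```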

The number of errors X is approximately normal with mean N p_0 and variance N p_0 (1 − p_0), so (X − N p_0) / √(N p_0 (1 − p_0)) ~ Z approximately. Accept if this value for X = e is less than z_{1−α}. 51

Approximating X with a normal distribution: X is the sum of N independent random variables from the same distribution, so X/N is approximately normal for large N with mean p_0 and variance p_0 (1 − p_0)/N (central limit theorem):
(X/N − p_0) / √(p_0 (1 − p_0)/N) ~ Z
Fail to reject H_0 (p ≤ p_0) with significance α if this value lies in (−∞, z_α); reject H_0 if it falls outside (e.g., z_{0.05} = 1.64). This works well for Np ≥ 5 and N(1 − p) ≥ 5. 52

Example: recall the earlier example with error_S(h) = 0.3, N = |S| = 40, X = 12, and error_D(h) = p (unknown). Test H_0: p ≤ p_0 vs. H_1: p > p_0 with p_0 = 0.2 and α = 0.05:
(X/N − p_0) / √(p_0 (1 − p_0)/N) = (0.3 − 0.2) / √(0.2 · 0.8 / 40) = 1.58 ∈ (−∞, 1.64) = (−∞, z_{0.05})
So we fail to reject H_0. 53
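
The same normal-approximation test in a few lines of scipy:

```python
# Normal-approximation test of H0: p <= 0.2 with error_S(h) = 0.3 on N = 40 examples.
import numpy as np
from scipy.stats import norm

N, X, p0, alpha = 40, 12, 0.2, 0.05
z_stat = (X / N - p0) / np.sqrt(p0 * (1 - p0) / N)   # ~1.58
z_crit = norm.ppf(1 - alpha)                         # ~1.64
print(z_stat, z_crit, "reject H0" if z_stat > z_crit else "fail to reject H0")
```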

Example (cont.): what is the 95% (α = 0.05) confidence interval around error_D(h) = p? Let p_0 = error_S(h) = 0.3:
P(p_0 − z_{α/2} √(p_0 (1 − p_0)/N) < p < p_0 + z_{α/2} √(p_0 (1 − p_0)/N)) = 1 − α
P(0.3 − 1.96 √(0.3 · 0.7 / 40) < p < 0.3 + 1.96 √(0.3 · 0.7 / 40)) = 0.95
P(0.3 − 0.142 < p < 0.3 + 0.142) = P(0.158 < p < 0.442) = 0.95
54
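
And the corresponding interval computation:

```python
# 95% normal-approximation confidence interval around error_S(h) = 0.3, N = 40.
import numpy as np
from scipy.stats import norm

N, p0, alpha = 40, 0.3, 0.05
half = norm.ppf(1 - alpha / 2) * np.sqrt(p0 * (1 - p0) / N)   # ~0.142
print(f"P({p0 - half:.3f} < p < {p0 + half:.3f}) = {1 - alpha}")
```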

Evaluate the learner on K training/testing sets, yielding errors p_i, 1 ≤ i ≤ K. Let m = Σ_i p_i / K and S² = Σ_i (p_i − m)² / (K − 1). Then √K (m − p_0)/S ~ t_{K−1}. Reject H_0 with significance α if this value is greater than t_{α,K−1}. Typically K is 10 or 30 (t_{0.05,9} = 1.83, t_{0.05,29} = 1.70). 55
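
A sketch of this test on K fold errors; the per-fold error values below are made up for illustration:

```python
# t test of H0: true error <= p0 using K cross-validation errors p_i.
import numpy as np
from scipy import stats

p = np.array([0.22, 0.18, 0.25, 0.20, 0.19, 0.23, 0.21, 0.24, 0.20, 0.22])  # made-up fold errors
p0, alpha = 0.15, 0.05
K, m, S = len(p), p.mean(), p.std(ddof=1)

t_stat = np.sqrt(K) * (m - p0) / S
t_crit = stats.t.ppf(1 - alpha, df=K - 1)      # t_{0.05,9} = 1.83
print(t_stat, t_crit, "reject H0" if t_stat > t_crit else "fail to reject H0")
```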

K-fold cross-validated paired t test. Paired test: both learners get the same train/test sets. Use K-fold CV to get K training/testing folds. Let p_i^1 and p_i^2 be the errors of learners 1 and 2 on fold i, and p_i = p_i^1 − p_i^2 the paired difference on fold i. The null hypothesis is that p_i has mean 0: H_0: μ = 0 vs. H_1: μ ≠ 0. With m = Σ_i p_i / K and s² = Σ_i (p_i − m)² / (K − 1), √K · m / s ~ t_{K−1} under H_0. Accept H_0 if √K · m / s lies in (−t_{α/2,K−1}, t_{α/2,K−1}). 56
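
A sketch of the paired procedure, assuming scikit-learn; the dataset and the two classifiers (a decision tree and naive Bayes) are placeholders chosen only to produce paired fold errors:

```python
# K-fold cross-validated paired t test: same folds for both learners, test mean difference = 0.
import numpy as np
from scipy import stats
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

diffs = []
for tr, te in skf.split(X, y):                 # identical folds for both learners
    e1 = 1 - DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr]).score(X[te], y[te])
    e2 = 1 - GaussianNB().fit(X[tr], y[tr]).score(X[te], y[te])
    diffs.append(e1 - e2)                      # paired difference p_i

diffs = np.array(diffs)
K, m, s = len(diffs), diffs.mean(), diffs.std(ddof=1)
t_stat = np.sqrt(K) * m / s                    # ~ t_{K-1} under H0
t_crit = stats.t.ppf(0.975, df=K - 1)          # alpha = 0.05, two-sided
print(t_stat, "reject H0" if abs(t_stat) > t_crit else "fail to reject H0")
```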

Tester:     weka.experiment.PairedCorrectedTTester
Analysing:  Percent_correct
Datasets:   8
Resultsets: 2
Confidence: 0.05 (two tailed)
Sorted by:  -
Date:       10/6/10 12:00 AM

Dataset                   (1) rules.on | (2) bayes
--------------------------------------------------
loan               (100)        39.50  |   84.50 v
contact-lenses     (100)        72.17  |   76.17
iris               (100)        93.53  |   95.53
labor-neg-data     (100)        72.77  |   93.57 v
segment            (100)        63.33  |   81.12 v
soybean            (100)        39.75  |   92.94 v
weather            (100)        36.00  |   67.50
weather.symbolic   (100)        38.00  |   57.50
--------------------------------------------------
                         (v/ /*)       | (4/4/0)
57

Be careful when comparing more than two learners. Each comparison has probability α of yielding an incorrect conclusion (incorrectly rejecting the null hypothesis, i.e., incorrectly concluding learner A is better than learner B). The probability of at least one incorrect conclusion among c comparisons is 1 − (1 − α)^c. One approach: analysis of variance (ANOVA). 58
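
The 1 − (1 − α)^c growth is easy to see numerically:

```python
# Probability of at least one incorrect conclusion among c comparisons at level alpha = 0.05.
alpha = 0.05
for c in (1, 2, 5, 10):
    print(c, round(1 - (1 - alpha) ** c, 3))   # 0.05, 0.098, 0.226, 0.401
```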

Evaluating a learning algorithm: error of learned hypotheses (and other measures); K-fold and 5x2 cross-validation; ROC curve and AUC; confidence in the error estimate; comparing two learners.