CSC314 / CSC763 Introduction to Machine Learning

CSC314 / CSC763 Introduction to Machine Learning COMSATS Institute of Information Technology Dr. Adeel Nawab

More on Evaluating Hypotheses/Learning Algorithms

Lecture Outline:
- Review of Confidence Intervals for Discrete-Valued Hypotheses
- General Approach to Deriving Confidence Intervals
- Difference in Error of Two Hypotheses
- Comparing Learning Algorithms (k-fold cross-validation)
- Disadvantages of Accuracy as a Measure
- Confusion Matrices and ROC Graphs

Reading: Chapter 5 of Mitchell; Chapter 5 of Witten and Frank, 2nd ed.
Reference: M. H. DeGroot, Probability and Statistics, 2nd Ed., Addison-Wesley, 1986.

Review of Confidence Intervals for Discrete-Valued Hypotheses Last lecture we examined the question of how we might evaluate a hypothesis. More precisely: 1. Given a hypothesis h and a data sample S of n instances drawn at random according to D, what is the best estimate of the accuracy of h over future instances drawn from D? 2. What is the possible error in this accuracy estimate?

Review of Confidence Intervals for Discrete-Valued Hypotheses (cont...) In answer, we saw that, assuming:
- the n instances in S are drawn independently of one another, and independently of h, according to probability distribution D, and
- n ≥ 30,
then:
1. the most probable value of error_D(h) is error_S(h)
2. with approximately N% probability, error_D(h) lies in the interval

   error_S(h) ± z_N sqrt( error_S(h) (1 − error_S(h)) / n )     (1)

Review (cont) Equation (1) was derived by observing that error_S(h) follows a Binomial distribution with mean value error_D(h) and standard deviation approximated by

   σ_error_S(h) ≈ sqrt( error_S(h) (1 − error_S(h)) / n )

Review (cont) The N% confidence interval for estimating the mean µ of a Normal distribution from a random variable Y with observed value y can be calculated by noting that µ falls into the interval y ± z_N σ N% of the time.
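As an illustration only (the numbers below are invented, not taken from the lecture), a minimal Python sketch of computing such a two-sided interval for error_S(h):

```python
import math

# Illustrative numbers: suppose h misclassifies 12 of n = 40 test examples.
n = 40
error_S = 12 / n                                # error_S(h) = 0.30

z_95 = 1.96                                     # z_N for a two-sided 95% interval
sd = math.sqrt(error_S * (1 - error_S) / n)     # approximate std. dev. of error_S(h)

lower, upper = error_S - z_95 * sd, error_S + z_95 * sd
print(f"error_S(h) = {error_S:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
# -> roughly (0.16, 0.44)
```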

Two-Sided and One-Sided Bounds Confidence intervals discussed so far offer two-sided bounds, above and below. We may only be interested in a one-sided bound, e.g. we may only care about an upper bound on the error (the answer to the question: what is the probability that error_D(h) is at most U?) and not mind if the error is lower than our estimate. Because the Normal distribution is symmetric, a two-sided N% interval with bound z_N also gives a one-sided bound with confidence (100 − (100 − N)/2)%.
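A quick numerical check of this relationship, using the Normal quantile function from scipy.stats (scipy here is just a convenience; a table of z values gives the same numbers):

```python
from scipy.stats import norm

# Two-sided 90% interval: 5% of the probability mass in each tail.
z_two_sided_90 = norm.ppf(1 - 0.10 / 2)

# Used as an upper bound only, the same z value leaves 5% in one tail,
# i.e. it is a one-sided bound with 95% confidence.
z_one_sided_95 = norm.ppf(1 - 0.05)

print(z_two_sided_90, z_one_sided_95)   # both ~1.645
```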

General Approach to Deriving Confidence Intervals Central Limit Theorem Consider a set of independent, identically distributed random variables Y_1 ... Y_n, all governed by an arbitrary probability distribution with mean µ and finite variance σ². Define the sample mean

   Ȳ ≡ (1/n) Σ_{i=1}^{n} Y_i

General Approach to Deriving Confidence Intervals Central Limit Theorem Central Limit Theorem: as n → ∞, the distribution governing Ȳ approaches a Normal distribution with mean µ and variance σ²/n. Significance: we know the form of the distribution of the sample mean even if we do not know the distribution of the underlying Y_i that are being observed. This is useful because whenever we pick an estimator that is the mean of some sample (e.g. error_S(h)), the distribution governing the estimator can be approximated by the Normal distribution for suitably large n (typically n ≥ 30), e.g. we use the Normal distribution to approximate the Binomial distribution that more accurately describes error_S(h).
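A small simulation illustrating the theorem; the choice of an exponential distribution for the Y_i and of 10,000 repetitions is arbitrary and only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Y_i drawn from a clearly non-Normal distribution (exponential, mean 1, variance 1).
n = 30                                                   # sample size per mean
sample_means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)

# By the Central Limit Theorem the sample means are approximately Normal,
# with mean ~1 and standard deviation ~1/sqrt(30) ~ 0.18.
print(sample_means.mean(), sample_means.std())
```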

General Approach to Deriving Confidence Intervals We now have a general approach to deriving confidence intervals for many estimation problems: 1. Pick the parameter p to estimate, e.g. error_D(h). 2. Choose an estimator, e.g. error_S(h). 3. Determine the probability distribution that governs the estimator, e.g. error_S(h) is governed by a Binomial distribution, approximated by the Normal distribution when n ≥ 30.

General Approach to Deriving Confidence Intervals 4. Find the interval (Lower, Upper) such that N% of the probability mass falls in the interval, e.g. use a table of z_N values. Things are made easier if we pick an estimator that is the mean of some sample: then (by the Central Limit Theorem) we can ignore the probability distribution underlying the sample and approximate the distribution governing the estimator by the Normal distribution.

Example: Difference in Error of Two Hypotheses Suppose we have two hypotheses h_1 and h_2 for a discrete-valued target function. h_1 is tested on sample S_1, h_2 on S_2, with S_1 and S_2 independently drawn from the same distribution. We wish to estimate the difference d in true error between h_1 and h_2. Use the 4-step generic procedure to derive a confidence interval for d:

Example: Difference in Error of Two Hypotheses (cont...)
1. The parameter to be estimated is the difference in true errors: d ≡ error_D(h_1) − error_D(h_2).
2. The estimator is the difference in sample errors: d̂ ≡ error_S1(h_1) − error_S2(h_2).
3. For large n_1 and n_2, d̂ is approximately Normally distributed with mean d and variance approximated by

   σ_d̂² ≈ error_S1(h_1)(1 − error_S1(h_1))/n_1 + error_S2(h_2)(1 − error_S2(h_2))/n_2

4. Hence the approximate N% confidence interval for d is d̂ ± z_N σ_d̂.
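A minimal sketch of this calculation in Python; the error rates and sample sizes are invented for illustration:

```python
import math

# Invented figures: h1 tested on n1 = 100 examples, h2 on n2 = 120.
error_S1, n1 = 0.25, 100        # error_S1(h1)
error_S2, n2 = 0.30, 120        # error_S2(h2)

d_hat = error_S1 - error_S2     # estimated difference in true error

# Approximate standard deviation of d_hat (the two sample-error variances add).
sd = math.sqrt(error_S1 * (1 - error_S1) / n1 + error_S2 * (1 - error_S2) / n2)

z_95 = 1.96
print(f"d_hat = {d_hat:.3f}, "
      f"95% interval = ({d_hat - z_95 * sd:.3f}, {d_hat + z_95 * sd:.3f})")
# The interval includes 0 here, so no significant difference at the 95% level.
```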

Comparing Learning Algorithms Suppose we want to compare two learning algorithms rather than two specific hypotheses. There is not complete agreement in the machine learning community about the best way to do this. One way is to determine whether learning algorithm L_A is better on average for learning a target function f than learning algorithm L_B.

Comparing Learning Algorithms (cont.) By "better on average" here we mean relative performance across all training sets of size n drawn from instance distribution D. I.e. we want to estimate

   E_{S ⊂ D} [ error_D(L_A(S)) − error_D(L_B(S)) ]

where L(S) is the hypothesis output by learner L using training set S, i.e. the expected difference in true error between the hypotheses output by learners L_A and L_B when trained using randomly selected training sets S drawn according to distribution D.

Comparing Learning Algorithms (cont.) But, given limited data D_0, what is a good estimator? We could partition D_0 into training set S_0 and test set T_0, and measure

   error_T0(L_A(S_0)) − error_T0(L_B(S_0))

Even better, repeat this many times and average the results.

Comparing Learning Algorithms (cont.) Rather than divide the limited training/testing data just once, do so multiple times and average the results. This is called k-fold cross-validation (sketched in code below):
1. Partition data D_0 into k disjoint test sets T_1, T_2, ..., T_k of equal size, where this size is at least 30.
2. For i from 1 to k: use T_i as the test set and the remaining data as the training set S_i ← D_0 − T_i; train h_A ← L_A(S_i) and h_B ← L_B(S_i); record δ_i ← error_Ti(h_A) − error_Ti(h_B).
3. Return the average δ̄ ≡ (1/k) Σ_{i=1}^{k} δ_i as the estimate of the difference between the two learning algorithms.
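A minimal Python sketch of this procedure, assuming two placeholder training functions train_A and train_B (these names and the numpy-based implementation are illustrative, not part of the lecture):

```python
import numpy as np

def kfold_compare(X, y, train_A, train_B, k=10, seed=0):
    """Estimate the average difference in test-set error between two learners.

    train_A / train_B are placeholder callables: each takes (X_train, y_train)
    and returns a predict(X_test) function. A positive delta_bar means learner
    A made more errors than learner B on the held-out folds.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    deltas = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        predict_A = train_A(X[train_idx], y[train_idx])
        predict_B = train_B(X[train_idx], y[train_idx])
        err_A = np.mean(predict_A(X[test_idx]) != y[test_idx])   # error_Ti(h_A)
        err_B = np.mean(predict_B(X[test_idx]) != y[test_idx])   # error_Ti(h_B)
        deltas.append(err_A - err_B)                             # delta_i
    deltas = np.array(deltas)
    return deltas.mean(), deltas           # delta_bar and the per-fold deltas
```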

Comparing Learning Algorithms Further Considerations We can determine approximate N% confidence intervals for the estimator δ̄ using a statistical test called a paired t test. A paired test is one where the hypotheses are compared over identical samples (unlike the discussion of comparing hypotheses above). The t test uses the t distribution (instead of the Normal distribution).

Comparing Learning Algorithms Further Considerations Another paired test which is increasingly used is the Wilcoxon signed rank test. It has the advantage that, unlike the t test, it does not assume any particular distribution underlying the error (i.e. it is a non-parametric test). Rather than partitioning the available data D_0 into k disjoint equal-sized partitions, we can repeatedly randomly select a test set of n ≥ 30 examples from D_0 and use the rest for training.

Comparing Learning Algorithms Further Considerations We can do this indefinitely many times, to shrink the confidence intervals to arbitrary width. However, the test sets are then no longer independently drawn from the underlying instance distribution D, since instances will recur in separate test sets; in k-fold cross-validation each instance is included in only one test set.
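For illustration, both the paired t test and the Wilcoxon signed rank test can be run on per-fold errors with scipy.stats; the error values below are invented:

```python
from scipy.stats import ttest_rel, wilcoxon

# Per-fold test-set errors for two learners (invented numbers).
errors_A = [0.21, 0.18, 0.25, 0.22, 0.19, 0.24, 0.20, 0.23, 0.22, 0.21]
errors_B = [0.19, 0.17, 0.24, 0.20, 0.18, 0.22, 0.19, 0.21, 0.20, 0.19]

# Paired t test: assumes the per-fold differences are roughly Normally distributed.
t_stat, t_p = ttest_rel(errors_A, errors_B)

# Wilcoxon signed rank test: non-parametric, makes no Normality assumption.
w_stat, w_p = wilcoxon(errors_A, errors_B)

print(f"paired t test p = {t_p:.3f}, Wilcoxon p = {w_p:.3f}")
```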

Disadvantages of Accuracy as a Measure (I) Accuracy is not always a good measure. Consider a two-class classification problem where 995 of 1000 instances in a test sample are negative and 5 are positive: a classifier that always predicts negative will have an accuracy of 99.5%, even though it never correctly predicts positive examples.

Confusion Matrices We can get deeper insights into classifier behaviour by using a confusion matrix, which records how many examples of each true class were assigned to each predicted class.
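A minimal sketch of a confusion matrix computation for the imbalanced example from the previous slide (confusion_counts is a local helper written for illustration, not a library routine):

```python
import numpy as np

def confusion_counts(y_true, y_pred, positive=1):
    """Cells of a 2x2 confusion matrix for a binary problem.

    TP: positives predicted positive    FN: positives predicted negative
    FP: negatives predicted positive    TN: negatives predicted negative
    """
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    return tp, fn, fp, tn

# The imbalanced example above: 5 positives, 995 negatives,
# and a classifier that always predicts negative.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000

tp, fn, fp, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fn + fp + tn)   # 0.995, yet...
tpr = tp / (tp + fn)                         # true positive rate = 0.0
print(tp, fn, fp, tn, accuracy, tpr)
```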


Disadvantages of Accuracy as a Measure (II) Accuracy also ignores the possibility of different misclassification costs: incorrectly predicting positive may be more or less costly than incorrectly predicting negative, e.g. not treating an ill patient vs. treating a healthy one, or refusing credit to a credit-worthy client vs. granting credit to a client who defaults.

Disadvantages of Accuracy as a Measure (II) To address this, many classifiers have parameters that can be adjusted to allow an increased true positive rate (TPR) at the cost of an increased false positive rate (FPR), or a decreased FPR at the cost of a decreased TPR. For each such parameter setting a (TPR, FPR) pair results, and the results may be plotted on a ROC graph (ROC = receiver operating characteristic).
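A minimal sketch of how sweeping a decision threshold over classifier scores yields the (TPR, FPR) pairs of a ROC graph; the scores and labels below are invented for illustration:

```python
import numpy as np

def roc_points(scores, y_true):
    """(FPR, TPR) pairs obtained by sweeping a decision threshold over the scores.

    scores are classifier confidences for the positive class; each distinct
    threshold yields one point on the ROC graph.
    """
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    points = []
    for threshold in np.unique(scores):
        y_pred = scores >= threshold
        tpr = np.mean(y_pred[y_true == 1])   # TPR: fraction of positives predicted positive
        fpr = np.mean(y_pred[y_true == 0])   # FPR: fraction of negatives predicted positive
        points.append((float(fpr), float(tpr)))
    return sorted(points)

# Invented scores and labels for illustration.
scores = [0.95, 0.80, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(roc_points(scores, labels))
```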

Disadvantages of Accuracy as a Measure (II) A ROC graph provides a graphical summary of the trade-offs between sensitivity and specificity. The term originated in signal detection theory, e.g. identifying radar signals of enemy aircraft in noisy environments. See Witten and Frank, Chapter 5.7, and http://en.wikipedia.org/wiki/receiver-operating-characteristic for more.

ROC Graphs

ROC Graphs Example

Summary Confidence intervals give us a way of assessing how likely the true error of a hypothesis is to fall within an interval around the error observed over a sample. For many practical purposes we will be interested in a one-sided confidence interval only. The approach to confidence intervals for sample error may be generalized to apply to any estimator which is the mean of some sample, e.g. we may use this approach to derive a confidence interval for the estimated difference in true error between two hypotheses.

Summary Differences between learning algorithms, as opposed to hypotheses, are typically assessed by k-fold cross-validation. Accuracy (the complement of error) has a number of disadvantages as the sole measure of a learning algorithm. Deeper insight may be obtained using a confusion matrix, which allows us to distinguish the numbers of false positives/negatives from true positives/negatives. The costs of different classification errors may be taken into account using ROC graphs.