SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION


Outline
- Basic terminology
- Features
- Training and validation
- Model selection
- Error and loss measures
- Statistical comparison
- Evaluation measures

Terminology
- Instance: a real-world object, abstractly described through feature values x = (x_1, ..., x_n)
- Class: a set of similar instances, typically described by a class label
(Figure: pears and apples plotted by circumference (x-axis) and height (y-axis))

Features
Discrete features
- Categorical features, i.e., features without ordering or scale (e.g., Boolean features, IDs of specific terms contained in a document, ...)
- Ordinal features, i.e., features that can be ordered but have no scale (e.g., running IDs, date of an event, house numbers, seat numbers, ...)
Continuous features
- Real-valued features, i.e., with both ordering and scale (e.g., height, weight, time, ...)

Feature type | Order | Scale | Central tendency | Dispersion
Categorical  | no    | no    | mode             | n/a
Ordinal      | yes   | no    | median           | quantiles
Continuous   | yes   | yes   | mean             | range, variance, standard deviation

Feature transformations

From \ To   | Categorical    | Ordinal        | Continuous
Categorical | grouping       | ordering       | calibration
Ordinal     | unordering     | ordering       | calibration
Continuous  | discretization | discretization | normalization

Example: prediction of creditworthiness, where one of the features is the credit amount.
(Figure: histograms of credit amount (x-axis) vs. frequency (y-axis) for bad and good customers; accuracy of Naïve Bayes 0.76 vs. 0.75 for the two feature representations)

Discretization of continuous features
Discretization methods proceed by merging or splitting groups/bins and can be supervised or unsupervised:
- Supervised, based on dependency or entropy: groups/bins chosen via χ², mutual information, entropy, MDL, and correlation analysis
- Supervised, based on accuracy or error: groups/bins chosen via prediction analysis
- Unsupervised, based on distance or similarity: binning strategies, clustering, MDL analysis
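A minimal sketch of the unsupervised case, assuming a single continuous feature held in a NumPy array; the skewed "credit amount" data, the bin count, and both binning strategies are illustrative choices, not part of the original slide.

```python
import numpy as np

rng = np.random.default_rng(0)
credit_amount = rng.lognormal(mean=8.0, sigma=1.0, size=1000)  # hypothetical skewed feature

n_bins = 5

# Equal-width binning: split the value range into bins of identical width.
width_edges = np.linspace(credit_amount.min(), credit_amount.max(), n_bins + 1)
width_bins = np.digitize(credit_amount, width_edges[1:-1])

# Equal-frequency binning: place edges at quantiles so each bin holds roughly the
# same number of instances (often more robust for skewed distributions).
freq_edges = np.quantile(credit_amount, np.linspace(0, 1, n_bins + 1))
freq_bins = np.digitize(credit_amount, freq_edges[1:-1])

print(np.bincount(width_bins))  # very uneven counts for a skewed feature
print(np.bincount(freq_bins))   # roughly 200 instances per bin
```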

Feature selection
Statistical process of choosing good features for a specific classification task (e.g., χ²-test of independence, mutual information/information gain, correlation analysis, ...).
Features should be
- not too general,
- not too specific,
- as independent of each other as possible,
- as strongly correlated with the class as possible.
Feature selection and transformation depend on the data at hand and the underlying classification task.
- Rule 1: Get to know your data.
- Rule 2: Make plausible assumptions (e.g., height in a population is normally distributed; income distributions are highly skewed and peaked).
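A minimal sketch of statistical feature selection with scikit-learn, assuming a feature matrix X and class labels y; the Iris dataset and k = 2 are illustrative, and mutual information is shown as the information-gain alternative mentioned above.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

X, y = load_iris(return_X_y=True)

# Score each feature by its chi-squared statistic against the class labels
# and keep the two highest-scoring features.
chi2_selector = SelectKBest(score_func=chi2, k=2)
X_selected = chi2_selector.fit_transform(X, y)

# Mutual information (information gain) is an alternative scoring function.
mi_scores = mutual_info_classif(X, y, random_state=0)

print(chi2_selector.scores_)   # per-feature chi-squared scores
print(mi_scores)               # per-feature mutual information with the class
print(X_selected.shape)        # (150, 2): only the two selected features remain
```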

Terminology: Training and test set
- Training set: pairs (x, l(x)) of instances and corresponding class labels, used to train a classification model
- Test set: instances whose classes are known but hidden, used to test the classifier
- Supervised learning: learning with a training and a test set
(Figure: pears and apples plotted by circumference and height, with held-out test instances marked)
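A minimal sketch of holding out a test set, assuming a feature matrix X and labels y; the helper name split_train_test, the 70/30 ratio, and the apple/pear encoding are illustrative.

```python
import numpy as np

def split_train_test(X, y, test_fraction=0.3, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle instance indices
    n_test = int(len(X) * test_fraction)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

X = np.random.rand(100, 2)               # e.g., (circumference, height) pairs
y = np.random.randint(0, 2, size=100)    # 0 = apple, 1 = pear (hypothetical labels)
X_train, y_train, X_test, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))         # 70 training and 30 test instances
```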

Linear classification models
- Naive Bayes
- Perceptron
- Winnow
- Logistic Regression
- Support Vector Machines
(Figure: a linear decision boundary separating apples from pears in the circumference-height plane)

Non-linear classification models
- Rule-based classifiers
- K-Nearest Neighbors
- Hierarchical classifiers
- Ensemble classifiers
- Kernel methods
- Neural networks
(Figure: a non-linear decision boundary separating apples from pears in the circumference-height plane)

Multi-class classification
In binary classification an instance belongs either to C or to its complement C̄. For k > 2 different classes C_1, ..., C_k:
- One-of classification: an instance can belong to exactly one of the k classes
- Any-of classification: an instance can belong to many of the k classes, or to none
Examples of binary classifiers that naturally generalize to multi-class classifiers:
- Naive Bayes
- Decision trees
- K-nearest neighbors
- Neural networks
In general, geometric models (e.g., Support Vector Machines, Winnow, etc.) do not generalize directly to multi-class classifiers.

One-of vs. any-of classification
One-of classification: an instance can belong to exactly one of the k classes. Typical solution:
1. Build a classifier for each C_i and its complement C̄_i.
2. For each instance, apply each classifier separately.
3. Assign the instance to the class with the maximum score (where the score represents how well the instance fits the class).
Any-of classification: an instance can belong to many of the k classes, or to none. Typical solution:
1. Build a classifier for each C_i and its complement C̄_i.
2. Apply each classifier separately; the decision of one classifier has no influence on the decisions of the other classifiers.
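A minimal sketch of both schemes built from k binary classifiers (one per class C_i vs. its complement); the random data, the logistic-regression base classifier, and the 0.5 threshold for the any-of case are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = rng.integers(0, 3, size=300)            # three classes: 0, 1, 2 (toy labels)

# Train one binary classifier per class (class c vs. rest) and collect its scores.
scores = []
for c in np.unique(y):
    clf = LogisticRegression().fit(X, (y == c).astype(int))
    scores.append(clf.predict_proba(X)[:, 1])   # score = estimated P(class c | x)
scores = np.column_stack(scores)

one_of_pred = scores.argmax(axis=1)   # one-of: pick the class with maximum score
any_of_pred = scores > 0.5            # any-of: threshold each classifier independently
print(one_of_pred[:5])
print(any_of_pred[:5])                # rows may have several True entries, or none
```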

Training and validation
Instance-class pairs (x_i, l(x_i)) for data x_1, ..., x_{n+m} and labeling function l: x_i ↦ {y_1, ..., y_k}
- Training set: pairs (x_i, l(x_i)) for i = 1, ..., n
- Test set: pairs (x_{n+j}, ?) for j = 1, ..., m
Workflow: training data → learn model → trained model → validate → validated model

Overfitting
Refers to cases where an algorithm's performance is much better on the training set than on the test set. It may occur when
- the training set is small,
- the training instances are not representative,
- parameters are tuned to the best-performing values on the training set,
- learning is run for too long (e.g., for neural networks),
- the dimensionality of the data is high.

Curse of dimensionality
- The instance space grows exponentially with the number of dimensions, so training data quickly becomes non-representative (→ overfitting).
- The distribution of data in high-dimensional space may be counter-intuitive (example from C. Bishop, PRML).
- Example 1: distance-based similarities become non-discriminative (why?).
- Example 2: what fraction of the volume of a sphere of radius r = 1 lies between radius 1 - ε and 1? With V(r) = αr^D,
  (V(1) - V(1 - ε)) / V(1) = (α - α(1 - ε)^D) / α = 1 - (1 - ε)^D.
  For large D, most of the volume is concentrated near the surface.
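A minimal sketch of the shell-volume computation above: the fraction of a unit sphere's volume lying within ε of the surface is 1 - (1 - ε)^D; the choice ε = 0.01 and the dimensions printed are illustrative.

```python
# Fraction of the unit-sphere volume in the outer shell of thickness eps.
eps = 0.01
for D in (1, 2, 10, 100, 1000):
    shell_fraction = 1 - (1 - eps) ** D
    print(f"D = {D:4d}: fraction in outer 1% shell = {shell_fraction:.3f}")
# Already for D = 1000, essentially all of the volume sits in the thin outer shell.
```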

Cross validation
How can we adjust model parameters to good values and avoid overfitting?
N-fold cross validation:
- Partition the dataset randomly into N folds (ideally, the frequency of each class within a fold should be proportional to its frequency in the full dataset).
- For each fold F_i: train the model on the remaining N-1 folds, test on F_i, and estimate the error err_i on F_i.
- Compute the weighted average of the parameter values, taking the errors into account.
Leave-one-out cross validation: N-fold cross validation where each fold is a single instance (N is the number of instances).
(Figure: illustration of 10-fold cross validation with one test fold held out per run)
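A minimal sketch of N-fold cross validation with scikit-learn; the Iris data and logistic-regression model are illustrative, and StratifiedKFold is used so that class frequencies in each fold stay roughly proportional to those in the full dataset, as recommended above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

errors = []
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in folds.split(X, y):
    # Train on the remaining 9 folds, test on the held-out fold F_i.
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    err_i = 1.0 - model.score(X[test_idx], y[test_idx])   # error estimate on F_i
    errors.append(err_i)

print(f"mean CV error: {np.mean(errors):.3f} +/- {np.std(errors):.3f}")
```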

Alternative to cross validation: Bootstrapping
- Sample uniformly from a dataset D with n elements, n times with replacement.
- Use the n sampled elements as the training set, and the elements of D that were never sampled as the test set.
- Repeat the above two steps m times.
What are the expected sizes of the training and test set?
- Expected fraction of instances in the test set: (1 - 1/n)^n ≈ 1/e ≈ 0.368
- Expected fraction of (distinct) instances in the training set: ≈ 0.632
Compute the weighted average of the parameter values, taking the error from each of the m runs into account.
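A minimal sketch of a single bootstrap resample, checking empirically that roughly 63.2% of the instances end up in the training set while about 36.8% are never drawn and form the test set; the dataset size is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
train_idx = rng.integers(0, n, size=n)              # sample n times with replacement
test_idx = np.setdiff1d(np.arange(n), train_idx)    # instances that were never sampled

print(len(np.unique(train_idx)) / n)   # ~0.632 distinct training instances
print(len(test_idx) / n)               # ~0.368 instances in the test set
```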

Model selection through validation
Given different prediction algorithms for the same data, which one should we select? There are different possibilities:
- Choose the algorithm with the lowest average error (or highest average prediction accuracy) over the N runs of cross validation.
- Choose the algorithm that minimizes the bootstrapping error
  err = (1/m) Σ_{i=1}^{m} (0.632 · err_{i,test} + 0.368 · err_{i,training})
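A minimal sketch of the 0.632 bootstrap error estimate above, assuming per-run training and test (out-of-sample) errors from m bootstrap repetitions; the error values are made up purely for illustration.

```python
# Hypothetical errors from m = 4 bootstrap runs of one algorithm.
err_train = [0.05, 0.04, 0.06, 0.05]   # err_{i,training}
err_test = [0.18, 0.21, 0.17, 0.20]    # err_{i,test}

m = len(err_test)
err_632 = sum(0.632 * et + 0.368 * etr
              for et, etr in zip(err_test, err_train)) / m
print(f"0.632 bootstrap error estimate: {err_632:.3f}")
# Compare this value across algorithms and pick the one with the smallest estimate.
```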

Error types
Classification error:
- Training error: fraction of misclassifications in the training set
- Test error: fraction of misclassifications in the test set
Generalization error:
- For data D = x_1, ..., x_n and a classification algorithm h, the generalization error is err(h) = P(h(x) ≠ l(x)) = E[test error].

Probability of misclassification: Bayes error
(Figure: class densities P(A, x) and P(B, x) with the decision boundary; the algorithm classifies x as A in region R_1 and as B in region R_2)
err_Bayes = ∫_{x ∈ R_1} P(B|x) P(x) dx + ∫_{x ∈ R_2} P(A|x) P(x) dx
A classifier that minimizes the Bayes error is called a Bayes optimal classifier.

Classification loss (1)
Consider a model for predicting whether someone has the disease or not, with loss matrix l_{i,j}:

ground truth \ predicted | disease | ok
disease                  | 0       | 1000
ok                       | 10      | 0

Probabilistic loss function (weighted Bayes error): let c = (c_1, ..., c_k) be the target classes to which an instance x can belong with some probability. The expected loss is given by the weighted Bayes error
L_Bayes = Σ_i Σ_j ∫_{x ∈ R_j} l_{i,j} P(c_i|x) P(x) dx,
which is minimized when each x is assigned to the class j that minimizes Σ_i l_{i,j} P(c_i|x).
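A minimal sketch of the loss-sensitive decision rule above: given a posterior P(c_i|x) and the loss matrix, assign x to the class j that minimizes Σ_i l_{i,j} P(c_i|x). The posterior values are illustrative.

```python
import numpy as np

# Rows: true class (disease, ok); columns: predicted class (disease, ok).
loss = np.array([[0.0, 1000.0],
                 [10.0, 0.0]])

posterior = np.array([0.05, 0.95])   # hypothetical P(disease | x), P(ok | x)

expected_loss = posterior @ loss     # expected loss of each possible decision
decision = expected_loss.argmin()

print(expected_loss)                           # [ 9.5 50. ]
print("predict:", ["disease", "ok"][decision]) # predicting "disease" is cheaper,
                                               # even though P(disease | x) is only 0.05
```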

Classification loss (2)
Loss functions: for an instance x, let y be the predicted and t the true class.
- 0-1 loss: l(y, t) = 1 if y ≠ t, 0 if y = t (indicator function)
- Absolute loss: l(y, t) = |y - t|
- Information loss: l(t, x) = -log P(t|x) if t is the true class of x, -log(1 - P(t|x)) otherwise
- Quadratic loss: l(y, t) = c (y - t)^2, for a constant c
- Exponential loss: l(y, t) = e^{-yt}, for classes encoded as t ∈ {-1, +1}

Model selection through hypothesis testing
Let h and h' be two models and err_1, ..., err_k and err'_1, ..., err'_k the respective error values derived from k-fold cross validation. Are the mean errors any different?
Fact: the mean errors are approximately normally distributed, so for the paired differences d_i = err_i - err'_i the statistic (d̄ - 0)·√k / s_d follows a t-distribution with k-1 degrees of freedom (d̄ is the mean and s_d the standard deviation of the d_i).
Two-sided test of H_0: μ = μ':
  P(-t_{k-1,1-α/2} ≤ (d̄ - 0)·√k / s_d ≤ t_{k-1,1-α/2}) = 1 - α
  If |(d̄ - 0)·√k / s_d| > t_{k-1,1-α/2}, reject H_0; otherwise retain it.
What if H_0 is d = e - e' ≥ 0 (and H_1 is d = e - e' < 0)? Left-sided test:
  P((d̄ - 0)·√k / s_d ≥ -t_{k-1,1-α}) = 1 - α
  If (d̄ - 0)·√k / s_d < -t_{k-1,1-α}, reject H_0; otherwise retain it.
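A minimal sketch of the paired comparison above, assuming per-fold errors of two models measured on the same k-fold split; the error values are made up for illustration, and scipy is used only for the critical value of the t-distribution.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold errors of models h and h' from the same 10-fold CV.
err_h = np.array([0.12, 0.15, 0.11, 0.14, 0.13, 0.16, 0.12, 0.15, 0.14, 0.13])
err_h2 = np.array([0.10, 0.12, 0.11, 0.11, 0.12, 0.13, 0.10, 0.12, 0.11, 0.12])

d = err_h - err_h2
k = len(d)
t_stat = d.mean() * np.sqrt(k) / d.std(ddof=1)    # (d_bar - 0) * sqrt(k) / s_d
t_crit = stats.t.ppf(1 - 0.05 / 2, df=k - 1)      # two-sided threshold t_{k-1, 1-alpha/2}

print(f"t = {t_stat:.2f}, critical value = {t_crit:.2f}")
if abs(t_stat) > t_crit:
    print("reject H0: the mean errors differ")
else:
    print("retain H0: no significant difference")
```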

Evaluation measures: Accuracy
Let h be a prediction model for class C. Confusion matrix (for a given decision threshold):

                 actual: C                actual: C̄
predicted: C     # true positives (TP)    # false positives (FP)
predicted: C̄     # false negatives (FN)   # true negatives (TN)

Accuracy(h) = (TP + TN) / (TP + FN + FP + TN)
What does accuracy capture? It captures the success rate, but does not say anything about prediction power! Is an accuracy of 99% good?

Evaluation measures: Precision and Recall
Let h be a prediction model for class C (confusion matrix as above; the counts reflect the decision threshold).
Precision(h) = TP / (TP + FP);  Recall(h) = TP / (TP + FN)
What does Precision represent? The probability that h is correct whenever it predicts C.
What does Recall represent? The probability that h recognizes an instance from C.

Evaluation measures: Sensitivity and Specificity
Let h be a prediction model for class C (confusion matrix as above).
Sensitivity(h) = Recall(h) = TP / (TP + FN);  Specificity(h) = TN / (TN + FP)
What does Specificity represent? The probability that h recognizes an instance from C̄.

Evaluation measures: F1-measure
Let h be a prediction model for class C (confusion matrix as above).
F_α(h) = (1 + α²) · Precision(h) · Recall(h) / (α² · Precision(h) + Recall(h))
F_1(h) = 2 · Precision(h) · Recall(h) / (Precision(h) + Recall(h))
The F1-score is the harmonic mean of Precision and Recall.
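A minimal sketch computing the measures of the last few slides from confusion-matrix counts; the TP/FN/FP/TN values are made up for illustration.

```python
# Hypothetical confusion-matrix counts for class C.
tp, fn, fp, tn = 80, 20, 30, 870

accuracy = (tp + tn) / (tp + fn + fp + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)            # = sensitivity
specificity = tn / (tn + fp)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f} f1={f1:.3f}")
```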

Evaluation measures: Precision-recall curve
Let h be a prediction model for class C. With decreasing decision threshold, plot the resulting (recall, precision) values.
BreakEven = the value v such that Precision(h) = v = Recall(h)
(Figure: precision-recall curves; curves closer to the top-right corner indicate better performance, here with BreakEven = 0.66)

Evaluation measures: Receiver-Operating Characteristic (ROC) curve
Let h be a prediction model for class C; for decreasing decision threshold compute
Sensitivity = Recall = True Positive Rate = TP / (TP + FN)
1 - Specificity = False Positive Rate = FP / (FP + TN)
and plot the resulting (false positive rate, true positive rate) points; curves closer to the top-left corner indicate better performance.
Often reported as a performance measure: the area under the curve (AUC), which represents the ranking accuracy
AUC = (1 / (|Pos| · |Neg|)) · Σ_{x ∈ Pos, x' ∈ Neg} [ 1(rank(x) > rank(x')) + ½ · 1(rank(x) = rank(x')) ]

Example of ROC curve construction
For decreasing decision threshold with respect to the prediction probabilities of the test set items, compute
Sensitivity = Recall = True Positive Rate = TP / (TP + FN)
1 - Specificity = False Positive Rate = FP / (FP + TN)
and add one ROC point per threshold.
(Figure: test set items sorted by prediction probability and the resulting ROC points)
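A minimal sketch of this construction: sort the test items by predicted score, sweep the decision threshold downwards, record one (FPR, TPR) point per item, and integrate the curve with the trapezoid rule to get the AUC. The scores and labels are made up for illustration.

```python
import numpy as np

scores = np.array([0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])   # 1 = positive class C

pos, neg = labels.sum(), (1 - labels).sum()
order = np.argsort(-scores)                          # decreasing decision threshold
tpr = np.cumsum(labels[order]) / pos                 # TP / (TP + FN) after each item
fpr = np.cumsum(1 - labels[order]) / neg             # FP / (FP + TN) after each item
tpr, fpr = np.r_[0.0, tpr], np.r_[0.0, fpr]          # start the curve at (0, 0)

auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoid rule
print(list(zip(fpr.round(2), tpr.round(2))))             # the ROC points
print(f"AUC = {auc:.2f}")
```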

ROC-based calibration of decision thresholds
Let h be a prediction model for class C; for decreasing decision threshold compute the true positive rate TP / (TP + FN) and the false positive rate FP / (FP + TN) as above.
- The uppermost point at which the ROC curve touches a line of slope 1 corresponds to the highest average recognition probability of the positive and negative class (i.e., the highest average recall over both classes).
- For maximum accuracy the slope of the line has to be |Neg| / |Pos|.
(Figure: ROC curve with tangent lines used to select the decision threshold)

Evaluation measures for multi-class classification (1)
Let h be a prediction model for classes C_1, ..., C_n, with a per-class confusion matrix (TP_i, FN_i, FP_i, TN_i) obtained by treating C_i as the positive class and all other classes as negative.
Prec_micro(h) = Σ_i TP_i / Σ_i (TP_i + FP_i);   Prec_macro(h) = (1/n) Σ_i TP_i / (TP_i + FP_i)
Rec_micro(h) = Σ_i TP_i / Σ_i (TP_i + FN_i);   Rec_macro(h) = (1/n) Σ_i TP_i / (TP_i + FN_i)
F1_micro(h) = 2 · Prec_micro · Rec_micro / (Prec_micro + Rec_micro);   F1_macro(h) = (1/n) Σ_i F1(C_i)
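A minimal sketch of micro- vs. macro-averaged precision and recall from per-class counts TP_i, FP_i, FN_i; the three classes and their counts are made up for illustration.

```python
# Hypothetical per-class counts for a 3-class problem.
classes = {
    "C1": dict(tp=50, fp=10, fn=5),
    "C2": dict(tp=8,  fp=2,  fn=12),
    "C3": dict(tp=3,  fp=6,  fn=9),
}

# Micro-averaging: pool the counts over all classes, then compute a single ratio.
tp = sum(c["tp"] for c in classes.values())
fp = sum(c["fp"] for c in classes.values())
fn = sum(c["fn"] for c in classes.values())
prec_micro = tp / (tp + fp)
rec_micro = tp / (tp + fn)

# Macro-averaging: compute the ratio per class, then average (each class weighs equally).
prec_macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in classes.values()) / len(classes)
rec_macro = sum(c["tp"] / (c["tp"] + c["fn"]) for c in classes.values()) / len(classes)

print(f"micro: P={prec_micro:.2f} R={rec_micro:.2f}")   # dominated by the frequent class
print(f"macro: P={prec_macro:.2f} R={rec_macro:.2f}")   # sensitive to the rare classes
```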

Evaluation measures for multi-class classification (2)
In a one-against-all fashion (i.e., the class of interest is the positive class and all the other classes together form the negative class) one can analyze, for each class,
- the ROC curve,
- the precision-recall behavior of the classifier,
- the accuracy.
Overall performance can be reported as the weighted average of precision, recall, and accuracy (i.e., weighted by the proportion of instances in each class).

Which measure to use
For deeper analysis of thresholds or other parameters:
- ROC curves
- Precision-recall curves
- Error/loss curves
- Statistical analysis
If threshold or parameter analysis is not an issue:
- Precision
- Recall
- Accuracy
- Specificity
- Prediction error