IMBALANCED DATA. Phishing. Admin 9/30/13: Assignment 3 (how did it go? do the experiments help?); Assignment 4; course feedback.


Admin (9/30/13): Assignment 3: how did it go? do the experiments help? Assignment 4. Course feedback.

IMBALANCED DATA. David Kauchak, CS 451, Fall 2013.

Phishing.

Setup: 1. For one hour, Google collects 1M e-mails randomly. 2. They pay people to label them as phishing or not phishing. 3. They give the data to you to learn to classify e-mails as phishing or not. 4. You, having taken ML, try out a few of your favorite classifiers. 5. You achieve an accuracy of 99.997%. Should you be happy?

Imbalanced data: the labeled data is 99.997% not phishing and only 0.003% phishing. This is what is called an imbalanced data problem: there is a large discrepancy between the number of examples with each class label. For example, in our 1M-example dataset only about 30 e-mails would actually represent phishing. What is probably going on with our classifier?

Imbalanced data: many classifiers are designed to optimize error/accuracy. Always predicting "not phishing" achieves 99.997% accuracy, so this tends to bias performance towards the majority class. Why does the classifier learn this? Anytime there is an imbalance in the data this can happen; it is particularly pronounced, though, when the imbalance is more pronounced.
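To make the accuracy trap concrete, here is a minimal sketch (my own illustration, not part of the slides) showing that a classifier which always predicts the majority class reaches 99.997% accuracy at this class balance; the 1/0 label encoding (1 = phishing) is an assumption for the example.

```python
# A hypothetical dataset with the same class balance as the slides:
# 1,000,000 e-mails, only 30 of which are phishing (label 1).
labels = [1] * 30 + [0] * (1_000_000 - 30)

# "Classifier" that always predicts the majority class (not phishing).
predictions = [0] * len(labels)

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)
print(f"accuracy = {accuracy:.5%}")  # 99.99700%, yet zero phishing e-mails are caught
```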

Imbalanced problem domains: besides phishing (and spam), what are some other imbalanced problem domains?

Imbalanced problem domains: medical diagnosis; predicting faults/failures (e.g. hard-drive failures, mechanical failures, etc.); predicting rare events (e.g. earthquakes); detecting fraud (credit card transactions, internet traffic).

Imbalanced data: current classifiers. The labeled data is 99.997% not phishing and 0.003% phishing. How will our current classifiers do on this problem?

Imbalanced data: current classifiers. All will do fine if the data can be easily separated/distinguished. Decision trees: pruning explicitly minimizes training error and leaves pick the majority label, so they tend to do very poorly on imbalanced problems. k-NN: even for small k, the majority class will tend to overwhelm the vote. Perceptron: can be reasonable since it only updates when a mistake is made, but it can take a long time to learn.

Part of the problem: evaluation. Accuracy is not the right measure of classifier performance in these domains. Other ideas for evaluation measures?

Identification tasks: view the task as trying to find/identify the positive examples (i.e. the rare events). Precision: the proportion of test examples predicted as positive that are correct. Recall: the proportion of test examples labeled as positive that are correctly predicted.

Identification tasks: precision and recall. Given a table of examples with their true labels and predicted labels:

precision = (# examples correctly predicted positive) / (# examples predicted positive)
recall = (# examples correctly predicted positive) / (# all positive examples)
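A minimal sketch (my own, not from the slides) of computing these two quantities from parallel lists of true labels and predictions, using 1 for positive and 0 for negative:

```python
def precision_recall(labels, predictions):
    """Compute precision and recall for the positive class (label 1)."""
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    predicted_pos = sum(predictions)   # everything we called positive
    actual_pos = sum(labels)           # everything that really is positive
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Toy example: 3 actual positives, 2 of them found, plus 1 false alarm.
labels      = [1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0]
print(precision_recall(labels, predictions))  # (0.666..., 0.666...)
```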

Precision and recall: worked examples on a small data/label/predicted table, computing precision and recall for different sets of predictions (the example tables are figures in the original slides). Why do we have both measures? How can we maximize precision? How can we maximize recall?

Maximizing precision: don't predict anything as positive! Maximizing recall: predict everything as positive!
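As a quick illustration of those two extremes (my own example, reusing the precision_recall helper sketched above):

```python
labels = [1, 1, 1, 0, 0, 0, 0]

# "Maximize precision": predict nothing as positive -> no false positives,
# but recall collapses to 0 (precision is then undefined; the helper returns 0).
print(precision_recall(labels, [0] * len(labels)))   # (0.0, 0.0)

# "Maximize recall": predict everything as positive -> recall is 1.0,
# but precision drops to the base rate of positives (3/7 here).
print(precision_recall(labels, [1] * len(labels)))   # (0.428..., 1.0)
```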

Precision vs. recall: often there is a tradeoff between precision and recall; increasing one tends to decrease the other. For our algorithms, how might we increase/decrease precision or recall?

For many classifiers we can get some notion of the prediction confidence. Only predict positive if the confidence is above a given threshold; by varying this threshold, we can vary precision and recall. (The slide shows a data/label/predicted/confidence table; the visible confidence values include 0.75, 0.60, 0.80, 0.50, 0.55, and 0.90.)

Precision/recall tradeoff: put the most confident positive predictions at the top and the most confident negative predictions at the bottom, then calculate precision/recall at each break point/threshold, classifying everything above the threshold as positive and everything else as negative. For example, one break point gives precision = 1/2 = 0.5 and recall = 1/3 = 0.33.

Precision/recall tradeoff: the same ranked table evaluated at further break points. At one threshold, precision = 3/5 = 0.6 and recall = 3/3 = 1.0; at another, precision = 3/7 = 0.43 and recall = 3/3 = 1.0.

Precision-recall curve: plotting the precision/recall pairs from all of the break points gives a precision-recall curve (the slide shows the curve with Precision on the y-axis and Recall on the x-axis, both running from 0 to 1).
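A small sketch of the break-point procedure described above (my own code; the confidence values echo those visible on the slide, but their pairing with labels is an assumption): sort examples by classifier confidence, then at each break point classify everything above it as positive and compute precision/recall.

```python
def pr_curve_points(labels, confidences):
    """Precision/recall at every break point of the confidence ranking."""
    # Sort examples from most to least confidently positive.
    ranked = sorted(zip(confidences, labels), reverse=True)
    total_pos = sum(labels)
    points = []
    true_pos = 0
    for k, (conf, label) in enumerate(ranked, start=1):
        true_pos += label                 # label is 1 for positive, 0 for negative
        precision = true_pos / k          # positives among the top-k predictions
        recall = true_pos / total_pos     # fraction of all positives found so far
        points.append((recall, precision))
    return points

# Hypothetical ranking with 3 actual positives among 7 examples.
labels      = [1, 0, 1, 0, 1, 0, 0]
confidences = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.2]
for r, p in pr_curve_points(labels, confidences):
    print(f"recall = {r:.2f}, precision = {p:.2f}")
```

On this toy ranking the last two break points reproduce the slide's numbers: precision = 3/5 = 0.6 and 3/7 = 0.43 at recall = 1.0.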

Which system is better? (The slide compares two precision-recall curves.) How can we quantify this?

Area under the curve: the area under the precision-recall curve (AUC) is one metric that encapsulates both precision and recall: calculate the precision/recall values for all thresholdings of the test set (like we did before), then calculate the area under the resulting curve. It can also be calculated as the average precision over all of the recall points.

Area under the curve: any concerns/problems? For real use, we are often only interested in performance in a particular range of the curve. And eventually we need to deploy: how do we decide what threshold to use?
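A sketch (my own, not from the slides) of the "average precision over the recall points" view of the area under the precision-recall curve, using the same ranking idea as pr_curve_points above:

```python
def average_precision(labels, confidences):
    """Approximate the area under the PR curve as the average of the
    precision values at the recall points (i.e. at each true positive)."""
    ranked = sorted(zip(confidences, labels), reverse=True)
    total_pos = sum(labels)
    true_pos = 0
    precisions_at_hits = []
    for k, (conf, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            precisions_at_hits.append(true_pos / k)
    return sum(precisions_at_hits) / total_pos

labels      = [1, 0, 1, 0, 1, 0, 0]
confidences = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.2]
print(average_precision(labels, confidences))  # (1.0 + 0.67 + 0.6) / 3 ≈ 0.76
```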

Area under the curve? We'd like a compromise between precision and recall. Ideas?

A combined measure: F. A combined measure that assesses the precision/recall tradeoff is the F measure (a weighted harmonic mean):

F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}

A combined measure: F. The most common choice is α = 0.5, i.e. an equal balance/weighting between precision and recall:

F = \frac{1}{0.5 \frac{1}{P} + 0.5 \frac{1}{R}} = \frac{2 P R}{P + R}

Why the harmonic mean? Why not the normal mean (i.e. the average)?
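A minimal sketch of the F measure defined above (my own code; the function name is mine):

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall.

    beta = 1 weights them equally (the usual F1); beta > 1 favors recall,
    beta < 1 favors precision.
    """
    if precision == 0 and recall == 0:
        return 0.0
    return (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)

# Equal weighting (F1): the harmonic mean punishes imbalanced precision/recall.
print(f_measure(0.5, 0.5))   # 0.5
print(f_measure(0.9, 0.1))   # 0.18, whereas the arithmetic mean would still be 0.5
```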

F and other averages: the slide plots several ways of combining precision and recall (minimum, maximum, arithmetic mean, geometric mean, harmonic mean) as precision varies with recall fixed at 70%. The harmonic mean encourages precision/recall values that are similar!

Evaluation summarized: accuracy is often NOT an appropriate evaluation metric for imbalanced data problems. Precision and recall capture different characteristics of our classifier. AUC and F1 can be used as a single metric to compare algorithm variations (and to tune hyperparameters).

Phishing: imbalanced data. Black box approach (abstraction): we have a generic binary classifier that predicts + or -, and optionally also outputs a confidence/score. How can we use it to solve our new problem? Can we do some pre-processing/post-processing of our data that allows us to still use our binary classifiers?
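One way to read the black-box abstraction is as a thin interface: the re-balancing ideas that follow only need train/predict (and optionally a confidence) and never look inside the classifier. A hypothetical sketch of that interface (names and signatures are my own, not from the slides):

```python
from typing import Protocol, Sequence

class BinaryClassifier(Protocol):
    """The only interface the re-balancing ideas below rely on."""
    def train(self, examples: Sequence, labels: Sequence[int]) -> None: ...
    def predict(self, example) -> int: ...          # 1 = positive, 0 = negative
    def confidence(self, example) -> float: ...     # optional score

def train_on_imbalanced(classifier: BinaryClassifier, examples, labels, rebalance):
    """Pre-process the data with some rebalancing scheme (subsampling,
    oversampling, ...), then train the untouched black-box classifier."""
    balanced = rebalance(examples, labels)          # list of (example, label) pairs
    classifier.train([x for x, _ in balanced], [y for _, y in balanced])
    return classifier
```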

Idea 1: subsampling. Turn the 99.997% / 0.003% labeled data into a 50% / 50% training set. Create a new training data set by including all k positive examples and randomly picking k negative examples. Pros/cons?

Subsampling. Pros: easy to implement; training becomes much more efficient (smaller training set); for some domains it can work very well. Cons: we are throwing away a lot of data/information.

Idea 2: oversampling. Again turn the 99.997% / 0.003% labeled data into a 50% / 50% training set, but this time create the new training data set by including all m negative examples and m positive examples, either repeating each positive example a fixed number of times or sampling with replacement. Pros/cons?

Oversampling. Pros: easy to implement; utilizes all of the training data; tends to perform well in a broader set of circumstances than subsampling. Cons: computationally expensive to train the classifier.
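A minimal sketch of the two resampling ideas (my own code, assuming examples come paired with 0/1 labels; not taken from the course materials):

```python
import random

def subsample(examples, labels):
    """Idea 1: keep all k positives, randomly pick k negatives."""
    positives = [(x, y) for x, y in zip(examples, labels) if y == 1]
    negatives = [(x, y) for x, y in zip(examples, labels) if y == 0]
    negatives = random.sample(negatives, k=len(positives))
    balanced = positives + negatives
    random.shuffle(balanced)
    return balanced

def oversample(examples, labels):
    """Idea 2: keep all m negatives, draw m positives with replacement."""
    positives = [(x, y) for x, y in zip(examples, labels) if y == 1]
    negatives = [(x, y) for x, y in zip(examples, labels) if y == 0]
    positives = random.choices(positives, k=len(negatives))
    balanced = positives + negatives
    random.shuffle(balanced)
    return balanced
```

Either function produces a 50/50 training set that can be handed to the unchanged black-box classifier.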

Idea 2b: weighted examples. Instead of resampling the 99.997% / 0.003% labeled data, add costs/weights to the training set: negative examples get weight 1 and positive examples get a much larger weight (e.g. 99.997/0.003 ≈ 33332). Then change the learning algorithm to optimize the weighted training error. Pros/cons?

Weighted examples. Pros: achieves the effect of oversampling without the computational cost; utilizes all of the training data; tends to perform well in a broader set of circumstances. Cons: requires a classifier that can deal with weights. Of our three classifiers, can all be modified to handle weights?

Building decision trees with weights: the algorithm is otherwise unchanged (calculate the score for each feature if we used it to split the data; pick the feature with the highest score, partition the data based on that feature value, and call recursively), but where we used the training error to decide which feature to choose, use the weighted training error instead. In general, any time we do a count, use the weighted count (e.g. in calculating the majority label at a leaf).

Idea 3: optimize a different error metric. Train classifiers that try to optimize the F1 measure or AUC, or come up with another learning algorithm designed specifically for imbalanced problems. Pros/cons?
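A tiny sketch of the "use the weighted count" rule (my own illustration), for example when picking the majority label at a decision-tree leaf or computing the weighted training error:

```python
from collections import defaultdict

def weighted_majority_label(labels, weights):
    """Majority label at a leaf, where every example counts by its weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def weighted_error(labels, predictions, weights):
    """Weighted training error: each mistake counts by its weight."""
    wrong = sum(w for y, p, w in zip(labels, predictions, weights) if y != p)
    return wrong / sum(weights)

# One positive (weight 33332) outweighs three negatives (weight 1 each).
print(weighted_majority_label([0, 0, 0, 1], [1, 1, 1, 33332]))  # 1
```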

Idea 3: optimize a different error metric. Train classifiers that try to optimize the F1 measure or AUC. Challenge: not all classifiers are amenable to this. Or, come up with another learning algorithm designed specifically for imbalanced problems; but we don't want to reinvent the wheel! That said, there are a number of approaches that have been developed specifically to handle imbalanced problems.
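One simple way to target a different metric without changing the learner at all (my own illustration of the idea, reusing the precision_recall and f_measure helpers sketched earlier; not a technique prescribed by the slides): if the black-box classifier outputs confidences, pick the decision threshold that maximizes F1 on a development set. This also answers the earlier deployment question of which threshold to use.

```python
def best_threshold_for_f1(labels, confidences):
    """Pick the confidence threshold that maximizes F1 on a dev set."""
    best_t, best_f1 = 0.0, -1.0
    for t in sorted(set(confidences)):
        predictions = [1 if c >= t else 0 for c in confidences]
        p, r = precision_recall(labels, predictions)
        f1 = f_measure(p, r)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

labels      = [1, 0, 1, 0, 1, 0, 0]
confidences = [0.9, 0.8, 0.75, 0.6, 0.55, 0.5, 0.2]
print(best_threshold_for_f1(labels, confidences))  # (0.55, 0.75) on this toy data
```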