Least Squares Classification


Stephen Boyd
EE103, Stanford University
November 4, 2017

Outline
- Classification
- Least squares classification
- Multi-class classifiers

Classification
- data fitting with an outcome that takes on (non-numerical) values like true or false; spam or not spam; dog, horse, or mouse
- outcome values are called labels or categories
- this kind of data fitting is called classification
- we start with the case when there are two possible outcomes, called Boolean or 2-way classification
- we encode outcomes as +1 (true) and −1 (false)
- classifier has form ŷ = f̂(x), with f̂ : R^n → {−1, +1}

Applications
- email spam detection: x contains features of an email message (word counts, ...)
- financial transaction fraud detection: x contains features of the proposed transaction, initiator
- document classification (say, politics or not): x is the word count histogram of the document
- disease detection: x contains patient features, results of medical tests
- digital communications receiver: y is the transmitted bit; x contains n measurements of the received signal

Prediction errors
- data point (x, y), predicted outcome ŷ = f̂(x)
- only four possibilities:
  - True positive: y = +1 and ŷ = +1
  - True negative: y = −1 and ŷ = −1
  (in these two cases, the prediction is correct)
  - False positive: y = −1 and ŷ = +1
  - False negative: y = +1 and ŷ = −1
  (in these two cases, the prediction is wrong)
- the errors have many other names, like Type I and Type II

Confusion matrix
- given data set x^(1), ..., x^(N), y^(1), ..., y^(N) and classifier f̂
- count each of the four outcomes

              ŷ = +1         ŷ = −1         total
    y = +1    N_tp           N_fn           N_p
    y = −1    N_fp           N_tn           N_n
    total     N_tp + N_fp    N_fn + N_tn    N

- off-diagonal terms are prediction errors
- many error rates and accuracy measures are used
  - error rate is (N_fp + N_fn)/N
  - true positive (or recall) rate is N_tp/N_p
  - false positive rate (or false alarm rate) is N_fp/N_n
- a proposed classifier is judged by its error rate(s) on a test set
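(The following numpy sketch is not part of the original slides; it tallies the four outcomes from ±1-encoded labels and predictions and computes the rates defined above. The function and variable names are my own, and the usage example reuses the spam filter counts from the next slide.)

```python
import numpy as np

def confusion_counts(y, yhat):
    """Tally the four outcomes for +/-1 encoded labels y and predictions yhat."""
    y, yhat = np.asarray(y), np.asarray(yhat)
    Ntp = np.sum((y == +1) & (yhat == +1))
    Ntn = np.sum((y == -1) & (yhat == -1))
    Nfp = np.sum((y == -1) & (yhat == +1))
    Nfn = np.sum((y == +1) & (yhat == -1))
    return Ntp, Nfn, Nfp, Ntn

# usage: reproduce the spam filter example on the next slide
y    = np.concatenate([np.ones(127), -np.ones(1139)])
yhat = np.concatenate([np.ones(95), -np.ones(32), np.ones(19), -np.ones(1120)])
Ntp, Nfn, Nfp, Ntn = confusion_counts(y, yhat)
N, Np, Nn = len(y), Ntp + Nfn, Nfp + Ntn
print("error rate:", (Nfp + Nfn) / N)        # (19 + 32)/1266, about 4.03%
print("true positive (recall) rate:", Ntp / Np)
print("false positive rate:", Nfp / Nn)      # 19/1139, about 1.67%
```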

Example
spam filter performance on a test set (say)

              ŷ = +1   ŷ = −1   total
    y = +1        95       32     127
    y = −1        19     1120    1139
    total        114     1152    1266

- error rate is (19 + 32)/1266 = 4.03%
- false positive rate is 19/1139 = 1.67%

Outline
- Classification
- Least squares classification
- Multi-class classifiers

Least squares classification
- fit model f̃ to the encoded (±1) y^(i) values using standard least squares data fitting
- f̃(x) should be near +1 when y = +1, and near −1 when y = −1
- f̃(x) is a number
- use model f̂(x) = sign(f̃(x))
- (size of f̃(x) is related to the confidence in the prediction)
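(A minimal numpy sketch of this idea, not part of the original slides: fit a linear model to the ±1 labels with ordinary least squares, then threshold its value at zero. The function names and the synthetic data in the usage example are my own; it assumes the feature matrix already includes a constant-1 column.)

```python
import numpy as np

def ls_classifier_fit(X, y):
    """Least squares fit of a linear model ftilde(x) = x @ theta to +/-1 labels.
    X is an N x n feature matrix (with a constant-1 column), y is an N-vector of +/-1."""
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def ls_classifier_predict(X, theta):
    """Boolean classifier fhat(x) = sign(ftilde(x)); ties at zero mapped to +1."""
    return np.where(X @ theta >= 0, 1, -1)

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.standard_normal((200, 5))])
y = np.where(X[:, 1] + 0.5 * X[:, 2] > 0, 1.0, -1.0)
theta = ls_classifier_fit(X, y)
print("training error rate:", np.mean(ls_classifier_predict(X, theta) != y))
```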

Handwritten digits example
- MNIST data set of 70,000 28 × 28 images of digits 0, ..., 9
- divided into training set (60,000) and test set (10,000)
- x is a 494-vector: the constant 1 plus the 493 pixel values that are nonzero in at least one image in the data set
- y = +1 if the digit is 0; −1 otherwise

Least squares classifier results

training set results (error rate 1.6%)

              ŷ = +1   ŷ = −1   total
    y = +1      5165      758    5923
    y = −1       179    53898   54077
    total       5344    54656   60000

test set results (error rate 1.6%)

              ŷ = +1   ŷ = −1   total
    y = +1       864      116     980
    y = −1        42     8978    9020
    total        906     9094   10000

we can likely achieve 1.6% error rate on unseen images

[Figure: distribution of values of f̃(x^(i)) over the training set, plotted as fractions, separately for positive and negative examples.]

Coefficients in least squares classifier
[Figure: the coefficients of the least squares classifier.]

Skewed decision threshold
- use predictor f̂(x) = sign(f̃(x) − α), i.e., α is the decision threshold:

      f̂(x) = +1 if f̃(x) ≥ α,  −1 if f̃(x) < α

- for positive α, false positive rate is lower but so is true positive rate
- for negative α, false positive rate is higher but so is true positive rate
- trade-off curve of true positive versus false positive rates is called the receiver operating characteristic (ROC)
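(A small numpy sketch of this threshold sweep, not part of the original slides; the function name is my own. Plotting the returned true positive rates against the false positive rates over a range of α traces out the ROC curve shown two slides below.)

```python
import numpy as np

def rates_vs_threshold(ftilde, y, alphas):
    """For each threshold alpha, return (false positive rate, true positive rate)
    of the skewed classifier fhat(x) = sign(ftilde(x) - alpha)."""
    ftilde, y = np.asarray(ftilde), np.asarray(y)
    Np, Nn = np.sum(y == 1), np.sum(y == -1)
    pts = []
    for alpha in alphas:
        yhat = np.where(ftilde >= alpha, 1, -1)
        tpr = np.sum((yhat == 1) & (y == 1)) / Np
        fpr = np.sum((yhat == 1) & (y == -1)) / Nn
        pts.append((fpr, tpr))
    return np.array(pts)  # one (fpr, tpr) row per value of alpha
```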

Example
[Figure: distribution of f̃(x_i) for positive and negative examples, and the true positive, false positive, and total error rates as functions of the threshold α.]

ROC curve
[Figure: ROC curve of true positive rate versus false positive rate, with points marked at α = −0.25, α = 0, and α = 0.25.]

Outline
- Classification
- Least squares classification
- Multi-class classifiers

Multi-class classifiers
- we have K > 2 possible labels, with label set {1, ..., K}
- predictor is f̂ : R^n → {1, ..., K}
- for a given predictor and data set, the confusion matrix is K × K
- some off-diagonal entries may be much worse than others

Examples
- handwritten digit classification: guess the digit written, from the pixel values
- marketing demographic classification: guess the demographic group, from purchase history
- disease diagnosis: guess the diagnosis from among a set of candidates, from test results and patient features
- translation word choice: choose how to translate a word from among several choices, given context features
- document topic prediction: guess the topic from the word count histogram

Least squares multi-class classifier
- create a least squares classifier f̃_l for each label versus the others
- take as classifier

      f̂(x) = argmax_{l ∈ {1, ..., K}} f̃_l(x)

  (i.e., choose l with the largest value of f̃_l(x))
- for example, with f̃_1(x) = −0.7, f̃_2(x) = +0.2, f̃_3(x) = +0.8, we choose f̂(x) = 3
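(A numpy sketch of this one-versus-others construction, not part of the original slides; the function names are my own. Each column of Theta holds the coefficients of one least squares classifier f̃_l, and prediction takes the argmax over labels.)

```python
import numpy as np

def ls_multiclass_fit(X, y, K):
    """For each label l in 1..K, fit ftilde_l by least squares to +1/-1 targets
    meaning 'label l' versus 'any other label'. X is N x n, y has entries in 1..K."""
    Theta = np.empty((X.shape[1], K))
    for l in range(1, K + 1):
        targets = np.where(y == l, 1.0, -1.0)
        Theta[:, l - 1], *_ = np.linalg.lstsq(X, targets, rcond=None)
    return Theta

def ls_multiclass_predict(X, Theta):
    """fhat(x) = argmax over l of ftilde_l(x), returned as a label in 1..K."""
    return np.argmax(X @ Theta, axis=1) + 1
```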

Handwritten digit classification
confusion matrix, test set

                                    Prediction
  Digit     0     1     2     3     4     5     6     7     8     9   Total
    0     944     0     1     2     2     8    13     2     7     1     980
    1       0  1107     2     2     3     1     5     1    14     0    1135
    2      18    54   815    26    16     0    38    22    39     4    1032
    3       4    18    22   884     5    16    10    22    20     9    1010
    4       0    22     6     0   883     3     9     1    12    46     982
    5      24    19     3    74    24   656    24    13    38    17     892
    6      17     9    10     0    22    17   876     0     7     0     958
    7       5    43    14     6    25     1     1   883     1    49    1028
    8      14    48    11    31    26    40    17    13   756    18     974
    9      16    10     3    17    80     0     1    75     4   803    1009
  All    1042  1330   887  1042  1086   742   994  1032   898   947   10000

error rate is around 14% (same as for the training set)

Adding new features
- let's add 5000 random features (!), max{(Rx)_j, 0}
- R is a 5000 × 494 matrix with entries ±1, chosen randomly
- now use least squares classification with the 5494-vector of features
- results: training set error 1.5%, test set error 2.6%
- can do better with a little more thought in generating new features
- indeed, even better than humans can do (!!)
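(A numpy sketch of this feature construction, not part of the original slides; the function name and the seed argument are my own. It appends m features of the form max{(Rx)_j, 0} with R a random ±1 matrix, as described above.)

```python
import numpy as np

def add_random_features(X, m, seed=0):
    """Append m random features max{(R x)_j, 0}, with R a random +/-1 matrix.
    X is N x n; the result is N x (n + m)."""
    rng = np.random.default_rng(seed)
    R = rng.choice([-1.0, 1.0], size=(m, X.shape[1]))
    new = np.maximum(X @ R.T, 0)  # clipped random +/-1 combinations of the original features
    return np.hstack([X, new])

# e.g., for the 494-dimensional digit features: X_aug = add_random_features(X, 5000)
```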

Results with new features
confusion matrix, test set

                                    Prediction
  Digit     0     1     2     3     4     5     6     7     8     9   Total
    0     972     0     0     2     0     1     1     1     3     0     980
    1       0  1126     3     1     1     0     3     0     1     0    1135
    2       6     0   998     3     2     0     4     7    11     1    1032
    3       0     0     3   977     0    13     0     5     8     4    1010
    4       2     1     3     0   953     0     6     3     1    13     982
    5       2     0     1     5     0   875     5     0     3     1     892
    6       8     3     0     0     4     6   933     0     4     0     958
    7       0     8    12     0     2     0     1   992     3    10    1028
    8       3     1     3     6     4     3     2     2   946     4     974
    9       4     3     1    12    11     7     1     3     3   964    1009
  All     997  1142  1024  1006   977   905   956  1013   983   997   10000