
Classification

Classification is similar to regression in that the goal is to use covariates to predict an outcome. We still have a vector of covariates $X$. However, the response is binary (or one of a few classes): $Y \in \{0, 1\}$ or $Y \in \{-1, 1\}$.

Examples:
- $Y = 1$ if the user clicks on an ad; $X$ is the age, gender, ...
- $Y = 1$ if the customer picks Coke over Pepsi; $X$ is the age, gender, ...
- $Y = 1$ if the email is spam; $X$ is the email length, the number of commas, etc.
- $Y = 1$ if the patient responds to a treatment; $X$ are baseline measures
- $Y = 1$ if the patient has breast cancer; $X$ are 10,000 genetic markers

We need different models and measures of success.

Outline:
1. Logistic regression
2. Nearest neighbor methods
3. Discriminant analysis
4. Classification trees
5. Support vector machines

Logistic regression

Logistic regression is the most common method statisticians use to analyze binary data. It has nice interpretation and statistical properties, but often isn't the best for prediction. The logistic regression model is:

$$p_i = P(Y_i = 1 \mid X_i) = \operatorname{expit}(X_i^T \beta) = \frac{\exp(X_i^T \beta)}{1 + \exp(X_i^T \beta)},$$

or equivalently $\operatorname{logit}(p_i) = \log\{p_i / (1 - p_i)\} = X_i^T \beta$.

[Figure: plot of the logistic and expit functions]

Other link functions are possible, for example the probit link $p_i = \Phi(X_i^T \beta)$ or the complementary log-log link.

Interpretation of the logistic regression coefficients: $\beta_j$ is the change in the log odds of $Y = 1$ for a one-unit increase in $X_j$ with the other covariates held fixed; equivalently, a one-unit increase in $X_j$ multiplies the odds by $e^{\beta_j}$.
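A minimal R sketch of the expit (inverse logit) response curve; the intercept and slope values here are arbitrary illustrations:

```r
# Expit (inverse logit) maps the linear predictor to a probability.
expit <- function(z) 1 / (1 + exp(-z))

x <- seq(-4, 4, length.out = 200)
p <- expit(-1 + 2 * x)   # P(Y = 1 | x) with illustrative coefficients
plot(x, p, type = "l", ylab = "P(Y = 1 | x)",
     main = "Logistic response curve")
```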

Estimating the parameters

The MLE of $\beta$ is typically computed using Newton-Raphson optimization (iteratively reweighted least squares). The sampling distribution is Gaussian for large samples:

$$\hat\beta \approx \text{Normal}\left(\beta,\ (X^T W X)^{-1}\right), \quad W = \text{diag}\{p_i(1 - p_i)\}.$$

The glm function in R fits logistic regression.
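As a concrete sketch, the following simulates binary data and fits the model with glm; the sample size and coefficient values are made up for illustration:

```r
set.seed(1)
n <- 500
x1 <- rnorm(n); x2 <- rnorm(n)
p <- 1 / (1 + exp(-(-0.5 + 1.0 * x1 - 0.7 * x2)))  # true P(Y = 1 | x)
y <- rbinom(n, size = 1, prob = p)

fit <- glm(y ~ x1 + x2, family = binomial)  # logit link is the default
summary(fit)  # estimates and standard errors from the large-sample Gaussian
```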

Making classifications

Logistic regression produces the estimated probability of an event, $\hat p_i = \operatorname{expit}(x_i^T \hat\beta)$. In many cases the application requires a yes/no classification. One approach is to simply predict $\hat Y_i = 1$ if $\hat p_i > 0.5$ and $\hat Y_i = 0$ otherwise. Depending on the nature of the problem, other thresholds may be preferable. For example, if we assign loss $a$ to a false positive and loss $b$ to a false negative, the expected loss is minimized by predicting $\hat Y_i = 1$ when $\hat p_i > a / (a + b)$.
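A short sketch of threshold-based classification, continuing the glm fit above; the losses a and b are assumed values:

```r
p.hat <- predict(fit, type = "response")  # estimated P(Y = 1 | x)

y.hat.default <- as.integer(p.hat > 0.5)  # default 0.5 threshold

a <- 1  # loss for a false positive (assumed)
b <- 5  # loss for a false negative (assumed)
y.hat.loss <- as.integer(p.hat > a / (a + b))  # threshold 1/6

table(truth = y, predicted = y.hat.loss)  # confusion table
```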

Multi-class logistic regression

In some cases there are more than two possible outcomes, for example $Y_i \in$ {small, medium, large} or $Y_i \in$ {Democrat, Republican, Independent}. If the outcomes are ordered, the ordinal (proportional odds) logistic regression model is:

$$\operatorname{logit} P(Y_i \le k \mid X_i) = \alpha_k - X_i^T \beta, \quad k = 1, \dots, K - 1,$$

with ordered intercepts $\alpha_1 \le \dots \le \alpha_{K-1}$. If the outcomes are unordered, a common model is the discrete choice (multinomial logit) model:

$$P(Y_i = k \mid X_i) = \frac{\exp(X_i^T \beta_k)}{\sum_{l=1}^K \exp(X_i^T \beta_l)},$$

with one class's coefficients fixed (say $\beta_K = 0$) for identifiability.
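A minimal sketch using two common R implementations, polr from MASS for the ordered case and multinom from nnet for the unordered case; the simulated three-class data are purely illustrative:

```r
library(MASS)  # polr: proportional-odds ordinal logistic regression
library(nnet)  # multinom: multinomial (discrete choice) logit

set.seed(2)
n <- 300
x <- rnorm(n)
size <- cut(x + rnorm(n), breaks = 3,
            labels = c("small", "medium", "large"), ordered_result = TRUE)

fit.ord <- polr(size ~ x)                     # ordered outcomes
fit.mn  <- multinom(size ~ x, trace = FALSE)  # unordered outcomes

head(predict(fit.ord, type = "probs"))  # estimated class probabilities
```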

Large p

As with linear regression, the MLE is not unique if $p > n$. Also, the large-sample approximation to the sampling distribution is invalid if $p$ is large relative to $n$. Many of the same variable selection methods apply with slight modification:

- Forward/backward/stepwise selection: add or drop covariates greedily, comparing models by a criterion such as AIC or BIC.
- Sure independence screening: fit $p$ marginal (one-covariate) logistic regressions and keep the covariates with the strongest marginal associations (sketched below).
- PCA: replace $X$ with its first few principal components and fit the logistic model to the component scores.
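A sketch of sure independence screening for logistic regression; the simulated data and the cutoff of 20 retained covariates are arbitrary choices:

```r
set.seed(3)
n <- 200; p <- 1000
X <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, 1 / (1 + exp(-(X[, 1] - X[, 2]))))  # only 2 true signals

# Marginal association of each covariate: |z|-statistic from a
# one-covariate logistic regression.
z <- apply(X, 2, function(xj) {
  fit <- glm(y ~ xj, family = binomial)
  abs(summary(fit)$coefficients["xj", "z value"])
})

keep <- order(z, decreasing = TRUE)[1:20]  # retain the top 20 covariates
head(keep)
```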

Penalized regression

Penalized regression is also very popular. Conceptually it is the same as in the linear model, but computationally it is a bit more challenging. The LASSO becomes:

$$\hat\beta = \arg\min_\beta \left\{ -\log L(\beta) + \lambda \sum_{j=1}^p |\beta_j| \right\},$$

where $\log L(\beta)$ is the logistic log-likelihood. The R function glmnet uses coordinate descent to compute the solution. The large-sample approximation (LSA; Wang and Leng) replaces the negative log-likelihood with a quadratic expansion around the MLE, turning the problem into a penalized least-squares problem that can be fit in lars.
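A minimal glmnet sketch, reusing the simulated X and y from the screening example above:

```r
library(glmnet)

fit.lasso <- glmnet(X, y, family = "binomial")  # full lasso path
cv <- cv.glmnet(X, y, family = "binomial")      # cross-validate lambda
coef(cv, s = "lambda.min")[1:6, ]               # coefficients at chosen lambda
```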

Non-linear logistic regression

The expected value is non-linear in the covariates, but the model still relies on a linear predictor. Many of the non-linear regression methods extend to logistic regression. For example, the mgcv package in R fits logistic regression with a GAM. The model is:

$$\operatorname{logit}(p_i) = \beta_0 + f_1(X_{i1}) + \dots + f_p(X_{ip}),$$

where the $f_j$ are smooth functions estimated from the data. A neural network for logistic regression replaces the linear predictor with a composition of layers, e.g. with one hidden layer:

$$\operatorname{logit}(p_i) = \gamma_0 + \sum_k \gamma_k\, \sigma(\alpha_k^T X_i),$$

where $\sigma$ is an activation function.
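A minimal logistic GAM sketch with mgcv; the nonlinear truth is simulated for illustration:

```r
library(mgcv)  # gam: generalized additive models

set.seed(4)
n <- 400
x <- runif(n, -3, 3)
y <- rbinom(n, 1, 1 / (1 + exp(-sin(2 * x))))  # nonlinear true logit

fit.gam <- gam(y ~ s(x), family = binomial)  # smooth term for x
plot(fit.gam)  # estimated smooth effect on the logit scale
```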

Large n - Stochastic gradient descent

Now say we have $n = 10$M observations. Newton-Raphson needs the gradient and Hessian of the log-likelihood, and both are sums over all $n$ observations, which is expensive. Stochastic gradient descent (SGD) approximates the gradient at each step using a small random mini-batch $B_t$ of observations:

$$\beta^{(t+1)} = \beta^{(t)} + \gamma_t \frac{1}{|B_t|} \sum_{i \in B_t} x_i \left\{ y_i - \operatorname{expit}(x_i^T \beta^{(t)}) \right\},$$

where $\gamma_t$ is a decreasing step size.
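A toy SGD implementation for logistic regression; the batch size and step-size schedule are assumed, not tuned:

```r
set.seed(5)
n <- 1e5; p <- 5
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
beta.true <- c(-1, 2, -1, 0.5, 0)
y <- rbinom(n, 1, 1 / (1 + exp(-X %*% beta.true)))

expit <- function(z) 1 / (1 + exp(-z))
beta <- rep(0, p)
m <- 100                          # mini-batch size (assumed)
for (t in 1:2000) {
  i <- sample(n, m)               # random mini-batch
  grad <- crossprod(X[i, ], y[i] - expit(X[i, ] %*% beta)) / m
  beta <- beta + grad / sqrt(t)   # decreasing step size (assumed)
}
round(cbind(truth = beta.true, sgd = as.vector(beta)), 2)
```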

Large n - meta-analysis

Another form of approximation relies on asymptotic normality. Say we split the data into $B$ subsets. From subset $b$, the estimate and its approximate sampling distribution are:

$$\hat\beta_b \approx \text{Normal}(\beta, \hat\Sigma_b).$$

These first-stage estimates can be computed in parallel. We then treat the first-stage estimates as the data for our second-stage analysis. The pooled (inverse-variance weighted) estimate is:

$$\hat\beta = \left( \sum_{b=1}^B \hat\Sigma_b^{-1} \right)^{-1} \sum_{b=1}^B \hat\Sigma_b^{-1} \hat\beta_b.$$
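A sketch of the split-and-pool estimator; the number of subsets and the simulated data are illustrative, and the lapply loop could be run in parallel:

```r
set.seed(6)
n <- 1e5
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 - x))))

B <- 10
fold <- sample(rep(1:B, length.out = n))

# First stage: fit each subset separately (parallelizable).
fits <- lapply(1:B, function(b) glm(y[fold == b] ~ x[fold == b],
                                    family = binomial))

# Second stage: inverse-variance weighted pooling.
Sinv <- lapply(fits, function(f) solve(vcov(f)))
num  <- Reduce(`+`, Map(function(S, f) S %*% coef(f), Sinv, fits))
beta.pool <- solve(Reduce(`+`, Sinv)) %*% num
beta.pool
```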

glmnet with large n

If $n$ is large and $p \ll n$, the lasso probably won't improve the fit: the MLE is already stable, and shrinkage mainly adds bias. However, if both $n$ and $p$ are massive, the lasso remains attractive, since coordinate descent exploits sparsity to keep both the computation and the fitted model manageable.

Nearest neighbors

Nearest neighbor methods can be applied to classification. Let $d_{ij}$ be the distance between $X_i$ and $X_j$, for example $d_{ij} = \|X_i - X_j\|$. For a new observation $Y_0$ with covariates $X_0$, let $N_k(X_0)$ index the $k$ nearest training points. The predicted probability of $Y_0 = 1$ is the proportion of neighbors with $Y_i = 1$:

$$\hat p_0 = \frac{1}{k} \sum_{i \in N_k(X_0)} Y_i.$$

The classifier is then the majority vote: $\hat Y_0 = I(\hat p_0 > 1/2)$.
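A minimal sketch with the knn function from the class package; the data and the choice k = 15 are arbitrary:

```r
library(class)  # knn: k-nearest neighbor classification

set.seed(7)
n <- 200
X <- matrix(rnorm(n * 2), n, 2)
y <- factor(as.integer(X[, 1] + X[, 2] + rnorm(n) > 0))

X0 <- matrix(c(0.5, -0.2), 1, 2)  # a new observation
knn(train = X, test = X0, cl = y, k = 15, prob = TRUE)  # vote + proportion
```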

Evaluating classification accuracy

Cross-validation is a robust way to compare classifiers. A soft classifier (SC) gives a probability (or some other weight) to each class, $\hat p_i \in [0, 1]$. A hard classifier (HC) definitively picks a class, $\hat Y_i \in \{0, 1\}$. An SC can be converted to an HC by thresholding, $\hat Y_i = I(\hat p_i > c)$.

The Brier score is a common way to compare SCs:

$$\text{BS} = \frac{1}{n} \sum_{i=1}^n (\hat p_i - Y_i)^2.$$

HCs are compared by summaries of the contingency/confusion table (computed in the sketch below):

- Classification accuracy: the proportion of all observations classified correctly.
- False positive rate: the proportion of true negatives ($Y_i = 0$) classified as positive.
- True positive rate (sensitivity): the proportion of true positives ($Y_i = 1$) classified as positive.
- False negative rate: the proportion of true positives classified as negative.
- True negative rate (specificity): the proportion of true negatives classified as negative.
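A self-contained sketch computing the Brier score and confusion-table summaries; the simulated data are illustrative:

```r
set.seed(8)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-x)))
p.hat <- predict(glm(y ~ x, family = binomial), type = "response")

brier <- mean((p.hat - y)^2)                # Brier score for the SC

y.hat <- as.integer(p.hat > 0.5)            # threshold to get an HC
tab <- table(truth = y, predicted = y.hat)  # confusion table
accuracy <- sum(diag(tab)) / sum(tab)
tpr <- tab["1", "1"] / sum(tab["1", ])      # true positive rate
fpr <- tab["0", "1"] / sum(tab["0", ])      # false positive rate
round(c(brier = brier, accuracy = accuracy, tpr = tpr, fpr = fpr), 3)
```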

Evaluating classification accuracy

The receiver operating characteristic (ROC) curve evaluates HCs created by thresholding SCs. Let $c$ be the threshold, so that $\hat Y_c = I(\hat p > c)$. For each $c$ we compute the FPR and TPR, and the ROC curve plots the TPR as a function of the FPR. A common one-number summary of the ROC curve is the area under the ROC curve (AUC). AUC near one is perfect; AUC equal to 0.5 is random guessing. It can be shown that the AUC equals the probability of ranking a randomly chosen positive observation higher than a randomly chosen negative observation. The R function ROC computes the ROC curve and AUC.
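A from-scratch sketch of the ROC curve and the rank interpretation of the AUC (ties ignored); in practice a package function would be used:

```r
set.seed(9)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-2 * x)))
p.hat <- predict(glm(y ~ x, family = binomial), type = "response")

cs <- sort(unique(p.hat), decreasing = TRUE)  # candidate thresholds
tpr <- sapply(cs, function(c) mean(p.hat[y == 1] > c))
fpr <- sapply(cs, function(c) mean(p.hat[y == 0] > c))
plot(fpr, tpr, type = "l", main = "ROC curve")
abline(0, 1, lty = 2)  # random-guessing reference line

# AUC: probability a random positive outranks a random negative
mean(outer(p.hat[y == 1], p.hat[y == 0], `>`))
```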

Discriminant analysis (DA)

DA is an alternative method of classification with a tie to logistic regression. In classification we want $f(y \mid x)$. DA turns this around by estimating $f(y)$ and $f(x \mid y)$ and then applying Bayes' theorem for classification:

$$P(Y = k \mid X = x) = \frac{f(x \mid y = k)\, P(Y = k)}{\sum_l f(x \mid y = l)\, P(Y = l)}.$$

This is only advantageous when $f(x \mid y)$ is very simple. For example, linear discriminant analysis (LDA) takes $X \mid Y = k \sim \text{Normal}(\mu_k, \Sigma)$ with a common covariance matrix across classes, which gives linear decision boundaries.
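A minimal LDA sketch with the lda function from MASS; the two-class simulated data are illustrative:

```r
library(MASS)  # lda: linear discriminant analysis

set.seed(10)
n <- 200
y <- rbinom(n, 1, 0.4)
X <- matrix(rnorm(n * 2), n, 2) + 1.5 * y  # shifted means, common covariance

fit.lda <- lda(X, grouping = factor(y))
pred <- predict(fit.lda, X)
head(pred$posterior)           # estimated P(Y = k | X) via Bayes' theorem
mean(pred$class == factor(y))  # training accuracy
```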

Discriminant analysis (DA)

The naive Bayes classifier assumes the elements of $X$ are independent given $Y$. For example, if $X_j \mid Y = y \sim \text{Normal}(\mu_{jy}, \sigma_{jy}^2)$, then we set $\mu_{jy}$ and $\sigma_{jy}$ using sample moments, and the classifier picks the class $y$ that maximizes

$$\hat P(Y = y) \prod_{j=1}^p \phi\!\left(x_j;\, \hat\mu_{jy}, \hat\sigma_{jy}^2\right),$$

where $\phi$ is the normal density. Connection to logistic regression: under the LDA model, the log odds $\log\{P(Y = 1 \mid X) / P(Y = 0 \mid X)\}$ is linear in $X$, so DA implies a logistic regression model. LDA estimates the coefficients through the Gaussian model for $X \mid Y$, whereas logistic regression maximizes the conditional likelihood directly.
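A from-scratch Gaussian naive Bayes sketch with moments estimated per feature and class; the data are illustrative:

```r
set.seed(11)
n <- 300
y <- rbinom(n, 1, 0.5)
X <- matrix(rnorm(n * 3), n, 3) + y  # class-1 features shifted by 1

mu <- sapply(0:1, function(k) colMeans(X[y == k, ]))      # per-class means
s  <- sapply(0:1, function(k) apply(X[y == k, ], 2, sd))  # per-class sds
prior <- table(y) / n                                     # estimated P(Y = y)

x0 <- c(0.5, 0.5, 0.5)  # a new observation
logpost <- sapply(1:2, function(k) {
  log(prior[k]) + sum(dnorm(x0, mu[, k], s[, k], log = TRUE))
})
which.max(logpost) - 1  # predicted class (0 or 1)
```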