BANA 7046 Data Mining I Lecture 4. Logistic Regression and Classification


BANA 7046 Data Mining I Lecture 4. Logistic Regression and Classification. Shaobo Li, University of Cincinnati. Partially based on Hastie et al. (2009) ESL and James et al. (2013) ISLR. Data Mining I Lecture 4. Logistic Regression 1 / 23

From Continuous to Categorical Outcome
The response variable, Y, is categorical. Examples:
- Banking: default vs. nondefault
- Medical: disease positive vs. negative
- Computer vision: sign recognition by self-driving cars
- Many others...

Classification
Denote C(X) as a classifier. Most data mining algorithms estimate the probability that X belongs to each class; a classification is then produced by a specific decision rule. Example: the model predicts that the probability of default is 0.2. Then:
- Cut-off threshold > 0.2: classify as Nondefault
- Cut-off threshold < 0.2: classify as Default

Classification Methods
- K-nearest neighbors
- Logistic regression
- Classification tree
- Discriminant analysis
- Support vector machine
- Neural networks
- Deep learning
- ...
Is clustering a classification model?

Why Not Linear Regression?
Example: default prediction
- Default (Y = 1) vs. Nondefault (Y = 0)
- X_1: credit card balance level, X_2: income level
Suppose the estimated linear regression is Ŷ = 1.5 + 2 X_1 - X_2. What is the predicted value if a person's balance level is 1 and income level is 3? How do we interpret this value?

An Illustration [figure not captured in transcription]

Logistic Regression
- Generalized linear model
- Logistic model for a binary response, as a function of X:
  P(y_i = 1 | x_i) = exp(x_i^T β) / (1 + exp(x_i^T β))
- The outcome is the predicted probability of the event
- More than two classes: multinomial logistic model
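As a quick sketch of how this probability is computed (the coefficient and predictor values below are made up for illustration):

```python
import numpy as np

def predict_prob(X, beta):
    """P(y_i = 1 | x_i) = exp(x_i^T beta) / (1 + exp(x_i^T beta)),
    written in the equivalent sigmoid form 1 / (1 + exp(-x_i^T beta))."""
    return 1.0 / (1.0 + np.exp(-(X @ beta)))

# Hypothetical coefficients: intercept -1, slope 2; here x^T beta = 0
beta = np.array([-1.0, 2.0])
X = np.array([[1.0, 0.5]])  # first column is the intercept term
print(predict_prob(X, beta))  # [0.5]
```

The output is always strictly between 0 and 1, which is exactly what linear regression fails to guarantee.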

Generalized Linear Models
Still linear models. Three components:
1. Probability distribution of the response variable, e.g. binary (Bernoulli), Poisson, Gamma, ...
2. Linear predictor: η = β_0 + β_1 X_1 + ... + β_p X_p
3. Link function: g[E(Y)] = η, or E(Y) = g^{-1}(η)
Other link functions (e.g. probit) are also used for binary responses.

Odds and Interpretation of β
The logistic model is also called the log-odds model.
Odds: a ratio of probabilities, Odds(X) = P(Y = 1 | X) / P(Y = 0 | X).
Logit link (logit transformation):
logit(P) = log( P / (1 - P) ) = β_0 + β_1 X_1 + ... + β_p X_p
By simple algebra, holding all X's fixed except X_j,
β_j = log( Odds(X_j + 1) / Odds(X_j) ),
which is the log of the odds ratio.
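A small numeric check of this interpretation (the coefficients are invented): increasing X_j by one unit multiplies the odds by exp(β_j), regardless of where X_j starts.

```python
import numpy as np

def prob(z):
    """Logistic function: P(Y = 1) given linear predictor z."""
    return 1 / (1 + np.exp(-z))

def odds(p):
    """Odds = P(Y = 1) / P(Y = 0)."""
    return p / (1 - p)

beta0, beta1 = -1.0, 0.7   # hypothetical intercept and slope
x = 2.0
ratio = odds(prob(beta0 + beta1 * (x + 1))) / odds(prob(beta0 + beta1 * x))
print(np.isclose(ratio, np.exp(beta1)))  # True: the odds ratio equals exp(beta_j)
```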

Multinomial Logit Model
Response Y = 1, 2, ..., K (K classes). Given predictors x_i:
log( P(y_i = 2) / P(y_i = 1) ) = β_2^T x_i
log( P(y_i = 3) / P(y_i = 1) ) = β_3^T x_i
...
log( P(y_i = K) / P(y_i = 1) ) = β_K^T x_i
The first class (class 1) is the reference. There are (K - 1) × (p + 1) coefficients to be estimated.
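The (K - 1) log-odds equations above determine all K class probabilities once β_1 = 0 is fixed for the reference class. A minimal sketch, with invented coefficient vectors:

```python
import numpy as np

def multinomial_probs(x, betas):
    """Class probabilities with class 1 as the reference:
    log(P(y=k)/P(y=1)) = beta_k^T x for k = 2..K, so with eta_1 = 0,
    P(y=k) = exp(eta_k) / sum_l exp(eta_l)."""
    etas = np.array([0.0] + [b @ x for b in betas])  # betas for classes 2..K
    expd = np.exp(etas)
    return expd / expd.sum()

x = np.array([1.0, 0.5])                                # [intercept, x1]
betas = [np.array([0.2, -1.0]), np.array([-0.5, 0.3])]  # hypothetical beta_2, beta_3
p = multinomial_probs(x, betas)
print(p.sum())  # 1.0: a proper distribution over K = 3 classes
```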

Estimation for the Binary Logit Model
Maximum likelihood estimation, with y_i | x_i ~ Bernoulli(p_i(x_i)).
Likelihood of y_i | x_i:
L(y_i; x_i, β) = p_i^{y_i} (1 - p_i)^{1 - y_i}
              = ( exp(x_i^T β) / (1 + exp(x_i^T β)) )^{y_i} ( 1 / (1 + exp(x_i^T β)) )^{1 - y_i}
By simple algebra, the total log-likelihood is (shown in exercise)
l(β) = Σ_{i=1}^{n} { y_i x_i^T β - log(1 + exp(x_i^T β)) }
Numerical optimization: Newton's method.
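A self-contained sketch of the Newton-Raphson iteration for this log-likelihood, run on simulated data (the simulation setup and iteration count are invented for illustration):

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson for l(beta) = sum_i [ y_i x_i^T beta - log(1 + exp(x_i^T beta)) ]."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-(X @ beta)))     # current fitted probabilities
        grad = X.T @ (y - p)                  # gradient (score vector)
        W = p * (1 - p)                       # Bernoulli variances under current fit
        hess = X.T @ (X * W[:, None])         # negative Hessian of l
        beta = beta + np.linalg.solve(hess, grad)  # Newton update
    return beta

# Simulate data from a known logistic model, then recover the coefficients
rng = np.random.default_rng(0)
x = rng.normal(size=500)
X = np.column_stack([np.ones(500), x])
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + 1.5 * x))))
print(fit_logistic(X, y))  # estimates should land near (0.5, 1.5)
```

Each Newton step solves a weighted least-squares problem, which is why this procedure is also known as iteratively reweighted least squares (IRLS).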

Prediction: From Probability to Class
- Direct outcome of the model: a probability
- Next step: classification
- A decision rule is needed (cut-off probability, p-cut)
- The rule is not unique

Confusion Matrix
A classification table based on a specific cut-off probability, used for model assessment.

           Pred = 1             Pred = 0
True = 1   True Positive (TP)   False Negative (FN)
True = 0   False Positive (FP)  True Negative (TN)

FP: type I error; FN: type II error.
A different p-cut results in a different confusion matrix.
Try to understand this table instead of memorizing it!

Some Useful Measures
- Misclassification rate (MR) = (FP + FN) / Total
- True positive rate (TPR) = TP / (TP + FN): sensitivity, or recall
- True negative rate (TNR) = TN / (FP + TN): specificity
- False positive rate (FPR) = FP / (FP + TN): 1 - specificity
- False negative rate (FNR) = FN / (TP + FN): 1 - sensitivity
- Positive predictive rate (PPR) = TP / (TP + FP): precision
- False discovery rate (FDR) = FP / (TP + FP): 1 - precision
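All of these follow directly from the four confusion-matrix cells; a sketch with made-up counts:

```python
def rates(tp, fn, fp, tn):
    """Measures computed from the four confusion-matrix cells."""
    total = tp + fn + fp + tn
    return {
        "MR":  (fp + fn) / total,   # misclassification rate
        "TPR": tp / (tp + fn),      # sensitivity / recall
        "TNR": tn / (fp + tn),      # specificity
        "FPR": fp / (fp + tn),      # 1 - specificity
        "FNR": fn / (tp + fn),      # 1 - sensitivity
        "PPR": tp / (tp + fp),      # precision
        "FDR": fp / (tp + fp),      # 1 - precision
    }

m = rates(tp=40, fn=10, fp=130, tn=320)
print(m["MR"], m["TPR"])  # 0.28 0.8
```

Note the built-in redundancies: TPR + FNR = 1, TNR + FPR = 1, and PPR + FDR = 1.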

ROC Curve
- Receiver Operating Characteristic
- Plot of FPR (x-axis) against TPR (y-axis) at various p-cut values
- An overall model assessment (not tied to a particular decision rule)
- Unique for a given model
- Area under the curve (AUC): a measure of goodness of fit
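Sweeping the cut-off and computing (FPR, TPR) at each value traces the curve; the AUC can also be computed from its rank interpretation, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. A sketch with made-up scores:

```python
import numpy as np

def roc_point(scores, y, cut):
    """(FPR, TPR) for one cut-off; sweeping the cut-off traces the ROC curve."""
    pred = (scores >= cut).astype(int)
    tpr = np.sum((pred == 1) & (y == 1)) / np.sum(y == 1)
    fpr = np.sum((pred == 1) & (y == 0)) / np.sum(y == 0)
    return fpr, tpr

def auc(scores, y):
    """AUC via its rank interpretation: P(score_pos > score_neg), ties count 1/2."""
    pos, neg = scores[y == 1], scores[y == 0]
    diff = pos[:, None] - neg[None, :]          # all (positive, negative) pairs
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

scores = np.array([0.9, 0.8, 0.4, 0.3, 0.2])
y = np.array([1, 1, 0, 1, 0])
print(auc(scores, y))  # 5/6: one of the six (pos, neg) pairs is mis-ranked
```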

ROC Curve [figure not captured in transcription]

Precision and Recall
- More informative measures for imbalanced data; widely used in document retrieval
- Precision = TP / (TP + FP): the fraction of retrieved instances that are relevant
- Recall = TP / (TP + FN): the fraction of relevant instances that are retrieved
- Neither incorporates TN (Y = 1 is of more interest)
- F-score: F = 2 × precision × recall / (precision + recall)
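The F-score is the harmonic mean of precision and recall, so it is high only when both are; a quick check with made-up counts:

```python
def f_score(tp, fp, fn):
    """F = 2 * precision * recall / (precision + recall)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# A classifier that flags many cases: high recall (0.8) but low precision (~0.24)
print(f_score(tp=40, fp=130, fn=10))  # 4/11, roughly 0.364
```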

Precision-Recall Curve [figure not captured in transcription]

Asymmetric Cost
Example: compare the following two confusion matrices, based on two different p-cut values.

First matrix:
           Pred = 1   Pred = 0
True = 1   10         40
True = 0   10         440

Second matrix:
           Pred = 1   Pred = 0
True = 1   40         10
True = 0   130        320

Which one is better? In terms of what? What if this is a loan application setting, where Y = 1 is a default customer? A default will cost much more than rejecting a loan application.
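One way to make the comparison concrete is to weight the two error types differently; the 5:1 cost ratio below is an arbitrary illustration, not taken from the slides:

```python
def total_cost(fp, fn, w_fp=1.0, w_fn=5.0):
    """Asymmetric loss: here a missed default (FN) is taken to be 5x as costly
    as rejecting a good applicant (FP)."""
    return w_fp * fp + w_fn * fn

print(total_cost(fp=10, fn=40))   # first matrix:  1*10 + 5*40 = 210
print(total_cost(fp=130, fn=10))  # second matrix: 1*130 + 5*10 = 180
```

Under this cost structure the second matrix is preferred, even though its misclassification rate (0.28) is much worse than the first's (0.10).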

Choice of Decision Threshold (p-cut)
Do NOT simply use 0.5!
In general, we use a grid search to optimize a measure of classification accuracy or loss:
- a cost function (symmetric or asymmetric), or
- the F-score based on precision and recall.
The grid search can be combined with cross-validation.
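A minimal sketch of the grid search (without the cross-validation step; the scores, labels, and cost weights are invented):

```python
import numpy as np

def best_cut(scores, y, w_fp=1.0, w_fn=5.0):
    """Search a grid of cut-offs for the one minimizing an asymmetric cost."""
    grid = np.arange(0.01, 1.0, 0.01)
    costs = []
    for t in grid:
        pred = (scores >= t).astype(int)
        fp = np.sum((pred == 1) & (y == 0))
        fn = np.sum((pred == 0) & (y == 1))
        costs.append(w_fp * fp + w_fn * fn)
    return grid[int(np.argmin(costs))]

scores = np.array([0.1, 0.2, 0.3, 0.6, 0.7, 0.8])  # fitted probabilities
y = np.array([0, 0, 1, 0, 1, 1])
print(best_cut(scores, y))  # a low cut-off, since missing a positive costs 5x
```

Because false negatives are weighted heavily, the optimal cut-off falls well below 0.5 here, illustrating the warning above.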

Discriminant Analysis
Based on Bayes' theorem:
P(Y = k | X = x) = P(X = x | Y = k) P(Y = k) / P(X = x)
For discriminant analysis:
P(Y = k | X = x) = f_k(x) π_k / Σ_{l=1}^{K} f_l(x) π_l
- f_k(x) is the assumed density of X within class k
- π_k can simply be estimated as the fraction of observations with Y = k

Linear Discriminant Function
Given x, find the k for which P(Y = k | X = x) is largest. Since the denominator is the same for every class, only f_k(x) π_k is of interest.
Assume f_k(x) is a Gaussian density (one predictor, common variance σ²):
f_k(x) = (1 / (√(2π) σ)) exp( -(x - μ_k)² / (2σ²) )
Taking logs and discarding terms that do not involve k gives
δ_k(x) = x μ_k / σ² - μ_k² / (2σ²) + log(π_k)
This is called the linear discriminant score function.
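The score can be evaluated directly; a sketch for one predictor with invented class means, variance, and priors:

```python
import numpy as np

def lda_scores(x, mus, sigma2, priors):
    """delta_k(x) = x mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k),
    one score per class; classify x into the class with the largest score."""
    mus, priors = np.asarray(mus), np.asarray(priors)
    return x * mus / sigma2 - mus**2 / (2 * sigma2) + np.log(priors)

# Two classes with equal priors and unit variance; x = 1.5 is closer to mu = 2
d = lda_scores(x=1.5, mus=[0.0, 2.0], sigma2=1.0, priors=[0.5, 0.5])
print(int(np.argmax(d)))  # 1: classify into the second class
```

With equal priors the decision boundary sits at the midpoint of the two means (x = 1 here), which is why the score is called linear.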

Comparison Between the Logistic Model and LDA
- Logistic regression is a very popular classifier, especially for binary classification problems.
- LDA is often used when n is small, the classes are well separated, and the Gaussian assumption is reasonable; it also extends naturally to K > 2 classes.
- Both are linear methods.