Machine Learning Linear Classification. Prof. Matteo Matteucci

Recall from the first lecture 2 Regression: X ∈ R^p, Y ∈ R (continuous output). Classification: X ∈ R^p, Y ∈ {Ω_0, Ω_1, ..., Ω_K} (discrete output). Clustering: X ∈ R^p, partitions of X.

Example: Default dataset 3 Overall Default Rate 3%

Linear regression for classification? 4 o Suppose we want to predict the medical condition of a patient: how should the output be encoded? We could use a dummy variable in the case of a binary output, but how do we deal with multiple outputs? Different encodings could result in different models.

Recall here the Bayesian Classifier 5 o For a classification problem we can use the error rate, i.e., Error Rate = (1/n) Σ_{i=1..n} I(y_i ≠ ŷ_i), where I(·) is an indicator function which gives 1 if y_i ≠ ŷ_i and 0 otherwise. The error rate represents the fraction of incorrect classifications, or misclassifications o The Bayes Classifier minimizes the average test error rate by predicting, for each point x_0, the class j maximizing P(Y = j | X = x_0) o The Bayes error rate refers to the lowest possible error rate, achievable only by knowing the true distribution of the data. The best classifier possible estimates the class posterior probability!!
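These two ideas can be written compactly; a minimal sketch in Python (the arrays are made-up toy values, not the lecture's data):

```python
import numpy as np

# Error rate: fraction of misclassified points, (1/n) * sum of I(y_i != yhat_i)
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])
error_rate = np.mean(y_true != y_pred)   # 2 mistakes out of 5 -> 0.4

# Bayes classifier: for each point, pick the class j that maximizes P(Y = j | X = x0).
# Each row of `posteriors` holds the (true or estimated) class posteriors for one point.
posteriors = np.array([[0.70, 0.30],
                       [0.20, 0.80],
                       [0.55, 0.45]])
bayes_prediction = posteriors.argmax(axis=1)   # -> array([0, 1, 0])
```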

Logistic Regression 6 o We want to model the probability of the class given the input, but a linear model has some drawbacks

Example: Default data & Linear Regression 7 Overall Default Rate 3% Negative probability?

Logistic Regression 8 o We want to model the probability of the class given the input, but a linear model has some drawbacks o Logistic regression solves the negative-probability issue (and other issues as well) by regressing the logistic function

Example: Default data & Logistic Regression 9

Logistic Regression 10 o We want to model the probability of the class given the input, but a linear model has some drawbacks (see later slides): Linear Regression p(X) = β0 + β1·X o Logistic regression solves the negative-probability issue (and other issues as well) by regressing the logistic function: Logistic Regression p(X) = e^(β0 + β1·X) / (1 + e^(β0 + β1·X)). From this we derive p(X) / (1 − p(X)) = e^(β0 + β1·X); this is called the odds. Taking logarithms, log[ p(X) / (1 − p(X)) ] = β0 + β1·X; this is called the log-odds or logit.
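A small numerical check of the three equivalent forms above (the coefficient values are arbitrary, chosen only for illustration):

```python
import numpy as np

beta0, beta1 = -4.0, 2.0                  # illustrative coefficients, not fitted values
x = np.linspace(0.0, 4.0, 9)

p = np.exp(beta0 + beta1 * x) / (1.0 + np.exp(beta0 + beta1 * x))   # logistic function, always in (0, 1)
odds = p / (1.0 - p)                                                # equals exp(beta0 + beta1 * x)
logit = np.log(odds)                                                # equals beta0 + beta1 * x, linear in x

assert np.allclose(logit, beta0 + beta1 * x)
```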

Coefficient interpretation 11 o Interpreting what β1 means is not very easy with logistic regression, simply because we are predicting P(Y) and not Y. If β1 = 0, there is no relationship between Y and X. If β1 > 0, when X gets larger so does the probability that Y = 1. If β1 < 0, when X gets larger the probability that Y = 1 gets smaller. o But how much bigger or smaller depends on where we are on the slope, i.e., the relationship is not linear

Training Logistic Regression (1/4) 12 o For the basic logistic regression we need two parameters, β0 and β1 o In principle we could use (non-linear) Least Squares to fit the corresponding model on the observed data o But a more principled approach for training in classification problems is based on Maximum Likelihood: we want to find the parameters which maximize the likelihood function

Maximum Likelihood flash-back (1/6) 13 o Suppose we observe some i.i.d. samples coming from a Gaussian distribution with known variance: Which distribution do you prefer?

Maximum Likelihood flash-back (2/6) 14 o There is a simple recipe for Maximum Likelihood estimation o Let's try to apply it to our example

Maximum Likelihood flash-back (3/6) 15 o Let's try to apply it to our example 1. Write the likelihood for the data

Maximum Likelihood flash-back (4/6) 16 o Let's try to apply it to our example 2. Take the logarithm of the likelihood -> log-likelihood

Maximum Likelihood flash-back (5/6) 17 o Let's try to apply it to our example 3. Work out the derivatives using high-school calculus

Maximum Likelihood flash-back (6/6) 18 o Let's try to apply it to our example 4. Solve the resulting (unconstrained) equations
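Since the slide formulas did not survive the transcription, here is the standard derivation those four steps produce for n i.i.d. samples x_1, ..., x_n from a Gaussian with known variance σ²:

```latex
L(\mu) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}
         \exp\!\left(-\frac{(x_i-\mu)^2}{2\sigma^2}\right)
\qquad\Rightarrow\qquad
\ell(\mu) = \log L(\mu)
          = -\frac{n}{2}\log\!\left(2\pi\sigma^2\right)
            - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2

\frac{d\ell(\mu)}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
\qquad\Rightarrow\qquad
\hat{\mu}_{ML} = \frac{1}{n}\sum_{i=1}^{n} x_i
```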

Training Logistic Regression (1/4) 19 o For the basic logistic regression we need two parameters, β0 and β1 o In principle we could use (non-linear) Least Squares to fit the corresponding model on the observed data o But a more principled approach for training in classification problems is based on Maximum Likelihood: we want to find the parameters which maximize the likelihood function

Training Logistic Regression (2/4) 20 o Let's find the parameters which maximize the likelihood function ℓ(β0, β1) = Π_{i: y_i=1} p(x_i) · Π_{i: y_i=0} (1 − p(x_i)) o If we compute the log-likelihood for N observations (taken from ESL, where the notation collects the parameters as β and writes p(x; β)), we obtain a log-likelihood in the form ℓ(β) = Σ_{i=1..N} [ y_i log p(x_i; β) + (1 − y_i) log(1 − p(x_i; β)) ] = Σ_{i=1..N} [ y_i (β0 + β1 x_i) − log(1 + e^(β0 + β1 x_i)) ]. Can you derive it?
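A hedged sketch of the maximum-likelihood fit for the two-parameter model, using plain gradient ascent on the log-likelihood above (toy data; in practice one would use Newton/IRLS or a library routine):

```python
import numpy as np

def log_likelihood(beta, x, y):
    """l(beta) = sum_i [ y_i*(b0 + b1*x_i) - log(1 + exp(b0 + b1*x_i)) ]"""
    eta = beta[0] + beta[1] * x
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def fit_logistic(x, y, lr=0.01, n_iter=20000):
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x)))     # current probabilities
        grad = np.array([np.sum(y - p), np.sum((y - p) * x)])  # gradient of l(beta)
        beta += lr * grad                                       # step uphill on the likelihood
    return beta

# Toy data generated from a logistic model with beta = (0.5, 2.0)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(0.5 + 2.0 * x)))).astype(float)
beta_hat = fit_logistic(x, y)    # roughly recovers (0.5, 2.0)
```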

Training Logistic Regression (3/4) 21 o Let's find the parameters which maximize the likelihood function. The z-statistic plays the same role as the t-statistic in linear regression: a large (absolute) value means the parameter is not null. The intercept does not have a particular meaning; it is used to adjust the probabilities to the class proportions.

Example: Default data & Logistic Regression 22

Training Logistic Regression (4/4) 23 o Let's find the parameters which maximize the likelihood function. We can train the model using qualitative variables through the use of binary (dummy) variables.

Making predictions with Logistic Regression 24 o Once we have the model parameters we can predict the class o The Default probability with a balance of 1000$ is < 1%, while with a balance of 2000$ it becomes 58.6% (the arithmetic is reproduced in the sketch below) o With qualitative variables, i.e., dummy variables, we get that being a student results in a higher predicted default probability than being a non-student
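To make the arithmetic behind those numbers explicit, the balance-only fit reported in the textbook has roughly β0 ≈ -10.65 and β1 ≈ 0.0055; these values are assumed here for illustration:

```python
import numpy as np

beta0, beta1 = -10.6513, 0.0055   # assumed textbook estimates for the balance-only Default model

def default_probability(balance):
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * balance)))

default_probability(1000)   # ~0.006, i.e. < 1%
default_probability(2000)   # ~0.586, i.e. 58.6%
```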

Multiple Logistic Regression 25 o So far we have considered only one predictor, but we can extend the approach to multiple regressors o By maximum likelihood we learn the corresponding parameters. What about this?

An apparent contradiction 26 The student coefficient is positive in the simple model but negative in the multiple model!!!

Example: Confounding in Default data set 27

Example: South African Heart Disease 28 Taken from ESL

Logistic Regression for Feature Selection 29 o If we fit the complete model on these data we get the full coefficient table (taken from ESL) o While if we use stepwise Logistic Regression we get a reduced set of coefficients (taken from ESL)

Logistic Regression parameters meaning 30 o Regression parameters represent the increment of the logit (log-odds) of the probability given by a unit increment of a variable o Consider an increase in lifetime tobacco consumption of 1 kg: this accounts for an increase in log-odds of 0.081, i.e., the odds are multiplied by exp(0.081) = 1.084, which means an overall increase of 8.4% o With an approximate 95% confidence interval of exp(0.081 ± 2·0.026) = (1.03, 1.14) (taken from ESL)
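The arithmetic written out (the coefficient 0.081 is the ESL estimate quoted above; the standard error of about 0.026 used for the interval comes from the same table and should be treated as an assumed value here):

```python
import numpy as np

coef, se = 0.081, 0.026                               # tobacco coefficient and its (assumed) standard error
odds_ratio = np.exp(coef)                             # ~1.084 -> ~8.4% increase in the odds per extra kg
ci = (np.exp(coef - 2 * se), np.exp(coef + 2 * se))   # approximate 95% interval, ~ (1.03, 1.14)
```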

Regularized Logistic Regression 31 o As for Linear Regression we can compute a Lasso version, maximizing the L1-penalized log-likelihood max_β { Σ_i [ y_i (β0 + β^T x_i) − log(1 + e^(β0 + β^T x_i)) ] − λ Σ_j |β_j| }

Regularized Logistic Regression 32 o As for Linear Regression we can compute a Lasso version Taken from ESL
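A minimal sketch of a lasso (L1-penalized) logistic fit using scikit-learn, assuming that library is available; the data are synthetic and only two of the ten predictors actually matter:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))                                        # 10 candidate predictors
y = (X[:, 0] - 2 * X[:, 1] + rng.normal(size=500) > 0).astype(int)    # only the first two are relevant

# C is the inverse of the regularization strength lambda: smaller C -> stronger penalty, sparser model
lasso_logit = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
lasso_logit.fit(X, y)
lasso_logit.coef_    # most coefficients are shrunk toward (or exactly to) zero
```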

Multiclass Logistic Regression 33 o Logistic Regression extends naturally to multiclass problems by computing the log-odds w.r.t. the K-th class: log[ P(G = k | X = x) / P(G = K | X = x) ] = β_{k0} + β_k^T x, for k = 1, ..., K−1 (comes from ESL, but it's worth knowing!!! The notation is different because it comes from ESL) o This is equivalent to P(G = k | X = x) = e^(β_{k0} + β_k^T x) / (1 + Σ_{l=1..K−1} e^(β_{l0} + β_l^T x)), with P(G = K | X = x) = 1 / (1 + Σ_{l=1..K−1} e^(β_{l0} + β_l^T x)) o Can you prove it?!?!?
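Given the K−1 coefficient vectors above, the posterior probabilities can be recovered directly; a sketch with made-up coefficient values for K = 3 classes and p = 2 predictors:

```python
import numpy as np

def multiclass_probs(x, B):
    """B has shape (K-1, p+1): row k holds (beta_k0, beta_k), the log-odds of class k vs. class K."""
    eta = B[:, 0] + B[:, 1:] @ x          # log-odds of classes 1..K-1 w.r.t. the K-th class
    num = np.append(np.exp(eta), 1.0)     # the reference class K contributes exp(0) = 1
    return num / num.sum()                # posterior probabilities, summing to 1

B = np.array([[ 0.2, 1.0, -0.5],
              [-0.1, 0.3,  0.8]])         # illustrative values only
multiclass_probs(np.array([1.0, 2.0]), B)
```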

Wrap-up on Logistic Regression 34 o We model the log-odds as a linear regression model o This means the posterior probability becomes a logistic function of the linear predictor o Parameters represent the log-odds increase per unit increment of a variable, keeping the others fixed o We can use it to perform feature selection using z-scores and forward stepwise selection o The class decision boundary is linear, but points close to the boundary count more; this will be discussed later

Beyond Logistic Regression 35 o Logistic Regression models the class posterior probability directly o Linear Discriminant Analysis instead uses the Bayes Theorem o What improvements come with this model? Parameter learning in Logistic Regression is unstable for well-separated classes. With little data and normally distributed predictors, LDA is more stable. It is also a very popular algorithm with more than 2 response classes.

Linear Discriminant Analysis (1/3) 36 o Suppose we want to discriminate among K ≥ 2 classes o Each class has a prior probability π_k o Given the class, we model the density function of the predictors as f_k(x) = P(X = x | Y = k) o Using the Bayes Theorem we obtain P(Y = k | X = x) = π_k f_k(x) / Σ_{l=1..K} π_l f_l(x). The prior probability is relatively simple to learn; the likelihood might be more tricky and we need some assumptions to simplify it o If we correctly estimate the likelihood we obtain the Bayes Classifier, i.e., the one with the smallest error rate!

Linear Discriminant Analysis (2/3) 37 o Let's assume p = 1 and use a Gaussian distribution for each class, f_k(x) = (1 / (√(2π) σ_k)) exp(−(x − μ_k)² / (2σ_k²)) o Let's assume all classes have the same variance σ² o The posterior probability as computed by LDA becomes P(Y = k | X = x) = π_k f_k(x) / Σ_l π_l f_l(x) o The selected class is the one with the highest posterior, which is equivalent to the highest discriminant function δ_k(x) = x·μ_k/σ² − μ_k²/(2σ²) + log π_k (can you derive this?), a linear discriminant function in x

Linear Discriminant Analysis (3/3) 38 o With 2 classes having the same prior probability we decide for class 1 when 2x(μ_1 − μ_2) > μ_1² − μ_2² o The Bayes decision boundary corresponds to x = (μ_1 + μ_2)/2 o Training is as simple as estimating the model parameters: μ̂_k is the sample mean of class k, σ̂² is the pooled within-class variance, and π̂_k = n_k/n
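A sketch of 1-D LDA with two classes on toy Gaussian data: estimate the class means, the pooled variance and the priors, then pick the class with the larger discriminant δ_k(x):

```python
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.normal(-1.0, 1.0, size=100)    # class 0 samples
x1 = rng.normal(+1.0, 1.0, size=100)    # class 1 samples

# Parameter estimates
mu = np.array([x0.mean(), x1.mean()])
pi = np.array([len(x0), len(x1)]) / (len(x0) + len(x1))
sigma2 = (np.sum((x0 - mu[0])**2) + np.sum((x1 - mu[1])**2)) / (len(x0) + len(x1) - 2)  # pooled variance

def delta(x, k):
    """delta_k(x) = x * mu_k / sigma^2 - mu_k^2 / (2 sigma^2) + log(pi_k)"""
    return x * mu[k] / sigma2 - mu[k]**2 / (2 * sigma2) + np.log(pi[k])

x_new = 0.3
predicted = int(delta(x_new, 1) > delta(x_new, 0))
# With equal priors the decision boundary sits at (mu_0 + mu_1) / 2, close to 0 here.
```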

LDA Simple Example (with p=1) 39

Linear Discriminant Analysis with p>1 (1/3) 40 o In case p>1 we assume X = (X_1, ..., X_p) comes from a multivariate Gaussian distribution N(μ, Σ)

Linear Discriminant Analysis with p>1 (2/3) 41 o In the case of p>1 the LDA classifier assumes that observations from the k-th class are drawn from N(μ_k, Σ) and that the covariance structure Σ is common to all classes o The Bayes discriminant function becomes δ_k(x) = x^T Σ^{-1} μ_k − ½ μ_k^T Σ^{-1} μ_k + log π_k (can you derive this?), still linear in x!!! o From this we can compute the boundary between each pair of classes (considering the two classes having the same prior probability) o Training formulas for the LDA parameters are similar to the case of p=1
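The multivariate discriminant can be coded in a few lines; the class means, shared covariance and priors below are illustrative values, not estimates from any dataset:

```python
import numpy as np

def lda_discriminant(x, mu_k, Sigma_inv, pi_k):
    """delta_k(x) = x^T Sigma^-1 mu_k - 0.5 * mu_k^T Sigma^-1 mu_k + log(pi_k), linear in x."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

Sigma_inv = np.linalg.inv(np.array([[1.0, 0.3],
                                    [0.3, 1.0]]))          # shared covariance (assumed)
mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]         # class means (assumed)
pis = [0.5, 0.5]                                           # equal priors

x = np.array([1.5, 0.5])
scores = [lda_discriminant(x, m, Sigma_inv, p) for m, p in zip(mus, pis)]
predicted_class = int(np.argmax(scores))
```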

Linear Discriminant Analysis Example 42

Example: LDA on the Default Dataset 43 o LDA on the Default dataset gets a 2.75% training error rate. Having 10000 records and p=3 we do not expect much overfitting (by the way, how many parameters do we have?). Since defaulters are 3.33% of the records, a dummy classifier which always predicts "no default" would get a similar error rate.

On the performance of LDA 44 o Errors in classification are often reported as a Confusion Matrix. Sensitivity: percentage of true defaulters correctly identified. Specificity: percentage of non-defaulters correctly identified. o The Bayes classifier optimizes the overall error rate, regardless of the class the errors belong to, and it does this by thresholding the posterior probability at 50% o Can we improve on this?
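Sensitivity and specificity fall directly out of the confusion-matrix counts; a sketch with made-up counts in the spirit of the defaulters example:

```python
# Confusion-matrix counts (made-up numbers), true class by predicted class
TN, FP = 9600, 70     # true non-defaulters: predicted "no" / predicted "yes"
FN, TP = 230, 100     # true defaulters:     predicted "no" / predicted "yes"

sensitivity = TP / (TP + FN)                      # fraction of true defaulters caught
specificity = TN / (TN + FP)                      # fraction of non-defaulters correctly identified
overall_error = (FP + FN) / (TN + FP + FN + TP)   # what the default 50% threshold optimizes
```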

Example: Increasing LDA Sensitivity 45 o We might want to improve classifier sensitivity with respect to a given class because we consider it more critical. Lowering the threshold reduces the error rate on defaulters from 75.7% to 41.4%, while the overall error rate increases to 3.73% (but it is worth it).

Tweaking LDA sensitivity 46 The right choice of threshold comes from domain knowledge

ROC Curve 47 o The ROC (Receiver Operating Characteristic) curve summarizes false positive and false negative errors o It is obtained by testing all possible thresholds. Overall performance is given by the Area Under the ROC Curve (AUC): a classifier which randomly guesses (with two classes) has AUC = 0.5, a perfect classifier has AUC = 1 o The ROC curve considers true positive and false positive rates: sensitivity is equivalent to the true positive rate, specificity is equivalent to 1 − false positive rate
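A minimal sketch of how a ROC curve and its AUC can be computed by sweeping the threshold over the predicted scores (synthetic scores and labels, numpy only):

```python
import numpy as np

def roc_points(scores, labels):
    """True-positive and false-positive rates for every possible threshold."""
    thresholds = np.concatenate(([np.inf], np.sort(np.unique(scores))[::-1], [-np.inf]))
    tpr, fpr = [], []
    for t in thresholds:
        pred = scores >= t
        tpr.append(np.sum(pred & (labels == 1)) / np.sum(labels == 1))
        fpr.append(np.sum(pred & (labels == 0)) / np.sum(labels == 0))
    return np.array(fpr), np.array(tpr)

rng = np.random.default_rng(3)
labels = rng.integers(0, 2, size=200)
scores = 0.3 * labels + 0.7 * rng.uniform(size=200)   # noisy but informative scores
fpr, tpr = roc_points(scores, labels)
auc = np.trapz(tpr, fpr)    # ~0.5 for random guessing, 1.0 for a perfect classifier
```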

ROC Curve for LDA on Default data 48 The curve's endpoints correspond to the extreme thresholds: all subjects classified as defaulters vs. no subject classified as a defaulter. The ROC Curve for Logistic Regression is ~ the same.

Clearing out terminology 49 o When applying a classifier we can obtain True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN) o Out of these we can define the following: true positive rate (sensitivity, recall) = TP/(TP+FN), false positive rate = FP/(FP+TN), specificity = TN/(TN+FP), precision (positive predictive value) = TP/(TP+FP)

Quadratic Discriminant Analysis 50 o Linear Discriminant Analysis assumes all classes have a common covariance structure o Quadratic Discriminant Analysis assumes a different covariance Σ_k for each class o Under this hypothesis the Bayes discriminant function becomes δ_k(x) = −½ (x − μ_k)^T Σ_k^{-1} (x − μ_k) − ½ log|Σ_k| + log π_k (can you derive this?), a quadratic function of x o The decision LDA vs. QDA boils down to a bias-variance trade-off: QDA requires K·p(p+1)/2 covariance parameters while LDA only needs p(p+1)/2 for the single shared covariance
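The quadratic discriminant differs from the linear one only in the per-class covariance and the extra log-determinant term; a sketch with illustrative parameter values:

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """delta_k(x) = -0.5*(x-mu_k)^T Sigma_k^-1 (x-mu_k) - 0.5*log|Sigma_k| + log(pi_k), quadratic in x."""
    diff = x - mu_k
    return (-0.5 * diff @ np.linalg.inv(Sigma_k) @ diff
            - 0.5 * np.log(np.linalg.det(Sigma_k))
            + np.log(pi_k))

# Two classes with different covariance structures (illustrative values only)
params = [(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
          (np.array([1.5, 1.0]), np.array([[2.0, 0.8], [0.8, 1.0]]), 0.5)]

x = np.array([1.0, 0.5])
predicted_class = int(np.argmax([qda_discriminant(x, m, S, p) for m, S, p in params]))
```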

QDA vs LDA Example 51

Which classifier is better? 52 o Let's consider 2 classes and 1 predictor. It can be seen that for LDA the log-odds are log[ p_1(x) / (1 − p_1(x)) ] = c_0 + c_1 x, with c_0 and c_1 functions of the estimated μ_1, μ_2, σ² and priors, while for Logistic Regression the log-odds are β_0 + β_1 x. Both are linear functions of x, but the learning procedures are different o Linear Discriminant Analysis is the Optimal Bayes classifier if its hypotheses hold, otherwise Logistic Regression can outperform it! o Quadratic Discriminant Analysis is to be preferred if the class covariances are different and we have a non-linear boundary

Some scenarios tested 53 o Linear Boundary Scenarios: 1. Samples from 2 uncorrelated normal distributions 2. Samples from 2 slightly correlated normal distributions 3. Samples from Student's t-distributed classes

Some other scenarios tested 54 o Non-Linear Boundary Scenarios: 4. Samples from 2 normal distributions with different correlations 5. Samples from 2 normals, with the response generated from quadratic functions of the predictors 6. As the previous one but with a more complicated function

Overall conclusion on the comparison 55 o No method is better than all the others. If the decision boundary is linear, then LDA and Logistic Regression are the ones performing better. When the decision boundary is moderately non-linear, QDA may give better results. For much more complex decision boundaries, non-parametric approaches such as KNN perform better, but the right level of smoothness has to be chosen.