Lecture 5: LDA and Logistic Regression

Hao Helen Zhang

Outline
Two popular linear models for classification:
- Linear Discriminant Analysis (LDA)
- Logistic regression models
Take-home message: both LDA and logistic regression rely on the linear log-odds assumption, directly or indirectly. However, they estimate the coefficients in a different manner.

Linear Classifiers
Linear methods: the decision boundary is linear. Common linear classification methods:
- Linear regression methods (covered in Lecture 3)
- Linear log-odds (logit) models: linear logistic models and linear discriminant analysis (LDA)
- Separating hyperplanes (introduced later): the perceptron model (Rosenblatt, 1958), the optimal separating hyperplane (Vapnik, 1996), SVMs
From now on, we assume equal costs (by default).

Odds, Logit, and Linear Odds Models
Some terminology:
- The term Pr(Y=1|X=x) / Pr(Y=0|X=x) is called the odds.
- The log of the odds, log[Pr(Y=1|X=x) / Pr(Y=0|X=x)], is called the logit.
Linear odds models assume that the logit is linear in x, i.e.,

  log[Pr(Y=1|X=x) / Pr(Y=0|X=x)] = β_0 + β_1^T x.

Examples: LDA, logistic regression.

Classifiers Based on Linear-Odds Models
From the linear log-odds, we can obtain the posterior class probabilities

  Pr(Y=1|x) = exp(β_0 + β_1^T x) / [1 + exp(β_0 + β_1^T x)],
  Pr(Y=0|x) = 1 / [1 + exp(β_0 + β_1^T x)].

Assuming equal costs, the decision boundary is

  {x : Pr(Y=1|x) = 0.5} = {x : β_0 + β_1^T x = 0},

which can be interpreted as the set of points with zero log-odds.
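As a small numeric illustration (a sketch with made-up coefficients, not values from the lecture), the posterior probability and the zero-log-odds rule can be computed directly in R:

# Hypothetical coefficients and a single observation, for illustration only
beta0 <- -1
beta1 <- c(2, 0.5)
x <- c(0.3, 0.8)

eta <- beta0 + sum(beta1 * x)    # linear log-odds beta_0 + beta_1^T x
p1 <- plogis(eta)                # Pr(Y = 1 | x) = exp(eta) / (1 + exp(eta))
yhat <- as.integer(p1 > 0.5)     # same decision as checking eta > 0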

[Figure: the logit function g(s) = log[s/(1 - s)] (left panel) and the logistic curve p(t) = exp(t)/(1 + exp(t)) (right panel).]

Linear Discriminant Analysis (LDA)
LDA assumes:
- Each class density is multivariate Gaussian, i.e., X | Y = j ~ N(µ_j, Σ_j), j = 0, 1.
- Equal covariances: Σ_j = Σ, j = 0, 1.
In other words, both class densities are Gaussian and they share the same covariance matrix.

Linear Discriminant Function
Under the Gaussian mixture assumption, the log-odds is

  log[Pr(Y=1|X=x) / Pr(Y=0|X=x)]
    = log(π_1/π_0) - (1/2)(µ_1 + µ_0)^T Σ^{-1}(µ_1 - µ_0) + x^T Σ^{-1}(µ_1 - µ_0).

Under equal costs, LDA classifies x to class 1 if and only if

  log(π_1/π_0) - (1/2)(µ_1 + µ_0)^T Σ^{-1}(µ_1 - µ_0) + x^T Σ^{-1}(µ_1 - µ_0) > 0.

The boundary {x : β_0 + x^T β_1 = 0} is linear, with

  β_0 = log(π_1/π_0) - (1/2)(µ_1 + µ_0)^T Σ^{-1}(µ_1 - µ_0),
  β_1 = Σ^{-1}(µ_1 - µ_0).
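A minimal sketch of these formulas in R, using hypothetical values of π_j, µ_j, and Σ (the concrete numbers are assumptions for illustration, not part of the lecture):

# Hypothetical parameters for a two-class Gaussian problem
pi0 <- 0.5; pi1 <- 0.5
mu0 <- c(-1, -1); mu1 <- c(1, 1)
Sigma <- 4 * diag(2)

Sigma_inv <- solve(Sigma)
beta1 <- Sigma_inv %*% (mu1 - mu0)    # LDA direction
beta0 <- log(pi1 / pi0) - 0.5 * t(mu1 + mu0) %*% Sigma_inv %*% (mu1 - mu0)

# Classify x to class 1 iff beta0 + x^T beta1 > 0
x <- c(0.5, 0.2)
as.integer(drop(beta0 + t(x) %*% beta1) > 0)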

Parameter Estimation in LDA
In practice, π_1, π_0, µ_0, µ_1, Σ are unknown. We estimate the parameters from the training data, using MLE or moment estimators:
- π̂_j = n_j / n, where n_j is the size of class j.
- µ̂_j = Σ_{Y_i = j} x_i / n_j, for j = 0, 1.
- The within-class sample covariance matrix is

    S_j = [1 / (n_j - 1)] Σ_{Y_i = j} (x_i - µ̂_j)(x_i - µ̂_j)^T.

- The (unbiased) pooled sample covariance is the weighted average

    Σ̂ = (n_0 - 1)/[(n_0 - 1) + (n_1 - 1)] S_0 + (n_1 - 1)/[(n_0 - 1) + (n_1 - 1)] S_1
       = Σ_{j=0}^{1} Σ_{Y_i = j} (x_i - µ̂_j)(x_i - µ̂_j)^T / (n - 2).
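These plug-in estimates are easy to compute by hand; here is a sketch (the function name and the assumption that x is an n × d matrix and y a 0/1 vector are mine, not from the lecture):

lda_estimates <- function(x, y) {
  n0 <- sum(y == 0); n1 <- sum(y == 1); n <- n0 + n1
  mu0 <- colMeans(x[y == 0, , drop = FALSE])
  mu1 <- colMeans(x[y == 1, , drop = FALSE])
  S0 <- cov(x[y == 0, , drop = FALSE])   # within-class sample covariances
  S1 <- cov(x[y == 1, , drop = FALSE])
  Sigma_hat <- ((n0 - 1) * S0 + (n1 - 1) * S1) / (n - 2)   # pooled covariance
  list(pi0 = n0 / n, pi1 = n1 / n, mu0 = mu0, mu1 = mu1, Sigma = Sigma_hat)
}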

R code for Fitting LDA (I)
There are two ways to call the function lda. The first way is to use a formula and an optional data frame:

library(MASS)
lda(formula, data, subset)

Arguments:
- formula: of the form groups ~ x1 + x2 + ..., where the response is the grouping factor and the right-hand side specifies the (non-factor) discriminators.
- data: the data frame from which the variables specified in the formula are taken.
- subset: an index vector specifying the cases to be used in the training sample.
Output: an object of class "lda" with multiple components.

R code for Fitting LDA (II)
The second way is to use a matrix and a grouping factor as the first two arguments:

library(MASS)
lda(x, grouping, prior = proportions, CV = FALSE)

Arguments:
- x: a matrix or data frame containing the predictors.
- grouping: a factor specifying the class of each observation.
- prior: the prior probabilities of class membership. If unspecified, the class proportions of the training set are used.
Output: if CV = TRUE, the return value is a list with components class (the MAP classification, a factor) and posterior (the posterior probabilities for the classes).

R code for LDA Prediction
We use the predict (or predict.lda) function to classify multivariate observations with a fitted lda object:

predict(object, newdata, ...)

Arguments:
- object: an object of class "lda".
- newdata: a data frame of cases to be classified or, if the object was fitted with a formula, a data frame with columns of the same names as the variables used.
Output: a list with components class (the MAP classification, a factor) and posterior (the posterior probabilities for the classes).

Fisher's Iris Data (a Three-Class Classification Problem)
Fisher (1936), "The use of multiple measurements in taxonomic problems."
- Three species: Iris setosa, Iris versicolor, Iris virginica.
- Four features: the length and the width of the sepals and petals, in centimeters.
- 50 samples from each species.
In the following analysis, we randomly select 50% of the data points as the training set and use the rest as the test set.


Illustration 1

Iris <- data.frame(rbind(iris3[,,1], iris3[,,2], iris3[,,3]),
                   Sp = rep(c("s","c","v"), rep(50,3)))
train <- sample(1:150, 75)
table(Iris$Sp[train])
z <- lda(Sp ~ ., Iris, prior = c(1,1,1)/3, subset = train)

Training Error and Test Error

# training error rate
ytrain <- predict(z, Iris[train, ])$class
table(ytrain, Iris$Sp[train])
train_err <- mean(ytrain != Iris$Sp[train])

# test error rate
ytest <- predict(z, Iris[-train, ])$class
table(ytest, Iris$Sp[-train])
test_err <- mean(ytest != Iris$Sp[-train])

Illustration 2

tr <- sample(1:50, 25)
train <- rbind(iris3[tr,,1], iris3[tr,,2], iris3[tr,,3])
test <- rbind(iris3[-tr,,1], iris3[-tr,,2], iris3[-tr,,3])
cl <- factor(c(rep("s",25), rep("c",25), rep("v",25)))
z <- lda(train, cl)
ytest <- predict(z, test)$class

Logistic Regression
Model assumption: the log-odds is linear in x,

  log[Pr(Y=1|X=x) / Pr(Y=0|X=x)] = β_0 + β_1^T x.

Define

  p(x; β) = exp(β_0 + β_1^T x) / [1 + exp(β_0 + β_1^T x)]

and write β = (β_0, β_1). The posterior class probabilities are

  p_1(x) = Pr(Y=1|x) = p(x; β),
  p_0(x) = Pr(Y=0|x) = 1 - p(x; β).

The classification boundary is {x : β_0 + β_1^T x = 0}.

Model Interpretation
Denote µ = E(Y|X) = Pr(Y=1|X). By assuming

  g(µ) = log[µ / (1 - µ)] = β_0 + β_1^T X,

the logit g connects the mean µ with the linear predictor β_0 + β_1^T X. We call g the link function. Note that Var(Y|X) = µ(1 - µ).

Interpretation of β_j
In logistic regression, the odds ratio associated with X_j is

  odds_{X_j} = odds(..., X_j = x + 1, ...) / odds(..., X_j = x, ...) = e^{β_j}.

If X_j takes values 0 or 1, then the odds for the group with X_j = 1 are e^{β_j} times those for the group with X_j = 0, with the other parameters fixed.
For rare diseases, when the incidence is rare, Pr(Y = 0) ≈ 1, so the odds approximate the risk Pr(Y = 1) and

  e^{β_j} = odds_{X_j} ≈ Pr(..., X_j = x + 1, ...) / Pr(..., X_j = x, ...).

For example, in a cancer study a log odds ratio of 5 means that smokers are e^5 ≈ 150 times more likely to develop the cancer.
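A quick sketch in R (using the built-in mtcars data purely as a stand-in example, not data from the lecture): exponentiating the fitted coefficients gives the estimated odds ratios.

# am (0/1 transmission type) regressed on weight and horsepower
fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
exp(coef(fit))   # multiplicative change in the odds per one-unit increase in each predictor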

Maximum Likelihood Estimation (MLE) for Logistic Models
The joint conditional log-likelihood of the y_i given the x_i is

  l(β) = Σ_{i=1}^{n} log p_{y_i}(x_i; β),

where p_y(x; β) = p(x; β)^y [1 - p(x; β)]^{1-y}. In detail,

  l(β) = Σ_{i=1}^{n} { y_i log p(x_i; β) + (1 - y_i) log[1 - p(x_i; β)] }
       = Σ_{i=1}^{n} { y_i (β_0 + β_1^T x_i) - log[1 + exp(β_0 + β_1^T x_i)] }.
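The second expression for l(β) translates directly into R; a small sketch, assuming X already contains a leading column of 1's so that β_0 is the first entry of beta:

loglik <- function(beta, X, y) {
  eta <- drop(X %*% beta)            # beta_0 + beta_1^T x_i for each i
  sum(y * eta - log(1 + exp(eta)))   # matches the formula above
}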

Score Equations
For simplicity, assume each x_i has 1 as its first component. Setting the derivative of the log-likelihood to zero gives the score equations

  ∂l(β)/∂β = Σ_{i=1}^{n} x_i [y_i - p(x_i; β)] = 0,

a total of (d + 1) nonlinear equations. The first equation,

  Σ_{i=1}^{n} y_i = Σ_{i=1}^{n} p(x_i; β),

says that the expected number of 1's equals the observed number in the sample.
Notation used in the algorithm below:

  y = [y_1, ..., y_n]^T,
  p = [p(x_1; β^old), ..., p(x_n; β^old)]^T,
  W = diag{ p(x_i; β^old)(1 - p(x_i; β^old)) }.

Newton-Raphson Algorithm
The second-derivative (Hessian) matrix is

  ∂²l(β)/∂β∂β^T = - Σ_{i=1}^{n} x_i x_i^T p(x_i; β)[1 - p(x_i; β)].

1. Choose an initial value β^0.
2. Update β by

  β^new = β^old - [∂²l(β)/∂β∂β^T]^{-1} ∂l(β)/∂β, with both derivatives evaluated at β^old.

In matrix notation,

  ∂l(β)/∂β = X^T (y - p),
  ∂²l(β)/∂β∂β^T = - X^T W X.

Iteratively Re-weighted Least Squares (IRLS)
The Newton-Raphson step can be written as

  β^new = β^old + (X^T W X)^{-1} X^T (y - p)
        = (X^T W X)^{-1} X^T W [X β^old + W^{-1}(y - p)]
        = (X^T W X)^{-1} X^T W z,

where we define the adjusted response

  z = X β^old + W^{-1}(y - p).

So we repeatedly solve a weighted least squares problem until convergence:

  β^new = argmin_β (z - X β)^T W (z - X β).

The weight matrix W, the adjusted response z, and p change at each iteration. The algorithm can be generalized to the K ≥ 3 case.
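A minimal IRLS sketch for binary logistic regression (no step-size control or other safeguards; the function name and the use of the built-in mtcars data for the check are my own, not from the lecture):

irls_logistic <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))                 # beta = 0 is the usual starting value
  for (it in 1:maxit) {
    p <- plogis(drop(X %*% beta))         # fitted probabilities p(x_i; beta)
    W <- diag(p * (1 - p))                # weight matrix
    z <- X %*% beta + solve(W, y - p)     # adjusted response
    beta_new <- solve(t(X) %*% W %*% X, t(X) %*% W %*% z)   # weighted LS step
    if (max(abs(beta_new - beta)) < tol) { beta <- beta_new; break }
    beta <- beta_new
  }
  drop(beta)
}

# Sanity check against glm() on built-in data
X <- cbind(1, mtcars$wt, mtcars$hp)
y <- mtcars$am
round(cbind(irls = irls_logistic(X, y),
            glm  = coef(glm(am ~ wt + hp, data = mtcars, family = binomial))), 4)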

Quadratic Approximations
The MLE β̂ satisfies a self-consistency relationship: it solves a weighted least squares fit with response

  z_i = x_i^T β̂ + (y_i - p̂_i) / [p̂_i (1 - p̂_i)]

and weights w_i = p̂_i (1 - p̂_i).
- β = 0 seems to be a good starting value.
- Typically the algorithm converges, but convergence is never guaranteed.
- The weighted residual sum of squares is the Pearson chi-square statistic

    Σ_{i=1}^{n} (y_i - p̂_i)² / [p̂_i (1 - p̂_i)],

  a quadratic approximation to the deviance.

Statistical Inferences
Using the weighted least squares formulation, we have:
- Asymptotic likelihood theory says that if the model is correct, then β̂ is consistent.
- By the central limit theorem, the distribution of β̂ converges to N(β, (X^T W X)^{-1}).
Model building is costly because of the iterations; popular shortcuts:
- For inclusion of a term, use the Rao score test.
- For exclusion of a term, use the Wald test.
Neither of these two tests requires iterative fitting; both are based on the maximum likelihood fit of the current model.
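A sketch checking the asymptotic covariance formula against the standard errors reported by summary(glm) (again using mtcars as a stand-in data set):

fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
X <- model.matrix(fit)
p <- fitted(fit)
W <- diag(p * (1 - p))
se_asym <- sqrt(diag(solve(t(X) %*% W %*% X)))   # from N(beta, (X^T W X)^{-1})
cbind(se_asym, se_glm = summary(fit)$coefficients[, "Std. Error"])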

R Code for Logistic Regression

logist <- glm(formula, family, data, subset, ...)
predict(logist)

Arguments:
- formula: an object of class "formula".
- family: a description of the error distribution and link function to be used in the model.
- data: an optional data frame, list, or environment containing the variables in the model.
- subset: an optional vector specifying a subset of observations to be used in the fitting process.
Output: an object of class "glm", which inherits from class "lm".

R Code for Logistic Prediction

logist <- glm(y ~ x, data, family = binomial(link = "logit"))
predict(logist, newdata)
summary(logist)
anova(logist)

- predict gives the predicted values (on the link scale by default); newdata is a data frame of new cases.
- summary is used to obtain or print a summary of the results.
- anova produces an analysis of deviance table.
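One practical note, shown in the short sketch below: predict() for a glm object returns values on the linear-predictor (log-odds) scale by default, while type = "response" gives probabilities. (The model and new data here are stand-ins, not from the lecture.)

fit <- glm(am ~ wt, data = mtcars, family = binomial(link = "logit"))
newd <- data.frame(wt = c(2.5, 3.5))
predict(fit, newd)                      # log-odds scale (default type = "link")
predict(fit, newd, type = "response")   # probability scale, Pr(Y = 1 | x)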

Relationship between LDA and Least Squares (LS)
For two-class problems, both LDA and least squares fit a linear boundary β_0 + β^T x. Their solutions are related as follows:
- The least squares regression coefficient β̂ is proportional to the LDA direction, i.e.,

    β̂ ∝ Σ̂^{-1}(µ̂_1 - µ̂_0),

  where Σ̂ is the pooled sample covariance matrix and µ̂_k is the sample mean of the points from class k, k = 0, 1. In other words, their slope coefficients are identical up to a scalar multiple (Exercise 4.2 in the textbook; see the numerical check below).
- The LS intercept β̂_0 is generally different from that of LDA, unless n_1 = n_0.
In general, they have different decision rules (unless n_1 = n_0).
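The proportionality is easy to verify numerically; a sketch on simulated data (the simulation itself is an assumption for illustration):

set.seed(1)
n <- 200
y <- rep(0:1, each = n / 2)
x <- matrix(rnorm(2 * n), n, 2) + 2 * y           # shift the class-1 means
ls_slope <- coef(lm(y ~ x))[-1]                   # least squares on the 0/1 labels
mu0 <- colMeans(x[y == 0, ]); mu1 <- colMeans(x[y == 1, ])
Sigma_hat <- (n/2 - 1) * (cov(x[y == 0, ]) + cov(x[y == 1, ])) / (n - 2)   # pooled
lda_dir <- solve(Sigma_hat, mu1 - mu0)            # Sigma^{-1}(mu1 - mu0)
ls_slope / lda_dir                                # (essentially) one constant ratio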

Common Feature of LDA and Logistic Regression Models
For both LDA and logistic regression, the logit has a linear form:

  log[Pr(Y=1|x) / Pr(Y=0|x)] = β_0 + β_1^T x.

Equivalently, for both estimators the posterior class probability can be expressed as

  Pr(Y=1|x) = exp(β_0 + β_1^T x) / [1 + exp(β_0 + β_1^T x)].

They have exactly the same form. Are they the same estimator?

Major Differences between LDA and Logistic Regression
The main differences between the two estimators:
- Where does the linear logit come from? What assumptions are made on the data distribution?
- How are the linear coefficients estimated? (difference in parameter estimation)
- Is any assumption made on the marginal density of X? (flexibility of the model)

Where Does the Linear Logit Come From?
For LDA, the linear logit follows from the equal-covariance Gaussian assumption on the data:

  log[Pr(Y=1|x) / Pr(Y=0|x)]
    = log(π_1/π_0) - (1/2)(µ_1 + µ_0)^T Σ^{-1}(µ_1 - µ_0) + x^T Σ^{-1}(µ_1 - µ_0)
    = β_0 + β_1^T x.

For the logistic model, the linear logit holds by construction:

  log[Pr(Y=1|x) / Pr(Y=0|x)] = β_0 + β_1^T x.

Difference in the Marginal Density Assumption
The assumptions on Pr(X):
- The logistic model leaves the marginal density of X arbitrary and unspecified.
- The LDA model assumes the Gaussian mixture density

    Pr(X) = Σ_{j=0}^{1} π_j φ(x; µ_j, Σ).

Conclusion: the logistic model makes fewer assumptions about the data and hence is more general.

Difference in Parameter Estimation
Logistic regression:
- Maximizes the conditional likelihood, the multinomial likelihood with probabilities Pr(Y = k | X).
- The marginal density Pr(X) is totally ignored (implicitly treated fully nonparametrically, using the empirical distribution function which places mass 1/n at each observation).
LDA:
- Maximizes the full log-likelihood, based on the joint density Pr(X, Y = j) = φ(x; µ_j, Σ) π_j.
- Standard MLE theory leads to the estimators µ̂_j, Σ̂, π̂_j.
- The marginal density does play a role.

More Comments
- LDA is easier to compute than logistic regression.
- If the true class densities f_k(x) are Gaussian, LDA is better. Logistic regression may lose around 30% efficiency asymptotically in the error rate (Efron, 1975).
- Robustness? LDA uses all the points to estimate the covariance matrix: more information, but not robust against outliers. Logistic regression down-weights points far from the decision boundary (recall the weight p_i(1 - p_i)): more robust and safer.
- In practice, these two methods often give similar results (for approximately normally distributed data).

Two-dimensional Linear Example
Consider the following scenario for a two-class problem with π_1 = π_0 = 0.5:
- Class 1: X ~ N_2((1, 1)^T, 4I)
- Class 2: X ~ N_2((-1, -1)^T, 4I)
The Bayes boundary is {x : x_1 + x_2 = 0}. We generate
- a training set of n = 200 or 400 to fit each classifier,
- a test set of n = 2000 to evaluate its prediction performance.
A sketch of the simulation follows below.
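The following R code is my own implementation of the setup above (seed, helper names, and the 0/1 class coding are assumptions), comparing LDA, logistic regression, and the Bayes rule on one simulated training/test split:

library(MASS)
set.seed(2025)
gen <- function(n) {
  y <- rbinom(n, 1, 0.5)                      # pi_1 = pi_0 = 0.5
  mu <- ifelse(y == 1, 1, -1)                 # mean (1,1) for class 1, (-1,-1) for class 2
  x <- matrix(rnorm(2 * n, sd = 2), n, 2) + cbind(mu, mu)   # covariance 4I
  data.frame(x1 = x[, 1], x2 = x[, 2], y = factor(y))
}
train <- gen(200); test <- gen(2000)

lda_fit <- lda(y ~ x1 + x2, data = train)
log_fit <- glm(y ~ x1 + x2, data = train, family = binomial)

lda_err   <- mean(predict(lda_fit, test)$class != test$y)
log_err   <- mean((predict(log_fit, test, type = "response") > 0.5) != (test$y == "1"))
bayes_err <- mean((test$x1 + test$x2 > 0) != (test$y == "1"))   # Bayes rule: x1 + x2 > 0
c(bayes = bayes_err, lda = lda_err, logistic = log_err)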

[Figure: four panels for the simulated training data (x2 versus x1): "Training Scatter Plot"; "Bayes and Linear Classifiers" (bay, lin); "Logistic and LDA Classifiers" (log, lda); "All Linear Classifiers" (bay, lin, log, lda).]

Performance of Various Classifiers (n = 200)

              Bayes    Linear   Logistic   LDA
Train Error   0.250    0.255    0.255      0.255
Test Error    0.240    0.247    0.242      0.247

Fitted classification boundaries:
  Linear & LDA: x_2 = 0.48 - 1.73 x_1
  Logistic:     x_2 = 0.32 - 1.80 x_1

[Figure: a second set of the same four panels: "Training Scatter Plot"; "Bayes and Linear Classifiers" (bay, lin); "Logistic and LDA Classifiers" (log, lda); "All Linear Classifiers" (bay, lin, log, lda).]