Chap 2. Linear Classifiers (FTH, 4.1-4.4)
Yongdai Kim
Seoul National University
Linear methods for classification

1. Linear classifiers

For simplicity, we only consider two-class classification problems (i.e. $\mathcal{Y} = \{0, 1\}$ or $\mathcal{Y} = \{-1, 1\}$). The loss function for classification is the 0-1 loss given as $\ell(y, a) = I(y \neq a)$. Linear methods for classification assume that the decision boundary is given as $\{x : \beta_0 + x^\top \beta = 0\}$. That is, for a given linear function $f(x) = \beta_0 + x^\top \beta$ and $\mathcal{Y} = \{-1, 1\}$, we construct the corresponding classifier $G : \mathbb{R}^p \to \mathcal{Y}$ by $G(x) = \mathrm{sign}(f(x))$.
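As a minimal sketch of this construction (not from the lecture; NumPy and all names here are illustrative), the classifier is just the sign of an affine score:

```python
import numpy as np

def linear_classifier(X, beta0, beta):
    """Classify each row x of X as sign(beta0 + x' beta), taking sign(0) as +1."""
    scores = beta0 + X @ beta
    return np.where(scores >= 0.0, 1, -1)

# Illustrative use with arbitrary coefficients.
X = np.array([[0.5, 1.0], [-2.0, 0.3]])
print(linear_classifier(X, beta0=0.1, beta=np.array([1.0, -0.5])))  # [ 1 -1]
```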
Three popular linear classifiers

- Linear Discriminant Analysis (LDA): mixture-of-Gaussians model
- Logistic regression: regression approach
- Optimal separating hyperplane: machine learning approach (SVM)

* Among these, in this section we consider only LDA and logistic regression. We will study the optimal separating hyperplane when we study the SVM.
2. LDA

Model

Let $f_j(x)$ be the class-conditional density of $x$ in class $y = j$, where $y \in \{-1, 1\}$. Let $\pi_j$, $j = -1, 1$ be the prior probabilities (i.e. $\pi_j = \Pr(y = j)$). Suppose that we model each class density as multivariate Gaussian:
$$f_j(x) = \frac{1}{(2\pi)^{p/2} |\Sigma_j|^{1/2}} \exp\left(-\frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1} (x - \mu_j)\right)$$
where $\mu_j$ is the mean vector and $\Sigma_j$ is the covariance matrix.
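A direct translation of this density into code (a sketch, assuming NumPy; `scipy.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x)` would give the same value):

```python
import numpy as np

def gaussian_density(x, mu, Sigma):
    """Evaluate the class density f_j(x) above at a point x."""
    p = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)            # (x - mu)' Sigma^{-1} (x - mu)
    norm = (2.0 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm
```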
Bayes classifier

As we have seen in Chap 1, the Bayes classifier is given as
$$G(x) = \mathrm{sign}\left(\log \frac{\Pr(y = 1 \mid x)}{\Pr(y = -1 \mid x)}\right).$$
Since $\Pr(y = j \mid x) \propto f_j(x)\pi_j$, we have
$$\log \Pr(y = j \mid x) \,(= \delta_j(x)) = -\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x - \mu_j)^\top \Sigma_j^{-1}(x - \mu_j) + \log \pi_j + C.$$
Hence, the Bayes classifier is given as $G(x) = \mathrm{sign}(\delta_1(x) - \delta_{-1}(x))$. We call the functions $\delta_j(x)$ the discriminant functions.
LDA

We assume that $\Sigma_1 = \Sigma_{-1} (= \Sigma)$. In this case, we can easily see that the Bayes classifier is given as $G(x) = \mathrm{sign}(\delta_1(x) - \delta_{-1}(x))$ where
$$\delta_j(x) = x^\top \Sigma^{-1} \mu_j - \frac{1}{2}\mu_j^\top \Sigma^{-1} \mu_j + \log \pi_j.$$
That is, the Bayes classifier is a linear classifier. The functions $\delta_j$ are called the linear discriminant functions.
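A sketch of these linear discriminant functions and the resulting classifier, assuming the parameters are known (NumPy is assumed; all function names are illustrative):

```python
import numpy as np

def lda_delta(x, mu, Sigma_inv, log_pi):
    """delta_j(x) = x' Sigma^{-1} mu_j - (1/2) mu_j' Sigma^{-1} mu_j + log pi_j."""
    return x @ Sigma_inv @ mu - 0.5 * (mu @ Sigma_inv @ mu) + log_pi

def lda_classify(x, mu_pos, mu_neg, Sigma, pi_pos):
    """Bayes rule G(x) = sign(delta_{+1}(x) - delta_{-1}(x)) under a common covariance."""
    Sigma_inv = np.linalg.inv(Sigma)
    d_pos = lda_delta(x, mu_pos, Sigma_inv, np.log(pi_pos))
    d_neg = lda_delta(x, mu_neg, Sigma_inv, np.log(1.0 - pi_pos))
    return 1 if d_pos >= d_neg else -1
```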
QDA

QDA is an abbreviation of Quadratic Discriminant Analysis. When $\Sigma_1 \neq \Sigma_{-1}$, the decision boundary of the Bayes classifier is given by a quadratic function.
Estimation

We can easily estimate $\mu_j$ and $\Sigma_j$ by
$$\hat\mu_j = \sum_{i=1}^n x_i I(y_i = j) / n_j, \qquad \hat\Sigma_j = \sum_{i=1}^n (x_i - \hat\mu_j)(x_i - \hat\mu_j)^\top I(y_i = j) / (n_j - 1),$$
where $n_j = \sum_{i=1}^n I(y_i = j)$. Also, unless specified, we estimate $\pi_j$ by $n_j / n$. For LDA, we estimate $\Sigma$ by the pooled variance-covariance matrix
$$\hat\Sigma = \frac{(n_1 - 1)\hat\Sigma_1 + (n_{-1} - 1)\hat\Sigma_{-1}}{n - 2}.$$
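These plug-in estimates are easy to compute directly (a sketch, assuming NumPy; `X` is the $n \times p$ data matrix and `y` a NumPy array of labels in $\{-1, +1\}$):

```python
import numpy as np

def lda_estimates(X, y):
    """Plug-in estimates for the two-class LDA model."""
    n = len(y)
    mu, S, n_j = {}, {}, {}
    for j in (-1, 1):
        Xj = X[y == j]
        n_j[j] = len(Xj)
        mu[j] = Xj.mean(axis=0)                 # hat mu_j
        S[j] = np.cov(Xj, rowvar=False)         # hat Sigma_j (divides by n_j - 1)
    pi_hat = {j: n_j[j] / n for j in (-1, 1)}   # hat pi_j = n_j / n
    # Pooled covariance: ((n_1 - 1) S_1 + (n_{-1} - 1) S_{-1}) / (n - 2)
    Sigma_hat = ((n_j[1] - 1) * S[1] + (n_j[-1] - 1) * S[-1]) / (n - 2)
    return mu, pi_hat, Sigma_hat
```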
3. Logistic Regression

Model

Let $\mathcal{Y} = \{0, 1\}$. The logistic model assumes
$$\Pr(y = 1 \mid x) = \frac{\exp(\beta_0 + x^\top \beta)}{1 + \exp(\beta_0 + x^\top \beta)} \; (= \phi(x; \beta)).$$
* We abuse the notation slightly to let $\beta = (\beta_0, \beta)$.
Motivation 1

Consider a linear regression $\Pr(y = 1 \mid x) = \beta_0 + x^\top \beta$. It violates the constraint $\Pr(y = 1 \mid x) \in [0, 1]$. A simple remedy for this problem is to set $\Pr(y = 1 \mid x) = F(\beta_0 + x^\top \beta)$ where $F$ is a distribution function. Examples for $F$:

- Gaussian: probit model
- Gompertz: $F(x) = \exp(-\exp(-x))$, popularly used in insurance
- Logistic: $F(x) = \exp(x)/(1 + \exp(x))$
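For concreteness, the three choices of $F$ can be written down directly (a sketch; NumPy and SciPy's `norm.cdf` are assumed, and the function names are illustrative):

```python
import numpy as np
from scipy.stats import norm

def F_probit(t):
    return norm.cdf(t)                        # Gaussian distribution function

def F_gompertz(t):
    return np.exp(-np.exp(-t))                # Gompertz curve

def F_logistic(t):
    return np.exp(t) / (1.0 + np.exp(t))      # logistic distribution function
```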
Motivation 2

Consider the decision boundary $\{x : \Pr(Y = 1 \mid X = x) = 0.5\}$. This is equivalent to $\{x : \log(\Pr(Y = 1 \mid X = x)/\Pr(Y = 0 \mid X = x)) = 0\}$. Suppose that the log-odds is linear, that is,
$$\log \frac{\Pr(Y = 1 \mid X = x)}{\Pr(Y = 0 \mid X = x)} = \beta_0 + x^\top \beta.$$
This implies that
$$\Pr(Y = 1 \mid X = x) = \frac{\exp(\beta_0 + x^\top \beta)}{1 + \exp(\beta_0 + x^\top \beta)}.$$
Estimation

Use the maximum likelihood approach. The likelihood is simply the probability of the observations, given as
$$L(\beta) = \prod_{i=1}^n \Pr(y = y_i \mid x = x_i).$$
Estimate $\beta$ by maximizing the log-likelihood
$$\ell(\beta) = \sum_{i=1}^n \left( y_i(\beta_0 + x_i^\top \beta) - \log(1 + \exp(\beta_0 + x_i^\top \beta)) \right).$$
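This log-likelihood is one line of NumPy (a sketch; `np.logaddexp(0, eta)` computes $\log(1 + e^\eta)$ without overflow):

```python
import numpy as np

def log_likelihood(beta0, beta, X, y):
    """l(beta) for labels y_i in {0, 1}; X is the n x p matrix of inputs."""
    eta = beta0 + X @ beta
    return np.sum(y * eta - np.logaddexp(0.0, eta))
```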
Computation

One obstacle to using logistic regression is computation, since the log-likelihood cannot be maximized in closed form. However, we can do it efficiently using the Iteratively Reweighted Least Squares (IRLS) algorithm, explained as follows. Find the MLE of $\beta$ via the Newton-Raphson algorithm:
$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - \left(\frac{\partial^2 \ell(\beta)}{\partial \beta \, \partial \beta^\top}\right)^{-1} \frac{\partial \ell(\beta)}{\partial \beta}.$$
This is equivalent to
$$\beta^{\mathrm{new}} = (X^\top W X)^{-1} X^\top W z$$
where $X$ is the design matrix, $W$ is the $n \times n$ diagonal matrix whose $(i, i)$th element is $\phi(x_i; \beta^{\mathrm{old}})(1 - \phi(x_i; \beta^{\mathrm{old}}))$, and
$$z = X\beta^{\mathrm{old}} + W^{-1}(y - p)$$
with $y = (y_1, \ldots, y_n)^\top$ and $p = (\phi(x_i; \beta^{\mathrm{old}}), i = 1, \ldots, n)^\top$. Hence, $\beta^{\mathrm{new}}$ is the weighted least squares estimator with the adjusted response $z$:
$$\beta^{\mathrm{new}} = \mathop{\mathrm{argmin}}_\beta \, (z - X\beta)^\top W (z - X\beta).$$
To sum up, the MLE of the logistic regression coefficient can be obtained by applying weighted least squares iteratively.
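A minimal IRLS sketch, assuming NumPy and that `X` already contains a column of ones for the intercept. It implements the update above, using the identity $X^\top W z = X^\top W X \beta^{\mathrm{old}} + X^\top(y - p)$ so that $W^{-1}$ is never formed:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Logistic regression MLE via IRLS; y holds labels in {0, 1}."""
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(n_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))        # phi(x_i; beta_old)
        w = p * (1.0 - p)                     # diagonal of W
        # Solve (X'WX) beta_new = X'W z, with X'W z = X'WX beta + X'(y - p).
        XtWX = X.T @ (w[:, None] * X)
        rhs = XtWX @ beta + X.T @ (y - p)
        beta_new = np.linalg.solve(XtWX, rhs)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```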
4. LDA or Logistic regression

Note that logistic regression and LDA both have linear decision boundaries. Logistic regression only needs the specification of $\Pr(Y = 1 \mid X = x)$ (that is, $\Pr(X = x)$ is left completely unspecified). On the other hand, LDA needs the specification of the joint distribution $\Pr(Y, X)$. In fact, in LDA, the marginal distribution of $x$ is a mixture of Gaussians:
$$\Pr(x) = \pi_1 N(\mu_1, \Sigma) + \pi_{-1} N(\mu_{-1}, \Sigma).$$
Hence, LDA makes stronger assumptions and so has narrower applicability than logistic regression.
Also, categorical input variables are allowed in logistic regression (using dummy variables), while LDA has trouble with such inputs. However, LDA is a useful tool when some of the outputs are missing (semi-supervised learning).
5. Extension to Multi-class problems

Let $\mathcal{Y} = \{1, \ldots, K\}$. In this case, we construct $K$ linear functions $f_k(x) = \beta_{0k} + x^\top \beta_k$ for $k = 1, \ldots, K$. Then, construct a classifier by $G(x) = \mathrm{argmax}_k f_k(x)$.
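In code, the argmax rule is a one-liner (a sketch, assuming NumPy; `beta0s` and `betas` stack the $K$ intercepts and coefficient vectors):

```python
import numpy as np

def multiclass_classify(x, beta0s, betas):
    """G(x) = argmax_k f_k(x), with beta0s of length K and betas of shape K x p."""
    scores = beta0s + betas @ x      # f_k(x) = beta_{0k} + x' beta_k for k = 1..K
    return int(np.argmax(scores)) + 1
```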
Linear regression

For $k = 1, \ldots, K$, let $y_i^{(k)} = I(y_i = k)$ and construct $f_k(x)$ by regressing the $y_i^{(k)}$'s on the $x_i$'s. Note that $E(y_i^{(k)} \mid x_i) = \Pr(y_i = k \mid x_i)$, and hence we would expect that it works reasonably well. A sketch of this indicator regression is given below.
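A sketch of the indicator-regression approach (assuming NumPy; class labels are $1, \ldots, K$ and all function names are illustrative):

```python
import numpy as np

def indicator_regression(X, y, K):
    """Least squares fit of each indicator y^{(k)} = I(y = k) on x."""
    n = len(y)
    X1 = np.hstack([np.ones((n, 1)), X])                       # prepend an intercept column
    Y = np.equal.outer(y, np.arange(1, K + 1)).astype(float)   # n x K indicator matrix
    B, *_ = np.linalg.lstsq(X1, Y, rcond=None)                 # column k is (beta_{0k}, beta_k)
    return B

def indicator_predict(B, x):
    scores = B.T @ np.concatenate([[1.0], x])                  # f_k(x) for k = 1..K
    return int(np.argmax(scores)) + 1
```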
LDA and QDA

LDA: simply assume that $x_i \mid y_i = k \sim N_p(\mu_k, \Sigma)$. Estimate the $\mu_k$'s and $\Sigma$ from the data and construct the Bayes classifier.

QDA: assume $x_i \mid y_i = k \sim N_p(\mu_k, \Sigma_k)$. Estimate the $\mu_k$'s and $\Sigma_k$'s from the data and construct the Bayes classifier.
Logistic regression

Assume $\Pr(y = k \mid x) \propto \exp(\beta_{0k} + x^\top \beta_k)$. That is,
$$\Pr(y = k \mid x) = \frac{\exp(\beta_{0k} + x^\top \beta_k)}{\sum_{l=1}^K \exp(\beta_{0l} + x^\top \beta_l)}.$$
For identifiability, we let $\beta_{01} = 0$ and $\beta_1 = 0$. The parameters are estimated by the maximum likelihood estimator.
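The class probabilities are a softmax of the $K$ linear scores; a sketch (assuming NumPy, with `B0` and `B` stacking the intercepts and coefficient vectors, and `B0[0] = 0`, `B[0] = 0` for identifiability):

```python
import numpy as np

def multinomial_probs(x, B0, B):
    """Pr(y = k | x) for k = 1..K via the softmax formula above."""
    eta = B0 + B @ x           # eta_k = beta_{0k} + x' beta_k, length K
    eta = eta - eta.max()      # subtract the max before exponentiating, for stability
    w = np.exp(eta)
    return w / w.sum()
```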
Masking effect of the linear regression

[Two figure slides. With $K \geq 3$ classes, the linear regression of indicators can "mask" a class: the fitted $f_k(x)$ of a middle class may never be the largest, so that class is never predicted.]
Empirical comparison

Go to http://www-stat.stanford.edu/~tibs/elemstatlearn/ for the vowel data set.
6. HW

Reconstruct Table 4.1.