ECLT 5810 Linear Regression and Logistic Regression for Classification Prof. Wai Lam

Linear Regression Models: Least Squares

Input vectors: $X = (X_1, X_2, \ldots, X_p)$, where each $X_j$ is an attribute / feature / predictor (independent variable). The linear regression model is

    $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$

The output $Y$ is called the response (dependent variable), and the $\beta_j$'s are unknown parameters (coefficients).

Linear Regression Models: Least Squares

We are given a set of training data $(x_1, y_1), \ldots, (x_N, y_N)$. Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$ corresponds to $p$ attributes; each $y_i$ is a class attribute value / label. We wish to estimate the parameters $\beta = (\beta_0, \beta_1, \ldots, \beta_p)$.

Linear Regression Models: Least Squares

One common approach is the method of least squares: pick the coefficients $\beta$ to minimize the residual sum of squares

    $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij} \beta_j \Big)^2$

Linear Regression Models: Least Squares

This criterion is reasonable if the training observations $(x_i, y_i)$ represent independent random draws. Even if the $x_i$'s were not drawn randomly, the criterion is still valid if the $y_i$'s are conditionally independent given the inputs $x_i$.

Linear Regression Models: Least Squares

Least squares makes no assumption about the validity of the model; it simply finds the best linear fit to the data.

Linear Regression Models: Finding the Residual Sum of Squares

Denote by $\mathbf{X}$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position), and let $\mathbf{y}$ be the $N$-vector of outputs in the training set. The residual sum of squares is then a quadratic function in the parameters:

    $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta)$

Linear Regression Models: Finding the Residual Sum of Squares

Setting the first derivative to zero,

    $\mathbf{X}^T (\mathbf{y} - \mathbf{X}\beta) = 0$

we obtain the unique solution (assuming $\mathbf{X}^T \mathbf{X}$ is nonsingular):

    $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
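As a minimal sketch (not part of the slides), the closed-form solution translates directly into NumPy; for real data, np.linalg.lstsq is the more numerically stable choice:

    import numpy as np

    def least_squares(X, y):
        """Closed-form least squares: beta_hat = (X^T X)^{-1} X^T y.
        X must already contain a leading column of 1's for the intercept."""
        return np.linalg.solve(X.T @ X, X.T @ y)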

Linear Regression Models: Orthogonal Projection

The fitted values at the training inputs are

    $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$

The matrix $\mathbf{H} = \mathbf{X} (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ appearing in the above equation is called the hat matrix, because it puts the hat on $\mathbf{y}$.

Linear Regression Models: Example

Training data:

    x            y
    (1, 2, 1)    22
    (2, 0, 4)    49
    (3, 4, 2)    39
    (4, 2, 3)    52
    (5, 4, 1)    38

Linear Regression Models: Example

With a leading column of 1's,

    $\mathbf{X} = \begin{pmatrix} 1 & 1 & 2 & 1 \\ 1 & 2 & 0 & 4 \\ 1 & 3 & 4 & 2 \\ 1 & 4 & 2 & 3 \\ 1 & 5 & 4 & 1 \end{pmatrix}, \qquad \mathbf{y} = \begin{pmatrix} 22 \\ 49 \\ 39 \\ 52 \\ 38 \end{pmatrix}$

    $\hat{\beta} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} \approx (8.13,\ 4.04,\ 0.51,\ 8.43)^T$

Linear Regression Models: Example

Fitted values $\hat{y}_i$ and residuals $y_i - \hat{y}_i$:

    y     y-hat    residual
    22    21.61     0.39
    49    49.91    -0.91
    39    39.13    -0.13
    52    50.57     1.43
    38    38.78    -0.78
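The numbers above can be checked with a short NumPy sketch (my own illustration, mirroring the formulas on the previous slides):

    import numpy as np

    X = np.array([[1, 1, 2, 1],
                  [1, 2, 0, 4],
                  [1, 3, 4, 2],
                  [1, 4, 2, 3],
                  [1, 5, 4, 1]], dtype=float)
    y = np.array([22, 49, 39, 52, 38], dtype=float)

    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # ~ [8.13, 4.04, 0.51, 8.43]

    H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix
    y_hat = H @ y                                 # ~ [21.61, 49.91, 39.13, 50.57, 38.78]
    residuals = y - y_hat                         # ~ [0.39, -0.91, -0.13, 1.43, -0.78]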

Discriminant Functions

Suppose there are $K$ classes, labeled $1, 2, \ldots, K$. A class of methods models a discriminant function $\delta_k(x)$ for each class, and then classifies $x$ to the class with the largest value of its discriminant function. The decision boundary between class $k$ and class $l$ is the set of points for which $\delta_k(x) = \delta_l(x)$.

Discriminant Functions

Suppose $f_k(x)$ is the class-conditional density of $X$ in class $k$, i.e., the density of $X$ given $G = k$. Let $\pi_k$ be the prior probability of class $k$, with $\sum_{k=1}^{K} \pi_k = 1$. A simple application of Bayes' theorem gives

    $\Pr(G = k \mid X = x) = \dfrac{f_k(x)\, \pi_k}{\sum_{l=1}^{K} f_l(x)\, \pi_l}$
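A small sketch of this computation (the Gaussian class-conditional densities and priors are illustrative assumptions, not from the slides):

    import numpy as np
    from scipy.stats import norm

    priors = np.array([0.7, 0.3])                  # pi_k
    means = np.array([0.0, 2.0])                   # assumed 1-D Gaussian f_k
    sds = np.array([1.0, 1.0])

    def posterior(x):
        """Pr(G = k | X = x) via Bayes' theorem."""
        f = norm.pdf(x, loc=means, scale=sds)      # f_k(x)
        return f * priors / np.sum(f * priors)

    print(posterior(1.0))                          # posteriors sum to 1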

Logistic Regression

We would like to model the posterior probabilities of the $K$ classes via linear functions in $x$ (a $p$-dimensional vector), while ensuring that they sum to one and remain in $[0, 1]$. The model is

    $\log \dfrac{\Pr(G = k \mid X = x)}{\Pr(G = K \mid X = x)} = \beta_{k0} + \beta_k^T x, \qquad k = 1, \ldots, K-1$

Logistic Regression

The model is specified in terms of $K-1$ log-odds or logit transformations. The choice of denominator (class $K$ here) is arbitrary; the estimates are equivariant under this choice. Solving for the posteriors,

    $\Pr(G = k \mid X = x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}, \qquad k = 1, \ldots, K-1$

    $\Pr(G = K \mid X = x) = \dfrac{1}{1 + \sum_{l=1}^{K-1} \exp(\beta_{l0} + \beta_l^T x)}$

which clearly sum to 1.

Two-Class Classification

For two-class classification, we can label the two classes 0 and 1. Treating class 1 as the concept of interest, the posterior probability can be regarded as the class-membership probability:

    $\Pr(Y = 1 \mid X = x) = \dfrac{\exp(\beta_0 + \beta^T x)}{1 + \exp(\beta_0 + \beta^T x)}$

This is the logistic function. As a result, it maps $x$ in $p$-dimensional space to a value in $[0, 1]$.
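A minimal sketch of the logistic function in code (my own illustration):

    import numpy as np

    def sigmoid(z):
        """Logistic function: maps any real z to (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def class_membership_prob(x, beta0, beta):
        """Pr(Y = 1 | X = x) under the two-class logistic model."""
        return sigmoid(beta0 + np.dot(beta, x))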

Shape of the Sigmoid Curve

Consider the one-dimensional case:

    $\Pr(Y = 1 \mid x) = \dfrac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}$

(Figure: the S-shaped logistic curve, rising from 0 to 1.)

An Example in One Dimension

We wish to predict death from the baseline APACHE II score of patients. Let $\pi(x)$ be the probability that a patient with score $x$ will die. Note that linear regression would not work well here, since it could produce probabilities less than 0 or greater than 1.

An Example in One Dimension

Data with a sharp cut-off point between patients who survive and patients who die will lead to a large value of $\beta_1$ (a steep sigmoid).

An Example in One Dimension

On the other hand, if the data show a lengthy transition from survival to death, the fit will have a low value of $\beta_1$ (a gradual sigmoid).
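To see the effect of the slope numerically (illustrative coefficients, not the fitted APACHE II model):

    import numpy as np

    def p(x, b0, b1):
        return 1 / (1 + np.exp(-(b0 + b1 * x)))

    x = np.array([-1.0, 0.0, 1.0])
    print(p(x, 0.0, 10.0))  # steep slope: ~[0.00005, 0.5, 0.99995] (sharp cut-off)
    print(p(x, 0.0, 0.5))   # small slope: ~[0.38, 0.5, 0.62] (lengthy transition)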

Model Fitting for the General Case (K Classes, p Dimensions)

Logistic regression models are usually fit by maximum likelihood, using the conditional likelihood of $G$ given $X$. Since $\Pr(G \mid X)$ completely specifies the conditional distribution, the multinomial distribution is appropriate.

Model Fitting for the General Case (K Classes, p Dimensions)

Let the entire parameter set be $\theta = \{\beta_{10}, \beta_1, \ldots, \beta_{(K-1)0}, \beta_{K-1}\}$. The log-likelihood for $N$ observations of input data and class labels is

    $\ell(\theta) = \sum_{i=1}^{N} \log p_{g_i}(x_i; \theta)$

where $p_k(x_i; \theta) = \Pr(G = k \mid X = x_i; \theta)$. We find the model that maximizes the log-likelihood.
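For the two-class case, a minimal maximum-likelihood sketch (the optimizer choice is mine; the slides do not prescribe one):

    import numpy as np
    from scipy.optimize import minimize

    def neg_log_lik(theta, X, y):
        """Negative log-likelihood for two-class logistic regression.
        X includes a leading column of 1's; y contains 0/1 labels."""
        z = X @ theta
        # log(1 + exp(z)) computed stably via logaddexp(0, z)
        return np.sum(np.logaddexp(0.0, z)) - np.sum(y * z)

    def fit_logistic(X, y):
        theta0 = np.zeros(X.shape[1])
        return minimize(neg_log_lik, theta0, args=(X, y), method="BFGS").x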

Example

The data are a subset of the Coronary Risk Factor Study (CORIS) baseline survey, carried out in three rural areas of the Western Cape, South Africa. The aim was to establish the intensity of ischemic heart disease risk factors in that high-incidence region. The response variable is the presence or absence of myocardial infarction (MI) at the time of the survey. There are 160 cases in the data set and a sample of 302 controls.

Example

(Figure showing the data; not reproduced in the transcription.)

Example

We fit a logistic regression model by maximum likelihood, giving the results shown on the next slide. The z-score for each coefficient in the model is the coefficient divided by its standard error.

Example

Results from a logistic regression fit to the South African heart disease data:

                  Coefficient   Std. Error   Z Score
    (Intercept)     -4.130         0.964      -4.285
    sbp              0.006         0.006       1.023
    tobacco          0.080         0.026       3.034
    ldl              0.185         0.057       3.219
    famhist          0.939         0.225       4.178
    obesity         -0.035         0.029      -1.187
    alcohol          0.001         0.004       0.136
    age              0.043         0.010       4.184
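A hedged sketch of how such a table could be reproduced with statsmodels (the file name SAheart.csv and the column names are assumptions; the course did not specify software):

    import pandas as pd
    import statsmodels.api as sm

    df = pd.read_csv("SAheart.csv")   # hypothetical file holding the CORIS data
    predictors = ["sbp", "tobacco", "ldl", "famhist", "obesity", "alcohol", "age"]
    X = sm.add_constant(df[predictors])   # famhist assumed already coded 0/1

    res = sm.Logit(df["chd"], X).fit()    # "chd" = assumed 0/1 response column
    print(res.params)                     # coefficients
    print(res.bse)                        # standard errors
    print(res.params / res.bse)           # z-scores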

Example

A z-score greater than approximately 2 in absolute value is significant at the 5% level. There are some surprises in the table of coefficients: sbp and obesity appear not to be significant. On their own, both sbp and obesity are significant, with a positive sign. In the presence of many other correlated variables, however, they are no longer needed (and can even pick up a negative sign).

Three common transformations / link functions (provided by SAS):

- Logit: $\ln\big(p / (1 - p)\big)$ (the log odds)
- Probit: $\Phi^{-1}(p)$, the inverse of the standard normal CDF (recall the normal table's mapping)
- Complementary log-log: $\ln\big(-\ln(1 - p)\big)$

The choice of link function depends on your purpose rather than on performance: they all perform about equally well, but their interpretations differ slightly.
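The three links side by side in a quick sketch (illustrative value of p):

    import numpy as np
    from scipy.stats import norm

    p = 0.7
    logit = np.log(p / (1 - p))          # log odds:              ~0.847
    probit = norm.ppf(p)                 # inverse normal CDF:    ~0.524
    cloglog = np.log(-np.log(1 - p))     # complementary log-log: ~0.186
    print(logit, probit, cloglog)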


Related Measures Wald s Chi-square We could treat an effect as significant if the tail probability is small enough (< 5%). If we are using the model for predicting the outcome rather than the probability for that outcome (the case when the criterion is set to minimize loss), the interpretation for misclassification rate/ profit and loss/ ROC curve/ lift chart is similar to those for decision tree. 31