Lecture 15: Logistic Regression


1 Lecture 15: Logistic Regression
William Webber
COMP90042, 2014, Semester 1, Lecture 15

2 What we'll learn in this lecture
Model-based regression and classification
Logistic regression as a probabilistic classifier

3 Model-based regression and classification
NB (Naive Bayes) is an instance of model-based probabilistic classification.
In more general form, expressible as:
$$P(c \mid \vec{x}) = f(\vec{x}, \vec{\beta}) \tag{1}$$
where:
  $f()$ is some function
  $\vec{x}$ is the vector of feature scores, $\{x_1, \dots, x_n\}$
  $\vec{\beta}$ is the vector of feature weights, $\{\beta_0, \beta_1, \dots, \beta_n\}$
  $\beta_0$ is for the intercept
More specifically:
$$P(c \mid \vec{x}) = f(\{\beta_0, \beta_1 x_1, \dots, \beta_n x_n\}) \tag{2}$$
Idea is then to learn the best $\vec{\beta}$.

4 Linear model
$$P(c \mid \vec{x}) = f(\vec{x}, \vec{\beta}) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \tag{3}$$
Might try a simple linear model.
Fitted with ordinary least squares ("straight line [hyperplane] of best fit").

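5 Linear model
$$P(c \mid \vec{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n \tag{4}$$
[Figure: $P(c \mid x)$ plotted against $x$]
But probabilities are bounded between 0 and 1.
Meaning of probabilities outside this range is unclear.
Artificial to bound $\vec{\beta}$ to this range.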
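To make the range problem concrete, here is a minimal sketch, assuming a single feature and a toy data set invented for illustration. The names `fit_ols_1d` and `linear_p` are hypothetical and not from the lecture. It fits a straight line to binary labels by ordinary least squares and shows that the fitted "probabilities" can leave [0, 1].

```python
def fit_ols_1d(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


# Toy data (an assumption for illustration): one feature, binary labels.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

b0, b1 = fit_ols_1d(xs, ys)


def linear_p(x):
    """Linear-model 'probability' as in equation (4), single feature."""
    return b0 + b1 * x


for x in (0.0, 2.5, 5.0, 10.0):
    print(f"x = {x:4.1f}  linear P = {linear_p(x):.3f}")
```

On this toy data the fitted line dips below 0 at the low end of the feature range and rises above 1 at the high end, which is the behaviour the slide objects to.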

6 Sigmoid model
[Figure: $P(c \mid x)$ plotted against $x$]
What we want is a response variable ($y$, $P(c \mid x)$) bounded in [0, 1].
But the predictor variable, $x_i$, is unbounded (at least by the model).
General shape of such a function is a sigmoid or "S-shaped" curve.

7 Log-linear models
Natural (see NB) to express total probability as a (weighted) product of individual probabilities exponentiated by frequency of events:
$$P(c \mid \vec{x}) = \beta_0 \, \beta_1^{x_1} \cdots \beta_n^{x_n} \tag{5}$$
Taking the log of this gives a log-linear model:
$$\log P(c \mid \vec{x}) = \log \beta_0 + x_1 \log \beta_1 + \dots + x_n \log \beta_n \tag{6}$$
Directly fit $\log \beta_i$, so can write as:
$$\log P(c \mid \vec{x}) = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \tag{7}$$

8 $\log(P) = \beta x$
[Figure: $P(c \mid x)$ plotted against $x$ for $\log(P) = \beta x$, that is $P = e^{\beta x}$]
But the curve has an unbalanced shape:
  Fine granularity of response as $P \to 0$
  Coarse response as $P \to 1$
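The imbalance can be seen numerically. The following is a small sketch, assuming a single combined score `beta_x` (a hypothetical name standing for the product of weight and feature) that moves in equal steps. It tabulates P = exp(beta_x) and the change in P per step.

```python
import math

# Equal steps in the linear predictor (values chosen for illustration).
beta_x = [-5.0, -4.0, -3.0, -2.0, -1.0, 0.0]

# Log-linear model: log P = beta * x, so P = exp(beta * x).
probs = [math.exp(z) for z in beta_x]

# Change in P produced by each unit step in beta * x.
steps = [b - a for a, b in zip(probs, probs[1:])]

for z, p in zip(beta_x, probs):
    print(f"beta*x = {z:5.1f}  P = {p:.4f}")
print("step sizes:", [round(s, 4) for s in steps])
```

The same unit step moves P by only about 0.01 near P = 0 but by more than 0.6 as P approaches 1. Nothing in the model stops P from exceeding 1 once the score becomes positive.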

9 Balanced in P
Want behaviour that is the same for high P and low P.
This is provided by the log odds, or logit:
$$\operatorname{logit}(P) = \log \frac{P}{1 - P} \tag{8}$$
$$\operatorname{logit}(1 - P) = -\operatorname{logit}(P) \tag{9}$$
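The symmetry in (9) is easy to check numerically. This is a minimal sketch using only the standard library. The function name `logit` follows the slide and the probe values are arbitrary.

```python
import math


def logit(p):
    """Log odds of a probability p, for 0 < p < 1 (equation 8)."""
    return math.log(p / (1.0 - p))


for p in (0.1, 0.25, 0.5, 0.75, 0.9):
    print(f"p = {p:.2f}  logit(p) = {logit(p):+.4f}  "
          f"logit(1-p) = {logit(1.0 - p):+.4f}")
```

The logit is 0 at P = 0.5 and takes equal and opposite values at P and 1 - P, so high and low probabilities are treated evenly.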

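10 Logistic regression
Putting this together, we get:
$$\operatorname{logit} P(c \mid \vec{x}) = \log \frac{P(c \mid \vec{x})}{1 - P(c \mid \vec{x})} = \beta_0 + \beta_1 x_1 + \dots + \beta_n x_n \tag{10}$$
$$P(c \mid \vec{x}) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n)}} \tag{11}$$
Expression on the rhs of (11) is known as the logistic function.
So this is called logistic regression.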
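The step from (10) to (11) is a short piece of algebra. As a worked sketch, write $z$ (a shorthand introduced here, not in the slides) for the linear predictor $\beta_0 + \beta_1 x_1 + \dots + \beta_n x_n$ and $P$ for $P(c \mid \vec{x})$:

$$
\begin{aligned}
\log \frac{P}{1 - P} &= z \\
\frac{P}{1 - P} &= e^{z} \\
P &= e^{z} (1 - P) \\
P \, (1 + e^{z}) &= e^{z} \\
P &= \frac{e^{z}}{1 + e^{z}} = \frac{1}{1 + e^{-z}}
\end{aligned}
$$

The last line is the logistic function applied to $z$, which is the right-hand side of (11).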

11 Logistic function
$$y = \frac{1}{1 + e^{-(\beta_0 + \beta x)}} \tag{12}$$
[Figure: $P(c \mid x)$ plotted against $x$]
And, happily, the logistic function is sigmoid.
(Indeed, it is the archetypal sigmoid function.)
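The properties claimed for the logistic function can be checked with a few lines of code. This is a sketch, assuming the helper names `logistic` and `logit` (the branch on the sign of `z` is a common numerical-stability choice and is not something the lecture specifies).

```python
import math


def logit(p):
    """Log odds of a probability p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"z = {z:+.1f}  logistic(z) = {logistic(z):.4f}")
```

The output rises monotonically from near 0 to near 1 and passes through 0.5 at z = 0, giving the S-shaped curve. Applying `logistic` to `logit(p)` returns `p`, which is the sense in which the two functions are inverses.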

12 Fitting the model

Doc   Terms (X_d)                            Class (y)
1     X_11  X_12  ...  X_1t  ...  X_1n       1
2     X_21  X_22  ...  X_2t  ...  X_2n       0
...
d     X_d1  X_d2  ...  X_dt  ...  X_dn       0
...
m     X_m1  X_m2  ...  X_mt  ...  X_mn       1

Training data: feature vectors $X$ with labels $y$.
Labels for binary classification: member, or non-member.
Have to determine the vector $\vec{\beta}$ such that:
$$P(y_d \mid X_d) = \left(1 + \exp\left(-\left(\beta_0 + \sum_i \beta_i X_{di}\right)\right)\right)^{-1} \tag{13}$$
best fits the data.
Free to use any values for $X_{dt}$.
Length-normalized TF*IDF is one choice.
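As an illustration of (13) for a single document, here is a minimal sketch. It assumes that length normalization means dividing by the Euclidean norm, and every number in it (the raw scores and the weights) is made up. `length_normalize` and `predict_proba` are hypothetical helper names.

```python
import math


def length_normalize(vec):
    """Scale a feature vector to unit Euclidean length (an assumed choice)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def predict_proba(beta0, beta, x):
    """Equation (13): logistic function of the weighted feature sum."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical raw term scores for one document with three terms.
raw_doc = [3.0, 0.0, 4.0]
x = length_normalize(raw_doc)

# Hypothetical weights: intercept plus one weight per term.
beta0 = -0.5
beta = [2.0, -1.0, 0.5]

p = predict_proba(beta0, beta, x)
print("normalized features:", x)
print(f"P(y = 1 | x) = {p:.4f}")
```

Fitting the model then amounts to choosing the intercept and weights so that these probabilities agree as well as possible with the observed labels.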

13 Data and model
[Figure: $y$ / $P(y \mid x)$ plotted against $x$]
The data being fitted are binary.
The fitted value is a probability, $P(y_d = c \mid X_d)$.
We're fitting a curve of Bernoulli (one-event binomial) variables...
...that best fits the observed data.

14 Maximum likelihood estimation
For weights $\vec{\beta}$, the likelihood of the data $X$ and labels $y$ given that model is:
$$L(\vec{\beta}) = \prod_{l : y_l = 1} P(X_l) \prod_{l : y_l = 0} \left[1 - P(X_l)\right] \tag{14}$$
For the logistic model:
$$P(X_l) = \frac{1}{1 + e^{-(\beta_0 + \sum_i \beta_i X_{li})}} \tag{15}$$
We have to find the $\vec{\beta}$ that maximizes (14).
This is done by a computer using iterative methods.
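The lecture does not say which iterative method is used. As one possibility, the sketch below uses plain batch gradient ascent on the log of (14), whose gradient with respect to each weight is the sum over documents of (label minus predicted probability) times the feature value. The toy data, learning rate, iteration count, and function names are all assumptions for illustration.

```python
import math


def logistic(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(beta, row):
    """Equation (15): beta[0] is the intercept, beta[1:] the weights."""
    return logistic(beta[0] + sum(b * x for b, x in zip(beta[1:], row)))


def log_likelihood(beta, X, y):
    """Log of equation (14)."""
    total = 0.0
    for row, label in zip(X, y):
        p = predict(beta, row)
        total += math.log(p) if label == 1 else math.log(1.0 - p)
    return total


def fit_logistic(X, y, learning_rate=0.05, n_iters=5000):
    """Batch gradient ascent on the log-likelihood, starting from zeros."""
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iters):
        grad = [0.0] * len(beta)
        for row, label in zip(X, y):
            err = label - predict(beta, row)
            grad[0] += err
            for i, x in enumerate(row):
                grad[i + 1] += err * x
        beta = [b + learning_rate * g for b, g in zip(beta, grad)]
    return beta


# Toy, non-separable data (invented): one feature per document.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 1, 0, 1, 1]

beta = fit_logistic(X, y)
print("fitted beta:", [round(b, 3) for b in beta])
print("log-likelihood:", round(log_likelihood(beta, X, y), 4))
```

Practical implementations usually rely on faster optimizers and add regularization. The point here is only that the fit is found by repeatedly nudging the weights in the direction that increases the likelihood.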

15 Logistic regression in practice
[Table: Normalized AUC on spam filtering for the classifiers NB, NB-IR, Log. Reg, and SVM on the collections hotmail, trec-2005, and trec-2006; the values are not recoverable from this copy. From Kolcz and Yih, Raising the Baseline for High-Precision Text Classifiers, KDD.]
NB-IR is NB with IR features (length-normalized TF*IDF).
Logistic regression for text classification is generally almost, but not quite, as good as SVM.
(Note that on this task, NB with LN-TF*IDF does well... and see the paper for variants that do even better.)
On our GCAT 1000/1000 data, with length-normalized TF*IDF features, LR got accuracy 93%, F1 88%.

16 Interpreting logistic regression: weights
$\beta_i$ for term $i$ gives the importance of that term in the model (but interpretation is subject to term dependencies).
For topic GCAT (Govt/Social), the highest-weight terms were (weight values are not recoverable from this copy):

Positive term    Negative term
sunday           shar
socc             newsroom
minist           trad
eu               stock
saturday         compan
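One way to read a fitted model is to rank terms by weight and convert each weight to an odds ratio: from (10), raising $x_i$ by one unit with everything else held fixed adds $\beta_i$ to the log odds, so it multiplies the odds by $e^{\beta_i}$. The sketch below uses placeholder terms and weights that are entirely hypothetical. They are not the GCAT values from the lecture.

```python
import math

# Hypothetical fitted weights (placeholders, not the lecture's results).
weights = {"term_a": 1.8, "term_b": 0.9, "term_c": -0.2, "term_d": -1.4}

# Rank terms from most positive to most negative weight.
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
top_positive = ranked[:2]
top_negative = ranked[::-1][:2]

# exp(weight) is the multiplicative change in odds per unit of the feature.
odds_ratio = {term: math.exp(w) for term, w in weights.items()}

print("most positive:", top_positive)
print("most negative:", top_negative)
for term, ratio in odds_ratio.items():
    print(f"{term}: odds multiplied by {ratio:.2f} per unit increase")
```

As the slide warns, correlated terms share credit between their weights, so a ranking like this is a guide to importance and not a causal statement.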

17 Interpreting logistic regression: probabilities
Logistic regression directly gives reasonable probabilities (given the constraint of the model).
[Table for GCAT 1000/1000 with columns "P(c) <" and "% positive"; the values are not recoverable from this copy.]
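A table of predicted probability against percentage positive can be built by binning documents on their predicted P(c) and counting the true positives in each bin. The following is a minimal sketch, assuming equal-width bins. The function name `calibration_table` and the ten scored documents are made up for illustration.

```python
def calibration_table(probs, labels, n_bins=5):
    """Return (upper bin edge, count, fraction positive) for each bin."""
    counts = [0] * n_bins
    positives = [0] * n_bins
    for p, label in zip(probs, labels):
        # Clamp so that p == 1.0 falls into the last bin.
        b = min(int(p * n_bins), n_bins - 1)
        counts[b] += 1
        positives[b] += label
    rows = []
    for b in range(n_bins):
        upper = (b + 1) / n_bins
        frac = positives[b] / counts[b] if counts[b] else None
        rows.append((upper, counts[b], frac))
    return rows


# Hypothetical predicted probabilities and true labels.
probs = [0.05, 0.15, 0.30, 0.35, 0.55, 0.65, 0.85, 0.90, 0.95, 0.10]
labels = [0, 0, 0, 1, 1, 0, 1, 1, 1, 0]

for upper, count, frac in calibration_table(probs, labels):
    shown = "n/a" if frac is None else f"{100 * frac:.0f}%"
    print(f"P(c) < {upper:.1f}: {count} docs, {shown} positive")
```

If the probabilities are reasonable, the percentage positive in each bin should rise with the bin's probability range and sit roughly inside it. With only ten made-up documents the output is far too noisy to show that.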

18 Looking back and forward
Back
Model as $P(c \mid \vec{x}) = f(\beta_1 x_1, \dots, \beta_n x_n)$, where:
  $x_i$ is the feature score (differs for each document)
  $\beta_i$ is the feature weight (common across documents)
Learn the weights that best fit the training data.
Free to use whatever values for $x_i$ (e.g. normalized TF*IDF).
But probabilities are bounded in [0, 1].

19 Looking back and forward
Back
Sigmoid function maps unbounded feature scores to bounded probabilities.
Log odds gives even treatment to high and low probabilities.
Logistic model ties these together.
Learn the weights $\vec{\beta}$ using maximum likelihood.
Effectiveness is almost, but not quite, as good as SVM.
But it gives us feature weights and reasonable probabilities.

20 Looking back and forward
Forward
Next lecture: advanced topics in classification, e.g. active learning.
Later: topic modelling.

21 Further reading
Kleinbaum and Klein, Logistic Regression, 3rd edn (2010) (detailed, gradual introduction to logistic regression)
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (2001) (briefer, more technical description)
