Lecture 15: Logistic Regression


Lecture 15: Logistic Regression
William Webber (william@williamwebber.com)
COMP90042, 2014, Semester 1, Lecture 15

What we'll learn in this lecture:
Model-based regression and classification
Logistic regression as a probabilistic classifier

Model-based regression and classification
NB is an instance of model-based probabilistic classification. In more general form, expressible as:

    P(c|x) = f(x, β)    (1)

where:
f() is some function
x is the vector of feature scores, {x_1, ..., x_n}
β is the vector of feature weights, {β_0, β_1, ..., β_n}; β_0 is the intercept
More specifically:

    P(c|x) = f({β_0, β_1 x_1, ..., β_n x_n})    (2)

The idea is then to learn the best β.

Linear model
Might try a simple linear model:

    P(c|x) = f(x, β) = β_0 + β_1 x_1 + ... + β_n x_n    (3)

Fitted with ordinary least squares (the straight line [hyperplane] of best fit).

Linear model

    P(c|x) = β_0 + β_1 x_1 + β_2 x_2 + ... + β_n x_n    (4)

[Figure: linear fit of P(c|x) against x]

But probabilities are bounded between 0 and 1
The meaning of probabilities outside this range is unclear
Artificial to bound β so that predictions stay in this range
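
An illustrative sketch (synthetic data, not from the lecture) of the problem: an ordinary least squares fit to 0/1 labels happily predicts values outside [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-4, 4, size=100)
y = (x + rng.normal(scale=1.0, size=100) > 0).astype(float)  # binary class labels

# Ordinary least squares fit of y on x (straight line of best fit).
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept + slope * np.array([-4.0, 0.0, 4.0]))
# The endpoint predictions fall below 0 and above 1, which is hard to read as a probability.
```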

Sigmoid model

[Figure: S-shaped curve of P(c|x) against x]

What we want is a response variable (y, i.e. P(c|x)) bounded between [0, 1]
But the predictor variables, x_i, are unbounded (at least by the model)
The general shape of such a function is a sigmoid or S-shaped curve
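
A minimal numpy sketch (not from the slides) of the standard sigmoid σ(z) = 1 / (1 + e^{-z}), showing how unbounded inputs are squashed into (0, 1).

```python
import numpy as np

def sigmoid(z):
    # Map any real-valued score z into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
print(sigmoid(z))  # approx [0.00005, 0.269, 0.5, 0.731, 0.99995]
```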

Log-linear models

    P(c|x) = β_0 · β_1^{x_1} · ... · β_n^{x_n}    (5)
    log P(c|x) = log β_0 + x_1 log β_1 + ... + x_n log β_n    (6)

Natural (see NB) to express the total probability as a (weighted) product of individual probabilities, each exponentiated by the frequency of its event. Taking the log of this gives a log-linear model. We directly fit log β_i, so can write as:

    log P(c|x) = β_0 + β_1 x_1 + ... + β_n x_n    (7)
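
A small numeric check (with made-up β and x values, each β_i kept in (0, 1) so the product behaves like a probability) that the product form (5) and the log form (6) agree.

```python
import numpy as np

beta = np.array([0.3, 0.5, 0.8, 0.2])  # hypothetical β_0, β_1, β_2, β_3
x = np.array([2.0, 1.0, 3.0])          # hypothetical event frequencies x_1, x_2, x_3

product_form = beta[0] * np.prod(beta[1:] ** x)       # equation (5)
log_linear = np.log(beta[0]) + x @ np.log(beta[1:])   # equation (6)

print(np.log(product_form), log_linear)  # identical up to floating point
```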

log(p) = βx log(p) = βx P(c x) 2.0 1.5 1.0 0.5 0.0 0.5 P = e βx 1.0 4 2 0 2 4 But curve has unbalanced shape: Fine granularity of response as P 0 Coarse response as P 1 x

Balanced in P
Want behaviour that is the same for high P and low P. This is provided by the log odds, or logit:

    logit(P) = log[ P / (1 - P) ]    (8)
    logit(1 - P) = -logit(P)    (9)
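
A quick numeric check (not part of the slides) of the logit and its symmetry property (9).

```python
import numpy as np

def logit(p):
    # Log odds: log(P / (1 - P)).
    return np.log(p / (1.0 - p))

p = np.array([0.01, 0.1, 0.5, 0.9, 0.99])
print(logit(p))        # approx [-4.60, -2.20, 0.0, 2.20, 4.60]
print(logit(1.0 - p))  # the same values with the sign flipped, as in (9)
```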

Logistic regression
Putting this together, we get:

    logit P(c|x) = log[ P(c|x) / (1 - P(c|x)) ] = β_0 + β_1 x_1 + ... + β_n x_n    (10)
    P(c|x) = 1 / (1 + e^{-(β_0 + β_1 x_1 + ... + β_n x_n)})    (11)

The expression on the rhs of (11) is known as the logistic function, so this is called logistic regression.
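
A minimal sketch (with hypothetical weights and feature scores) of computing P(c|x) from (11), and confirming that taking the logit recovers the linear predictor of (10).

```python
import numpy as np

beta0 = -1.0                        # hypothetical intercept
beta = np.array([0.8, -0.5, 2.0])   # hypothetical feature weights β_1..β_3
x = np.array([1.0, 3.0, 0.5])       # one document's feature scores x_1..x_3

z = beta0 + beta @ x                # linear predictor β_0 + Σ β_i x_i
p = 1.0 / (1.0 + np.exp(-z))        # equation (11)

print(p)                            # P(c|x), always strictly between 0 and 1
print(np.log(p / (1 - p)), z)       # logit(P(c|x)) equals the linear predictor, as in (10)
```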

Logistic function

    y = 1 / (1 + e^{-(β_0 + βx)})    (12)

[Figure: logistic curve of P(c|x) against x]

And, happily, the logistic function is a sigmoid (indeed, it is the archetypal sigmoid function)

Fitting the model

    Doc | Terms (X_d)                       | Class (y)
    1   | X_11  X_12  ...  X_1t  ...  X_1n  | 1
    2   | X_21  X_22  ...  X_2t  ...  X_2n  | 0
    ... |                                   |
    d   | X_d1  X_d2  ...  X_dt  ...  X_dn  | 0
    ... |                                   |
    m   | X_m1  X_m2  ...  X_mt  ...  X_mn  | 1

Training data: feature vectors X with labels y
Labels for binary classification: member, or non-member
Have to determine the vector β such that:

    P(y_d = 1 | X_d) = (1 + exp(-(β_0 + Σ_i β_i X_di)))^{-1}    (13)

best fits the data
Free to use any values for X_dt; length-normalized TF*IDF is one choice
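
A minimal sketch (not from the lecture) of fitting such a model with scikit-learn, using length-normalized TF*IDF features over a small hypothetical document list docs with binary labels y.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills now", "meeting agenda attached",
        "win money fast", "quarterly report draft"]   # hypothetical training documents
y = [1, 0, 1, 0]                                      # 1 = class member, 0 = non-member

vec = TfidfVectorizer(norm="l2")     # length-normalized TF*IDF feature scores X_dt
X = vec.fit_transform(docs)

clf = LogisticRegression()           # learns β by (regularized) maximum likelihood
clf.fit(X, y)

print(clf.predict_proba(X)[:, 1])    # P(y_d = 1 | X_d) for each training document
```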

Data and model

[Figure: binary observations (y = 0 or 1) with a fitted logistic curve, y / P(y|x) against x]

The data being fitted are binary
The fitted value is a probability, P(y_d = c | X_d)
We're fitting a curve of Bernoulli (one-event binomial) variables ... one that best fits the observed data

Maximum likelihood estimation
For weights β, the likelihood of the data X and labels y given that model is:

    L(β) = Π_{l : y_l = 1} P(X_l) · Π_{l : y_l = 0} [1 - P(X_l)]    (14)

For the logistic model:

    P(X_l) = 1 / (1 + e^{-(β_0 + Σ_i β_i X_li)})    (15)

We have to find the β that maximizes (14)
This is done by a computer, using iterative methods
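
A sketch (on synthetic data, not the lecture's) of maximizing (14) numerically: minimize the negative log-likelihood with an iterative optimizer.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic feature vectors X_l
true_beta = np.array([0.5, 1.5, -2.0, 0.7])    # [β_0, β_1, β_2, β_3]
p = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
y = rng.binomial(1, p)                         # synthetic binary labels y_l

def neg_log_likelihood(beta):
    z = beta[0] + X @ beta[1:]
    p = 1.0 / (1.0 + np.exp(-z))
    # Negative log of equation (14): Bernoulli log-likelihood summed over documents.
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

result = minimize(neg_log_likelihood, x0=np.zeros(4), method="BFGS")
print(result.x)   # estimated β; with enough data it lands near true_beta
```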

Logistic regression in practice

    Classifier | hotmail | trec-2005 | trec-2006
    NB         | 0.2479  | 0.8196    | 0.8017
    NB-IR      | 0.5561  | 0.9207    | 0.9521
    Log. Reg   | 0.4877  | 0.9461    | 0.9384
    SVM        | 0.4830  | 0.9477    | 0.9754

Table: Normalized AUC on spam filtering; from Kołcz and Yih, "Raising the Baseline for High-Precision Text Classifiers", KDD 2007.

NB-IR is NB with IR features (length-normalized TF*IDF)
Logistic regression for text classification is generally almost, but not quite, as good as SVM
(Note that on this task, NB with LN-TF*IDF does well ... and see the paper for variants that do even better)
On our GCAT 1000/1000 data, with length-normalized TF*IDF features, LR got accuracy 93%, F1 88%

Interpreting logistic regression: weights
β_i for term i gives the importance of that term in the model (but interpretation is subject to term dependencies)
For topic GCAT (Govt/Social), the highest-weight terms were:

    Positive            | Negative
    Term      Weight    | Term      Weight
    sunday     0.869    | shar      -0.951
    socc       0.643    | newsroom  -0.926
    minist     0.635    | trad      -0.669
    eu         0.629    | stock     -0.593
    saturday   0.599    | compan    -0.580
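
A sketch of pulling the highest-weight positive and negative terms out of a fitted model, assuming a recent scikit-learn and reusing the hypothetical fitting setup from the earlier sketch.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["cheap pills now", "meeting agenda attached",
        "win money fast", "quarterly report draft"]   # hypothetical training documents
y = [1, 0, 1, 0]

vec = TfidfVectorizer(norm="l2")
X = vec.fit_transform(docs)
clf = LogisticRegression().fit(X, y)

# Pair each vocabulary term with its learned weight β_i and rank by weight.
terms = vec.get_feature_names_out()
weights = clf.coef_[0]
order = np.argsort(weights)
print("Most negative:", [(terms[i], round(float(weights[i]), 3)) for i in order[:3]])
print("Most positive:", [(terms[i], round(float(weights[i]), 3)) for i in order[::-1][:3]])
```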

Interpreting logistic regression: probabilities
Logistic regression directly gives reasonable probabilities (given the constraints of the model)
For GCAT 1000/1000:

    Predicted P(c) range | % positive
    0.00 - 0.05          |  2.4%
    0.05 - 0.10          | 14.8%
    0.10 - 0.30          | 26.9%
    0.30 - 0.50          | 48.9%
    0.50 - 0.70          | 74.2%
    0.70 - 0.90          | 89.7%
    0.90 - 0.95          | 93.8%
    0.95 - 1.00          | 99.2%
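
A sketch (on hypothetical predicted probabilities p_pred and 0/1 labels y_true) of building such a table: bin the predictions and report the empirical positive rate in each bin.

```python
import numpy as np

rng = np.random.default_rng(1)
p_pred = rng.uniform(0, 1, size=2000)   # hypothetical predicted P(c|x) values
y_true = rng.binomial(1, p_pred)        # hypothetical labels, well calibrated by construction

bins = [0.0, 0.05, 0.10, 0.30, 0.50, 0.70, 0.90, 0.95, 1.0]
for lo, hi in zip(bins[:-1], bins[1:]):
    mask = (p_pred >= lo) & (p_pred < hi)
    if mask.any():
        print(f"{lo:.2f}-{hi:.2f}: {100 * y_true[mask].mean():.1f}% positive")
```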

Looking back and forward
Back:
Model as P(c|x) = f(β_1 x_1, ..., β_n x_n), where:
x_i is the feature score (differs for each document)
β_i is the feature weight (common across documents)
Learn the weights that best fit the training data
Free to use whatever values for x_i (e.g. normalized TF*IDF)
But probabilities are bounded between [0, 1]

Looking back and forward
Back:
The sigmoid function maps unbounded feature scores to bounded probabilities
Log odds give even treatment to high and low probabilities
The logistic model ties these together
Learn the weights β using maximum likelihood
Effectiveness almost, but not quite, as good as SVM
But gives us feature weights and reasonable probabilities

Looking back and forward
Forward:
Next lecture: advanced topics in classification, e.g. active learning
Later: topic modelling

Further reading
Kleinbaum and Klein, Logistic Regression, 3rd edn (2010) (a detailed, gradual introduction to logistic regression)
Hastie, Tibshirani, and Friedman, The Elements of Statistical Learning (2001) (a briefer, more technical description)