Logistic Regression. Some slides adapted from Dan Jurafsky and Brendan O'Connor.


Naïve Bayes Recap. Bag of words (order independent). Features are assumed independent given the class: $P(x_1, \dots, x_n \mid c) = P(x_1 \mid c) \cdots P(x_n \mid c)$. Q: Is this really true?

The problem with assuming conditional independence. Correlated features -> double counting of evidence. Parameters are estimated independently. This can hurt classifier accuracy and calibration.

Logistic Regression. A (log-)linear model, similar to Naïve Bayes. Doesn't assume features are independent. Correlated features don't double count.

What are Features? A feature function, f. Input: a document, D (a string). Output: a feature vector, X.

What are Features? $f(d) = \begin{pmatrix} \text{count}(\text{"boring"}) \\ \text{count}(\text{"not boring"}) \\ \text{length of document} \\ \text{author of document} \\ \vdots \end{pmatrix}$ Doesn't have to be just bag of words.

Feature Templates. Typically, feature templates are used to generate many features at once. For each word w: ${w}_count, ${w}_lowercase, ${w}_with_not_before_count.
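
To make the templates concrete, here is a minimal Python sketch of a feature function built from them; the whitespace tokenizer and the name extract_features are illustrative assumptions, not part of the slides.

from collections import Counter

def extract_features(document):
    """Map a document (a string) to a sparse feature vector (dict: feature name -> value)."""
    tokens = document.split()  # naive whitespace tokenization (assumption)
    features = Counter()
    for i, token in enumerate(tokens):
        w = token.lower()
        features[w + "_count"] += 1                        # ${w}_count
        features[w + "_lowercase"] = int(token.islower())  # ${w}_lowercase
        if i > 0 and tokens[i - 1].lower() == "not":
            features[w + "_with_not_before_count"] += 1    # ${w}_with_not_before_count
    features["length_of_document"] = len(tokens)           # a non-bag-of-words feature
    return dict(features)

print(extract_features("this movie was not boring"))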

Logistic Regression: Example. Compute features: $f(d_i) = x_i = \begin{pmatrix} \text{count}(\text{"nigerian"}) \\ \text{count}(\text{"prince"}) \\ \text{count}(\text{"nigerian prince"}) \end{pmatrix}$. Assume we are given some weights: $w = \begin{pmatrix} 1.0 \\ 1.0 \\ 4.0 \end{pmatrix}$.

Logistic Regression: Example. Compute features. We are given some weights. Compute the dot product: $z = \sum_i w_i x_i$.

Logistic Regression: Example. Compute the dot product: $z = \sum_i w_i x_i$. Compute the logistic function: $P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}$.

The Logistic function. $P(\text{spam} \mid x) = \frac{e^z}{e^z + 1} = \frac{1}{1 + e^{-z}}$
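
As a quick numeric check of the running example (a sketch: the feature counts below are made up, and the weights are the ones assumed above):

import math

x = [1.0, 1.0, 1.0]   # hypothetical counts of "nigerian", "prince", "nigerian prince"
w = [1.0, 1.0, 4.0]   # the weights assumed in the example

z = sum(w_j * x_j for w_j, x_j in zip(w, x))  # dot product
p_spam = 1.0 / (1.0 + math.exp(-z))           # logistic function: e^z / (e^z + 1)
print(z, p_spam)                              # z = 6.0, P(spam | x) ~ 0.998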

The Dot Product. $z = \sum_i w_i x_i$. Intuition: a weighted sum of features. All linear models have this form.

Naïve Bayes as a log-linear model Q: what are the features? Q: what are the weights?

Naïve Bayes as a Log-Linear Model.
$P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in D} P(w \mid \text{spam})$
$P(\text{spam} \mid D) \propto P(\text{spam}) \prod_{w \in \text{vocab}} P(w \mid \text{spam})^{x_i}$
$\log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{vocab}} x_i \log P(w \mid \text{spam})$

Naïve Bayes as a Log-Linear Model. $\log P(\text{spam} \mid D) \propto \log P(\text{spam}) + \sum_{w \in \text{vocab}} x_i \log P(w \mid \text{spam})$ In both Naïve Bayes and Logistic Regression we compute the dot product!

NB vs. LR. Both compute the dot product. NB: sum of log probabilities. LR: logistic function.

NB vs. LR: Parameter Learning. Naïve Bayes: learn conditional probabilities independently, by counting. Logistic Regression: learn weights jointly.

LR: Learning Weights Given: a set of feature vectors and labels Goal: learn the weights

Learning Weights. Documents as rows of a feature matrix, with one label per document:
$X = \begin{pmatrix} x_{11} & x_{12} & x_{13} & \cdots & x_{1n} \\ x_{21} & x_{22} & x_{23} & \cdots & x_{2n} \\ \vdots & & & \ddots & \vdots \\ x_{d1} & x_{d2} & x_{d3} & \cdots & x_{dn} \end{pmatrix}, \qquad y = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_d \end{pmatrix}$

Q: what parameters should we choose? What is the right value for the weights? Maximum Likelihood Principle: Pick the parameters that maximize the probability of the data

Maximum Likelihood Estimation.
$\hat{w}_{MLE} = \arg\max_w \log P(y_1, \dots, y_d \mid x_1, \dots, x_d; w)$
$= \arg\max_w \sum_i \log P(y_i \mid x_i; w)$
$= \arg\max_w \sum_i \log \begin{cases} p_i, & \text{if } y_i = 1 \\ 1 - p_i, & \text{if } y_i = 0 \end{cases}$
$= \arg\max_w \sum_i \log\, p_i^{I(y_i=1)} (1 - p_i)^{I(y_i=0)}$

Maximum Likelihood Estimation.
$= \arg\max_w \sum_i \log\, p_i^{I(y_i=1)} (1 - p_i)^{I(y_i=0)}$
$= \arg\max_w \sum_i y_i \log p_i + (1 - y_i) \log(1 - p_i)$
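
A minimal NumPy sketch of this objective (the function name and the clipping constant are assumptions added for illustration):

import numpy as np

def log_likelihood(w, X, y):
    """Sum over examples of y_i*log(p_i) + (1 - y_i)*log(1 - p_i), where p_i = sigmoid(w . x_i)."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # logistic function for every example
    p = np.clip(p, 1e-12, 1 - 1e-12)     # numerical safeguard against log(0)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1, 0])
print(log_likelihood(np.zeros(2), X, y))   # 2 * log(0.5) ~ -1.386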

Maximum Likelihood Estimation. Unfortunately, there is no closed-form solution (like there was with Naïve Bayes). Solution: iteratively climb the log-likelihood surface using the derivative with respect to each weight. Luckily, the derivatives turn out to be nice.

Gradient ascent. Loop while not converged: for all features j, compute and add the derivative: $w_j^{new} = w_j^{old} + \frac{\partial}{\partial w_j} L(w)$
$L(w)$: training-set log-likelihood
$\left(\frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \dots, \frac{\partial L}{\partial w_n}\right)$: gradient vector

Gradient ascent (figures: steps on the log-likelihood surface over weights $w_1$ and $w_2$).

LR Gradient. $\frac{\partial L}{\partial w_j} = \sum_i (y_i - p_i)\, x_{ij}$
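
Putting the gradient and the update rule together gives a short batch gradient-ascent trainer. This is a sketch under assumptions: the learning rate lr, the iteration count, and the toy data are made up for illustration.

import numpy as np

def train_lr(X, y, lr=0.1, iters=1000):
    """Batch gradient ascent on the log-likelihood. X is (d, n); y is (d,) with values in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))  # p_i = P(y_i = 1 | x_i)
        grad = X.T @ (y - p)                # dL/dw_j = sum_i (y_i - p_i) x_ij
        w += lr * grad                      # climb the log-likelihood surface
    return w

X = np.array([[3.0], [0.0], [2.0], [0.0]])  # toy feature: count of "nigerian prince"
y = np.array([1, 0, 1, 0])
print(train_lr(X, y))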

Logistic Regression: Pros and Cons. Doesn't assume conditional independence of features. Better-calibrated probabilities. Can handle highly correlated, overlapping features. NB is faster to train and less likely to overfit.

NB & LR. Both are linear models: $z = \sum_i w_i x_i$. Training is different. NB: weights are trained independently. LR: weights are trained jointly.

Perceptron Algorithm Very similar to logistic regression Not exactly computing gradients

Perceptron Algorithm. The algorithm is very similar to logistic regression, but it is not exactly computing gradients.

Initialize weight vector w = 0
Loop for K iterations:
    For all training examples x_i:
        if sign(w · x_i) != y_i:
            w += (y_i - sign(w · x_i)) * x_i
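
A runnable version of that pseudocode, assuming labels y_i in {-1, +1} and sign(0) treated as +1 (both conventions are left implicit on the slide):

import numpy as np

def perceptron(X, y, K=10):
    """Binary perceptron. X is (d, n); y is (d,) with values in {-1, +1}."""
    w = np.zeros(X.shape[1])                  # initialize weight vector w = 0
    for _ in range(K):                        # loop for K iterations
        for x_i, y_i in zip(X, y):            # for all training examples
            pred = 1 if x_i @ w >= 0 else -1  # sign(w . x_i), with sign(0) := +1
            if pred != y_i:
                w += (y_i - pred) * x_i       # the update from the slide (= 2 * y_i * x_i here)
    return w

X = np.array([[2.0, 1.0], [0.0, 3.0], [-1.0, -2.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))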

Differences between LR and the Perceptron. Online learning vs. batch. The perceptron doesn't always make updates.

MultiClass Classification. Q: what if we have more than 2 categories? Sentiment: positive, negative, neutral. Document topics: sports, politics, business, entertainment, ... Could train a separate logistic regression model for each category... It's pretty clear what to do with Naïve Bayes.

Log-Linear Models. $P(y \mid x) \propto e^{w \cdot f(d, y)}$, so $P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}$.

MultiClass Logistic Regression.
$P(y \mid x) \propto e^{w \cdot f(d, y)}$
$P(y \mid x) = \frac{1}{Z(w)} e^{w \cdot f(d, y)}$
$P(y \mid x) = \frac{e^{w \cdot f(d, y)}}{\sum_{y' \in Y} e^{w \cdot f(d, y')}}$

MultiClass Logistic Regression. Binary logistic regression: we have one weight vector that matches the size of the vocabulary. Multiclass: one weight vector for each category: $w_{pos}$, $w_{neg}$, $w_{neut}$. Can represent this in practice with one giant weight vector and repeated features for each category.

Q: How to compute posterior class probabilities for multiclass? $P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}$
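
A small sketch of that softmax computation; subtracting the maximum score before exponentiating is an added numerical-stability detail, not something on the slide.

import numpy as np

def class_posteriors(W, x):
    """P(y = j | x) = exp(w_j . x) / sum_k exp(w_k . x); W holds one weight row per class."""
    scores = W @ x                  # w_j . x for every class j
    scores = scores - scores.max()  # stability trick: shift so the largest score is 0
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[1.0, 0.0],   # w_pos
              [0.0, 1.0],   # w_neg
              [0.5, 0.5]])  # w_neut
x = np.array([2.0, 1.0])
print(class_posteriors(W, x))   # three probabilities that sum to 1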

Maximum Likelihood Estimation.
$\hat{w}_{MLE} = \arg\max_w \log P(y_1, \dots, y_n \mid x_1, \dots, x_n; w)$
$= \arg\max_w \sum_i \log P(y_i \mid x_i; w)$
$= \arg\max_w \sum_i \log \frac{e^{w \cdot f(x_i, y_i)}}{\sum_{y' \in Y} e^{w \cdot f(x_i, y')}}$

Multiclass LR Gradient. $\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} \sum_{y \in Y} f_j(y, d_i)\, P(y \mid d_i)$

MAP-based learning (perceptron). $\frac{\partial L}{\partial w_j} = \sum_{i=1}^{D} f_j(y_i, d_i) - \sum_{i=1}^{D} f_j\!\left(\arg\max_{y \in Y} P(y \mid d_i),\, d_i\right)$

Online Learning (perceptron) Rather than making a full pass through the data, compute gradient and update parameters after each training example.

MultiClass Perceptron Algorithm.

Initialize weight vector w = 0
Loop for K iterations:
    For all training examples x_i:
        y_pred = argmax_y w_y · x_i
        if y_pred != y_i:
            w_y_gold += x_i
            w_y_pred -= x_i
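
A runnable version of the multiclass pseudocode, with one weight vector per class stored as a row of a matrix (the toy data and tie-breaking by np.argmax are assumptions):

import numpy as np

def multiclass_perceptron(X, y, num_classes, K=10):
    """X is (d, n); y is (d,) with integer class ids in {0, ..., num_classes-1}."""
    W = np.zeros((num_classes, X.shape[1]))   # initialize all weight vectors to 0
    for _ in range(K):                        # loop for K iterations
        for x_i, y_i in zip(X, y):            # for all training examples
            y_pred = int(np.argmax(W @ x_i))  # y_pred = argmax_y w_y . x_i
            if y_pred != y_i:
                W[y_i] += x_i                 # w_y_gold += x_i
                W[y_pred] -= x_i              # w_y_pred -= x_i
    return W

X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y = np.array([0, 0, 1, 1])
print(multiclass_perceptron(X, y, num_classes=2))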

Q: what if there are only 2 categories? $P(y = j \mid x_i) = \frac{e^{w_j \cdot x_i}}{\sum_k e^{w_k \cdot x_i}}$

Q: what if there are only 2 categories? $P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x + w_1 \cdot x - w_1 \cdot x} + e^{w_1 \cdot x}}$

Q: what if there are only 2 categories? $P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_0 \cdot x - w_1 \cdot x}\, e^{w_1 \cdot x} + e^{w_1 \cdot x}}$

Q: what if there are only 2 categories? $P(y = 1 \mid x) = \frac{e^{w_1 \cdot x}}{e^{w_1 \cdot x}\left(e^{w_0 \cdot x - w_1 \cdot x} + 1\right)}$

Q: what if there are only 2 categories? $P(y = 1 \mid x) = \frac{1}{e^{w_0 \cdot x - w_1 \cdot x} + 1}$

Q: what if there are only 2 categories? $P(y = 1 \mid x) = \frac{1}{e^{w' \cdot x} + 1}$, where $w' = w_0 - w_1$: exactly the binary logistic regression form.

Regularization. Combating overfitting. Intuition: don't let the weights get very large.
$w_{MLE} = \arg\max_w \log P(y_1, \dots, y_d \mid x_1, \dots, x_d; w)$
$\arg\max_w \log P(y_1, \dots, y_d \mid x_1, \dots, x_d; w) - \sum_{i=1}^{V} w_i^2$
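
Under this penalty the earlier gradient just gains an extra term, -2*lam*w_j. A sketch, where the regularization strength lam (and its value) is an assumed hyperparameter not written explicitly on the slide:

import numpy as np

def train_lr_l2(X, y, lam=0.1, lr=0.1, iters=1000):
    """Gradient ascent on: log-likelihood - lam * sum_i w_i^2."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (y - p) - 2.0 * lam * w  # penalty term keeps weights from getting large
        w += lr * grad
    return w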

Regularization in the Perceptron Algorithm. Can't directly include regularization in the gradient. Alternatives: limiting the # of iterations, or parameter averaging.
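
A minimal sketch of parameter averaging for the binary perceptron; the exact bookkeeping (accumulating w after every example) is one common choice and an assumption here:

import numpy as np

def averaged_perceptron(X, y, K=10):
    """Binary averaged perceptron: return the average of w over every training step."""
    w = np.zeros(X.shape[1])
    w_sum = np.zeros_like(w)   # running sum of the weight vector
    steps = 0
    for _ in range(K):
        for x_i, y_i in zip(X, y):
            pred = 1 if x_i @ w >= 0 else -1
            if pred != y_i:
                w += (y_i - pred) * x_i
            w_sum += w         # accumulate after every example, update or not
            steps += 1
    return w_sum / steps       # averaging combats overfitting to late updates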