Logistic regression for conditional probability estimation

Instructor: Taylor Berg-Kirkpatrick. Slides: Sanjoy Dasgupta. Course website: http://cseweb.ucsd.edu/classes/wi19/cse151-b/

Uncertainty in prediction. Can we usually expect to get a perfect classifier if we have enough training data? Problem 1: Inherent uncertainty. The available features $x$ simply do not contain enough information to perfectly predict $y$. For example, $x$ = a heart patient's complete medical record, $y$ = will he/she survive another year?

Uncertainty in prediction, cont'd. Can we usually expect to get a perfect classifier if we have enough training data? Problem 2: Limitations of the model class. The type of classifier being used does not capture the decision boundary, e.g., using a linear classifier on data whose true decision boundary is not linear.

Conditional probability estimation for binary labels. Given: a data set of pairs $(x, y)$, where $x \in \mathbb{R}^d$ and $y \in \{-1, +1\}$. Return a classifier that also gives probabilities; that is, it gives $\Pr(y = 1 \mid x)$. Simplest case: using a linear function of $x$.

A linear model for conditional probability estimation. Classify and return probabilities using a linear function $w_0 + w_1 x_1 + w_2 x_2 + \cdots + w_d x_d$. The probability of $y = 1$: increases as the linear function grows, and is 50% when the linear function is zero. As before, add another feature $x_0 \equiv 1$ so we can forget about $w_0$; now the linear function is $w \cdot x$.
$w \cdot x = 0 \implies \Pr(y = 1 \mid x) = 1/2$.
$w \cdot x \to \infty \implies \Pr(y = 1 \mid x) \to 1$.
$w \cdot x \to -\infty \implies \Pr(y = 1 \mid x) \to 0$.
How can we convert $w \cdot x$ into a probability?

The logistic regression model: $\Pr(y = 1 \mid x) = \dfrac{1}{1 + e^{-w \cdot x}}$. The squashing function $1/(1 + e^{-z})$ maps any real number $z$ into the interval $(0, 1)$.
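
A minimal sketch of this model in Python with NumPy (the function names are illustrative, not from the slides):

    import numpy as np

    def squash(z):
        # the squashing function 1 / (1 + e^{-z}); maps any real z into (0, 1)
        return 1.0 / (1.0 + np.exp(-z))

    def prob_positive(w, x):
        # Pr(y = 1 | x) = 1 / (1 + e^{-w . x}) under the logistic regression model
        return squash(np.dot(w, x))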

The conditional probability function. For some vector $w$, we have
$$\Pr(y = 1 \mid x) = \frac{1}{1 + e^{-w \cdot x}}.$$
Therefore,
$$\Pr(y = -1 \mid x) = 1 - \Pr(y = 1 \mid x) = 1 - \frac{1}{1 + e^{-w \cdot x}} = \frac{e^{-w \cdot x}}{1 + e^{-w \cdot x}} = \frac{1}{1 + e^{w \cdot x}}.$$
So for $y \in \{-1, +1\}$ we can write
$$\Pr_w(y \mid x) = \frac{1}{1 + e^{-y (w \cdot x)}}.$$

Choosing w. The maximum-likelihood principle: given a data set $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \in \mathbb{R}^d \times \{-1, +1\}$, pick the $w \in \mathbb{R}^d$ that maximizes
$$\prod_{i=1}^{n} \Pr_w\!\left(y^{(i)} \mid x^{(i)}\right).$$
It is easier to work with sums, so take the negative log to get the loss function
$$L(w) = -\sum_{i=1}^{n} \ln \Pr_w\!\left(y^{(i)} \mid x^{(i)}\right) = \sum_{i=1}^{n} \ln\!\left(1 + e^{-y^{(i)} (w \cdot x^{(i)})}\right).$$
Our goal is to minimize $L(w)$.
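
A minimal sketch of computing this loss in Python with NumPy, assuming X is an n x d array of feature vectors and y a length-n array of +/-1 labels (names are illustrative):

    import numpy as np

    def loss(w, X, y):
        # L(w) = sum_i ln(1 + exp(-y_i (w . x_i)))
        margins = y * (X @ w)                        # y_i (w . x_i) for each training point
        return np.sum(np.logaddexp(0.0, -margins))   # log(1 + e^{-m}), computed stably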

Convexity. The bad news: there is no closed-form solution for $w$. The good news: $L(w)$ is convex in $w$. [Figure: plot of the convex loss $L(w)$ as a function of $w$.] How to find the minimum of a convex function? By local search.

Gradient descent procedure for logistic regression. Given $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \in \mathbb{R}^d \times \{-1, +1\}$, find
$$\arg\min_{w \in \mathbb{R}^d} L(w) = \sum_{i=1}^{n} \ln\!\left(1 + e^{-y^{(i)} (w \cdot x^{(i)})}\right).$$
Set $w_0 = 0$. For $t = 0, 1, 2, \ldots$, until convergence:
$$w_{t+1} = w_t + \eta_t \sum_{i=1}^{n} y^{(i)}\, x^{(i)} \underbrace{\Pr_{w_t}\!\left(-y^{(i)} \mid x^{(i)}\right)}_{\text{doubt}_t(x^{(i)},\, y^{(i)})},$$
where $\eta_t$ is a step size that is fixed beforehand, or (for instance) chosen by line search to minimize $L(w_{t+1})$.
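
A compact sketch of this procedure in Python with NumPy; the fixed step size and iteration count are illustrative choices, not part of the slides:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_gradient_descent(X, y, eta=0.1, num_iters=1000):
        # X: n x d array (with the constant feature already appended), y: array of +/-1 labels
        n, d = X.shape
        w = np.zeros(d)                          # start at w_0 = 0
        for t in range(num_iters):
            doubt = sigmoid(-y * (X @ w))        # Pr_{w_t}(-y_i | x_i), the model's "doubt" on point i
            w = w + eta * (X.T @ (y * doubt))    # w_{t+1} = w_t + eta * sum_i y_i x_i doubt_i
        return w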

Toy example [figures].

Sentiment data. Data set: sentences from reviews on Amazon, Yelp, and IMDB, each labeled as positive or negative. Examples:
"Needless to say, I wasted my money."
"He was very impressed when going from the original battery to the extended battery."
"I have to jiggle the plug to get it to line up right to get decent volume."
"Will order from them again!"
2500 training sentences, 500 test sentences.

Handling text data. Bag-of-words: a vectorial representation of text documents. Example document: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, it was the epoch of belief, it was the epoch of incredulity, it was the season of Light, it was the season of Darkness, it was the spring of hope, it was the winter of despair, we had everything before us, we had nothing before us, we were all going direct to Heaven, we were all going direct the other way -- in short, the period was so far like the present period, that some of its noisiest authorities insisted on its being received, for good or for evil, in the superlative degree of comparison only." [Figure: the passage represented as a vector of word counts, with coordinates shown for words such as despair, evil, happiness, foolishness.] Fix a vocabulary $V$. Treat each document as a vector of length $|V|$: $x = (x_1, x_2, \ldots, x_{|V|})$, where $x_i$ = the number of times the $i$th word appears in the document.
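
A minimal bag-of-words sketch in Python; the tiny vocabulary and whitespace tokenization are illustrative simplifications:

    import numpy as np

    def bag_of_words(document, vocabulary):
        # x_i = number of times the i-th vocabulary word appears in the document
        tokens = document.lower().split()
        return np.array([tokens.count(word) for word in vocabulary])

    vocab = ["despair", "evil", "happiness", "foolishness"]
    x = bag_of_words("it was the winter of despair", vocab)   # -> array([1, 0, 0, 0])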

A logistic regression approach. Code positive as $+1$ and negative as $-1$, and model
$$\Pr_w(y \mid x) = \frac{1}{1 + e^{-y (w \cdot x)}}.$$
Maximum-likelihood fit: given training data $(x^{(1)}, y^{(1)}), \ldots, (x^{(n)}, y^{(n)}) \in \mathbb{R}^d \times \{-1, +1\}$, find the $w$ that minimizes
$$L(w) = \sum_{i=1}^{n} \ln\!\left(1 + e^{-y^{(i)} (w \cdot x^{(i)})}\right).$$
This is a convex problem: many solution methods, including gradient descent, stochastic gradient descent, Newton-Raphson, quasi-Newton, etc., will converge to the optimal $w$.
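
As one possible route (a sketch using scikit-learn; the slides do not prescribe a particular library, and the data variable names here are hypothetical), the whole pipeline could look like:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression

    # train_sentences, train_labels, test_sentences, test_labels: the sentiment data,
    # with labels coded as +1 / -1 (hypothetical variable names)
    vectorizer = CountVectorizer()                  # bag-of-words features
    X_train = vectorizer.fit_transform(train_sentences)
    X_test = vectorizer.transform(test_sentences)

    clf = LogisticRegression()                      # note: scikit-learn adds L2 regularization by default
    clf.fit(X_train, train_labels)
    test_error = 1.0 - clf.score(X_test, test_labels)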

Local search in progress. Look at how the loss function $L(w)$ changes over the iterations of stochastic gradient descent. Final $w$: test error 0.21.

Margin and test error. Margin on a test point $x$: $\left|\Pr_w(y = 1 \mid x) - \tfrac{1}{2}\right|$.
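
A short sketch of computing this margin for a fitted $w$ (reusing the sigmoid form from the sketches above; the function name is illustrative):

    import numpy as np

    def margin(w, x):
        # margin on test point x: | Pr_w(y = 1 | x) - 1/2 |
        p = 1.0 / (1.0 + np.exp(-np.dot(w, x)))   # Pr_w(y = 1 | x)
        return abs(p - 0.5)

A larger margin means the model is more confident in its prediction on that point.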

Some of the mistakes (each sentence is shown with its label):
"Not much dialogue, not much music, the whole film was shot as elaborately and aesthetically like a sculpture." (+1)
"This film highlights the fundamental flaws of the legal process, that it's not about discovering guilt or innocence, but rather, is about who presents better in court." (+1)
"The last 15 minutes of movie are also not bad as well." (+1)
"You need two hands to operate the screen. This software interface is decade old and cannot compete with new software designs." (-1)
"If you plan to use this in a car forget about it." (-1)
"If you look for authentic Thai food, go else where." (-1)
"Waste your money on this game." (+1)

Interpreting the model.
Words with the most positive coefficients: sturdy, able, happy, disappoint, perfectly, remarkable, animation, recommendation, best, funny, restaurant, job, overly, cute, good, rocks, believable, brilliant, prompt, interesting, skimp, definitely, comfortable, amazing, tasty, wonderful, excellent, pleased, beautiful, fantastic, delicious, watch, soundtrack, predictable, nice, awesome, perfect, works, loved, enjoyed, love, great, happier, properly, liked, fun, screamy, masculine.
Words with the most negative coefficients: disappointment, sucked, poor, aren, not, doesn, worst, average, garbage, bit, looking, avoid, roasted, broke, starter, disappointing, dont, waste, figure, why, sucks, slow, none, directing, stupid, lazy, unrecommended, unreliable, missing, awful, mad, hours, dirty, didn, probably, lame, sorry, horrible, fails, unfortunately, barking, bad, return, issues, rating, started, then, nothing, fair, pay.