Logistic regression

Comments
Mini-review and feedback
- These are equivalent: x^T w = w^T x
- Clarification: this course is about getting you to be able to think as a machine learning expert
  - There has to be some confusion to start
  - A bit different than other courses, where you need to learn a topic x (e.g., calculus) without really needing to understand why you learn topic x
  - Here you are being trained to formulate a problem, solve it, and evaluate the solution, all at once
  - Keeping it all straight takes some time

Exercise: what is c(w)?
Recall when we did MLE (maximum likelihood) for the Poisson distribution, with parameter λ ∈ (0, ∞):
- We assumed p(x) was Poisson and learned λ:
  λ_ML = argmax_{λ ∈ (0, ∞)} p(D | λ)
- We can write this as a minimization, min_w c(w), by taking the objective function to be
  c(w) = -ln p(D | w)

Why is Poisson regression more complicated?
- Estimating λ for Poisson p(y) had a closed form
- Estimating Poisson p(y | x) no longer has a closed form! Why not?
Why do we focus on the Poisson distribution?
- Just a canonical example, not necessarily particularly important
Why do the variable names change? Previously we had λ and now we have w?
- λ is now a function of x, i.e., λ = exp(x^T w)
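To make the contrast concrete, here is a short derivation (not in the transcript, but following from the standard Poisson pmf p(y | λ) = λ^y e^{-λ} / y!):

\[
c(\lambda) = -\sum_{i=1}^n \ln p(y_i \mid \lambda)
           = \sum_{i=1}^n \left( \lambda - y_i \ln \lambda + \ln y_i! \right),
\qquad
\frac{dc}{d\lambda} = \sum_{i=1}^n \left( 1 - \frac{y_i}{\lambda} \right) = 0
\;\Rightarrow\;
\lambda_{\mathrm{ML}} = \frac{1}{n} \sum_{i=1}^n y_i .
\]

For Poisson regression, each example gets its own rate λ_i = exp(x_i^T w), and the corresponding stationarity condition

\[
\nabla_w c(w) = \sum_{i=1}^n \left( \exp(x_i^\top w) - y_i \right) x_i = 0
\]

is nonlinear in w, so there is no closed-form solution and we resort to iterative, gradient-based optimization.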

Poisson regression
p(y | x) = Poisson(y | λ), with λ = exp(x^T w)
Equivalently:
1. log(E[y | x]) = w^T x
2. p(y | x) = Poisson(λ)
For exponential families, the transfer f corresponds to the derivative of a, where
p(y | θ) = exp(θy - a(θ) + b(y))

Examples (with θ = x^T w)
- Gaussian distribution: a(θ) = θ²/2, f(θ) = θ
- Poisson distribution: a(θ) = exp(θ), f(θ) = exp(θ)
- Bernoulli distribution: a(θ) = ln(1 + exp(θ)), f(θ) = 1 / (1 + exp(-θ))
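These entries are consistent with the previous slide's remark that the transfer is the derivative of a; a one-line check for each case:

\[
a(\theta) = \tfrac{1}{2}\theta^2 \;\Rightarrow\; a'(\theta) = \theta,
\qquad
a(\theta) = e^{\theta} \;\Rightarrow\; a'(\theta) = e^{\theta},
\qquad
a(\theta) = \ln(1 + e^{\theta}) \;\Rightarrow\; a'(\theta) = \frac{e^{\theta}}{1 + e^{\theta}} = \frac{1}{1 + e^{-\theta}} .
\]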

Exercise: How do we extract the form for the exponential distribution?
The exponential distribution has density p(y | λ) = λ exp(-λy)
Recall the exponential family form, with transfer f(θ) = a'(θ):
p(y | θ) = exp(θy - a(θ) + b(y))
How do we write the exponential distribution this way? What is the transfer f?

What is c(w) for GLMs?
Still formulating an optimization problem to predict targets y given features x
The variables we learn are the weight vector w
What is c(w)? For MLE:
c(w) ∝ -ln p(D | w) ∝ -Σ_{i=1}^n ln p(y_i | x_i, w)
w_MLE = argmin_w c(w) = argmax_w p(D | w)
Can we add regularization? How? Add a prior, do MAP!
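To spell out the "add a prior" remark (one standard instantiation, not worked out in the transcript): with a zero-mean Gaussian prior on w, MAP estimation adds an ℓ2 penalty to the MLE objective, where the regularization parameter λ below is the prior precision (not the Poisson rate):

\[
w_{\mathrm{MAP}} = \arg\max_w \; p(w \mid \mathcal{D})
                 = \arg\min_w \left( -\sum_{i=1}^n \ln p(y_i \mid x_i, w) - \ln p(w) \right)
                 = \arg\min_w \left( c(w) + \frac{\lambda}{2} \lVert w \rVert_2^2 \right),
\]

up to constants that do not depend on w.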

Extra exercises
- Go through the derivation of c(w) for logistic regression
- Derive the maximum likelihood objective in Section 8.1.2

Benefits of GLMs
Gave a generic update rule, where you only need to know the transfer for your chosen distribution
- e.g., linear regression with transfer f = identity
- e.g., Poisson regression with transfer f = exp
- e.g., logistic regression with transfer f = sigmoid
We know the objective is convex in w!
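The generic update rule itself was given in class rather than in this transcript; the numpy sketch below shows one common way to write it, using the negative log-likelihood gradient X^T (f(Xw) - y) from the exponential-family form above (the function names, step size, and iteration count are illustrative assumptions):

import numpy as np

def fit_glm(X, y, transfer, step_size=0.01, n_steps=1000):
    """Batch gradient descent on the GLM negative log-likelihood.

    Only the transfer changes across distributions: the gradient is
    X^T (f(Xw) - y), averaged over the n examples.
    """
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        grad = X.T @ (transfer(X @ w) - y) / n
        w = w - step_size * grad
    return w

# Transfers from the examples slide:
identity = lambda z: z                          # linear regression
exp_transfer = np.exp                           # Poisson regression
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))    # logistic regression

# Usage, assuming X is an n x d feature matrix and y an n-vector of targets:
# w_linear = fit_glm(X, y, identity)
# w_poisson = fit_glm(X, y_counts, exp_transfer)
# w_logistic = fit_glm(X, y_binary, sigmoid)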

Convexity
Convexity of the negative log likelihood of (many) exponential families
- The negative log likelihood of many exponential families is convex, which is an important advantage of the maximum likelihood approach
Why is convexity important?
- e.g., why is (sigmoid(x^T w) - y)² not a good choice for binary classification?
- The Euclidean (squared) loss applied to the sigmoid results in a non-convex function
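A quick numerical illustration of that non-convexity claim (a sketch; the 1-D example with a single point x = 1, y = 0 is my choice, not from the slides). For a convex function the second finite difference would be non-negative everywhere, but here it changes sign:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squared_loss(w, x=1.0, y=0.0):
    # Euclidean (squared) loss on the sigmoid output, for a single example
    return (sigmoid(x * w) - y) ** 2

# Second finite difference approximates the curvature d^2/dw^2
h = 1e-4
for w in [-2.0, 0.0, 2.0]:
    curvature = (squared_loss(w + h) - 2 * squared_loss(w) + squared_loss(w - h)) / h**2
    print(f"w = {w:+.1f}: curvature approx {curvature:+.4f}")

# The curvature is positive at some w and negative at others,
# so the squared loss on the sigmoid is not convex in w.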

How can we check convexity?
- Can check the definition of convexity
- Can check the second derivative for scalar parameters (e.g., λ) and the Hessian for multidimensional parameters (e.g., w)
- e.g., for linear regression (least squares), the Hessian is H = X^T X, and so is clearly positive semi-definite
- e.g., for Poisson regression, the Hessian of the negative log-likelihood is H = X^T C X, and so is clearly positive semi-definite
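The slide does not define C; working out the Poisson-regression Hessian (a short derivation from the negative log-likelihood used earlier) shows it is the diagonal matrix of predicted rates, whose entries are positive, so H is positive semi-definite:

\[
c(w) = \sum_{i=1}^n \left( \exp(x_i^\top w) - y_i\, x_i^\top w + \ln y_i! \right),
\qquad
\nabla^2 c(w) = \sum_{i=1}^n \exp(x_i^\top w)\, x_i x_i^\top = X^\top C X,
\quad C_{ii} = \exp(x_i^\top w) > 0,
\]

and for any vector v, v^T X^T C X v = ||C^{1/2} X v||² ≥ 0.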

Logistic regression
1. logit(E[y | x]) = w^T x
2. p(y | x) = Bernoulli(α)
where logit(α) = ln(α / (1 - α)), y ∈ {0, 1}, and α ∈ (0, 1) is the parameter of the Bernoulli distribution.
The link is g = logit and the transfer is its inverse:
f(x^T w) = g^{-1}(x^T w) = sigmoid(x^T w) = E[y | x] = 1 / (1 + e^{-w^T x}) = p(y = 1 | x)
so that
p(y | x) = (1 / (1 + e^{-w^T x}))^y (1 - 1 / (1 + e^{-w^T x}))^{1-y}
for a dataset D = {(x_i, y_i)}, where x_i ∈ R^d and y_i ∈ {0, 1}.
[Figure: the sigmoid f(t) = 1 / (1 + e^{-t}) plotted as a function of t]

Prediction with logistic regression
So far, we have used the prediction f(x^T w)
- e.g., x^T w for linear regression, exp(x^T w) for Poisson regression
For binary classification, we want to output 0 or 1, rather than the probability value p(y = 1 | x) = sigmoid(x^T w)
The sigmoid maps few values of x^T w close to 0.5: values somewhat larger than 0 are mapped close to 1, and values somewhat smaller than 0 are mapped close to 0
Decision threshold:
- sigmoid(x^T w) < 0.5 is class 0
- sigmoid(x^T w) > 0.5 is class 1
[Figure: the sigmoid f(t) = 1 / (1 + e^{-t}) plotted as a function of t]
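A minimal sketch of this thresholding rule in numpy (the variable names are assumptions, and the measure-zero boundary case sigmoid(x^T w) = 0.5 is assigned to class 1 by convention):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(X, w, threshold=0.5):
    """Hard 0/1 class labels from logistic-regression probabilities."""
    probs = sigmoid(X @ w)              # p(y = 1 | x) for each row of X
    return (probs >= threshold).astype(int)

# Note: sigmoid(X @ w) >= 0.5 exactly when X @ w >= 0, so this is the same
# as checking which side of the hyperplane w^T x = 0 each point falls on.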

Logistic regression is a linear classifier
The hyperplane w^T x = 0 separates the two classes:
- P(y = 1 | x, w) > 0.5 only when w^T x > 0
- P(y = 0 | x, w) > 0.5 only when P(y = 1 | x, w) < 0.5, which happens when w^T x < 0
[Figure: the sigmoid f(t) = 1 / (1 + e^{-t}), and a 2-D dataset of positive and negative examples (x_i, y_i) separated by the line w_0 + w_1 x_1 + w_2 x_2 = 0]

Logistic regression versus linear regression
Why might one be better than the other? They both use a linear approach
- Linear regression could still learn <x, w> to predict E[Y | x]
Demo: logistic regression performs better under outliers, when the outlier is still on the correct side of the line
Conclusion: logistic regression better reflects the goals of predicting p(y = 1 | x) and finding a separating hyperplane
- Linear regression assumes E[Y | x] is a linear function of x!
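The demo itself is not included in the transcript; the numpy sketch below reproduces the usual version of this comparison under made-up 1-D data (the cluster locations, the outlier at x = 100, and the optimizer settings are all illustrative assumptions): a far-away but correctly labelled point drags the least-squares fit so its 0.5 crossing moves into the positive class, while the logistic-regression boundary stays near the class overlap.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D data: class 0 near x = -2, class 1 near x = +2,
# plus one extreme outlier at x = 100 that is still labelled correctly (class 1).
x = np.concatenate([rng.normal(-2, 1, 20), rng.normal(2, 1, 20), [100.0]])
y = np.concatenate([np.zeros(20), np.ones(20), [1.0]])
X = np.column_stack([np.ones_like(x), x])      # bias column plus the feature

# Linear regression: closed-form least-squares fit of E[y | x]
w_lin = np.linalg.lstsq(X, y, rcond=None)[0]

# Logistic regression: gradient descent on the negative log-likelihood
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
w_log = np.zeros(2)
for _ in range(5000):
    w_log = w_log - 0.05 * X.T @ (sigmoid(X @ w_log) - y) / len(y)

# Decision boundary: the input where the predicted value crosses 0.5
boundary_lin = (0.5 - w_lin[0]) / w_lin[1]
boundary_log = -w_log[0] / w_log[1]
print("linear regression boundary:  ", boundary_lin)
print("logistic regression boundary:", boundary_log)
# The outlier pulls the linear-regression boundary toward the positive cluster,
# misclassifying some class-1 points, while the logistic boundary stays near 0.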

Whiteboard
- Logistic regression maximum likelihood, with weightings on samples
- Optimization strategy
- Issues with minimizing Euclidean distance for the sigmoid
- Multinomial logistic regression
Next class: the generative approach, naive Bayes