Logistic Regression
Machine Learning 10701/15781
Carlos Guestrin, Carnegie Mellon University
September 24th, 2007

Generative v. Discriminative classifiers: Intuition
Want to learn h: X → Y
- X: features
- Y: target classes
Bayes optimal classifier: P(Y|X)

Generative classifier, e.g., Naïve Bayes:
- Assume some functional form for P(X|Y), P(Y)
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y|X = x)
- This is a generative model: it computes P(Y|X) indirectly, through Bayes rule, but it can generate a sample of the data, since P(X) = Σ_y P(y) P(X|y)

Discriminative classifiers, e.g., Logistic Regression:
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data
- This is the discriminative model: it learns P(Y|X) directly, but cannot generate a sample of the data, because P(X) is not available
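For concreteness, the Bayes-rule step referred to above can be written out; this is the standard identity, not something specific to these slides:

```latex
P(Y = y \mid X = x)
  = \frac{P(X = x \mid Y = y)\, P(Y = y)}
         {\sum_{y'} P(X = x \mid Y = y')\, P(Y = y')} .
```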

Logistic Regression
Logistic function (or sigmoid): learn P(Y|X) directly!
- Assume a particular functional form: a sigmoid applied to a linear function of the data,
  P(Y=1|X) = σ(z) with z = w_0 + Σ_i w_i X_i and σ(z) = 1 / (1 + exp(-z)).
- Features can be discrete or continuous!

Understanding the sigmoid
[Three plots of σ(w_0 + w_1 x) over x in [-6, 6], for (w_0 = -2, w_1 = -1), (w_0 = 0, w_1 = -1), and (w_0 = 0, w_1 = -0.5).]
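To make the three panels concrete, here is a minimal sketch (not from the lecture) that evaluates the logistic function at those weight settings; numpy is the only assumed dependency.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-6, 6, 7)             # the x-range shown in the plots
for w0, w1 in [(-2, -1), (0, -1), (0, -0.5)]:
    p = sigmoid(w0 + w1 * x)          # P(Y=1 | x) under this weight setting
    print(f"w0={w0}, w1={w1}:", np.round(p, 2))
```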

Logistic Regression: a linear classifier
[Plot of the sigmoid σ(z) over z in [-6, 6].]

Very convenient!
- P(Y=1|X) > P(Y=0|X)
- iff P(Y=1|X) / P(Y=0|X) > 1
- iff w_0 + Σ_i w_i X_i > 0
This implies a linear classification rule: the decision boundary is the hyperplane w_0 + Σ_i w_i X_i = 0. (See the log-odds calculation below.)
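The same point written as a log-odds identity, which holds under the sigmoid form assumed above (a standard derivation, not copied from the slide):

```latex
\ln \frac{P(Y=1 \mid X)}{P(Y=0 \mid X)}
  = \ln \frac{\sigma(w_0 + \sum_i w_i X_i)}{1 - \sigma(w_0 + \sum_i w_i X_i)}
  = w_0 + \sum_i w_i X_i ,
```

so choosing the more probable class amounts to checking the sign of a linear function of X.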

What if we have continuous X_i?
E.g., character recognition: X_i is the i-th pixel.
Gaussian Naïve Bayes (GNB): model each class-conditional feature distribution as a Gaussian,
  P(X_i = x | Y = y_k) = N(x; μ_ik, σ_ik).
Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Example: GNB for classifying mental states [Mitchell et al.]
- fMRI: ~1 mm resolution, ~2 images per sec., 15,000 voxels/image
- non-invasive, safe
- measures the Blood Oxygen Level Dependent (BOLD) response
[Plot: typical impulse response, on the order of 10 sec.]
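A minimal sketch of GNB estimation and prediction for continuous features, for illustration only; the function names, the binary-label assumption, and the per-class variances are choices made here, not taken from the slides.

```python
import numpy as np

def fit_gnb(X, y):
    """Estimate prior, per-feature mean, and per-feature variance for each class.
    X: (m, n) array of continuous features; y: (m,) array of 0/1 labels."""
    params = {}
    for k in (0, 1):
        Xk = X[y == k]
        params[k] = (len(Xk) / len(X),        # prior P(Y = k)
                     Xk.mean(axis=0),          # mu_ik for each feature i
                     Xk.var(axis=0) + 1e-9)    # sigma_ik^2 (small floor for stability)
    return params

def predict_gnb(params, x):
    """Pick argmax_k of log P(Y=k) + sum_i log N(x_i; mu_ik, sigma_ik^2)."""
    scores = {}
    for k, (prior, mu, var) in params.items():
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        scores[k] = np.log(prior) + log_lik
    return max(scores, key=scores.get)
```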

Learned Bayes models: means for P(BrainActivity | WordCategory) [Mitchell et al.]
- Pairwise classification accuracy: 85%
[Images: learned mean activation for "People" words vs. "Animal" words.]

Logistic regression v. Naïve Bayes
Consider learning f: X → Y, where
- X is a vector of real-valued features <X_1 ... X_n>
- Y is boolean
Could use a Gaussian Naïve Bayes classifier:
- assume all X_i are conditionally independent given Y
- model P(X_i | Y = y_k) as Gaussian N(μ_ik, σ_i)
- model P(Y) as Bernoulli(θ, 1-θ)
What does that imply about the form of P(Y|X)? Cool!!!!

Derive form for P(Y|X) for continuous X_i
Ratio of class-conditional probabilities
[Equation slides: starting from Bayes rule, the ratio of class-conditional probabilities is used to derive the form of P(Y|X); a sketch of the standard argument follows.]
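A sketch of that standard derivation, reconstructed here for GNB with class-independent variances σ_i rather than copied from the slides:

```latex
P(Y=1 \mid X)
  = \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)}
  = \frac{1}{1 + \exp\!\Big(\ln\tfrac{P(Y=0)}{P(Y=1)}
        + \sum_i \ln\tfrac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)},
\qquad
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i
    + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2} .
```

Each term in the sum is linear in X_i, so P(Y=1|X) collapses to exactly the logistic form σ(w_0 + Σ_i w_i X_i) for appropriate weights.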

Derive form for P(Y|X) for continuous X_i (continued)

Gaussian Naïve Bayes v. Logistic Regression
- The set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) maps to a set of Logistic Regression parameters.
- Representation equivalence, but only in a special case!!! (GNB with class-independent variances)
- But what's the difference??? LR makes no assumptions about P(X|Y) in learning!!!
- Loss function!!! They optimize different functions and obtain different solutions.

Logistic regression for more than 2 classes
In the more general case, where Y ∈ {Y_1, ..., Y_R}: learn R-1 sets of weights.

Logistic regression more generally
For Y ∈ {Y_1, ..., Y_R}, learn R-1 sets of weights:
- for k < R:  P(Y = y_k | X) ∝ exp(w_k0 + Σ_i w_ki X_i)
- for k = R:  P(Y = y_R | X) ∝ 1 (normalization, so no weights for this class)
Features can be discrete or continuous!
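Written out with the normalization made explicit (the standard form these two slides refer to, reconstructed since the equations did not survive):

```latex
P(Y = y_k \mid X) = \frac{\exp\big(w_{k0} + \sum_i w_{ki} X_i\big)}
                         {1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_i w_{ji} X_i\big)}
  \quad \text{for } k < R,
\qquad
P(Y = y_R \mid X) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_i w_{ji} X_i\big)} .
```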

Loss functions: Likelihood v. Conditional Likelihood
- Generative (Naïve Bayes) loss function: the data likelihood, ln P(D | w) = Σ_j ln P(x^j, y^j | w).
- Discriminative models cannot compute P(x^j | w)!
- But the discriminative (logistic regression) loss function is the conditional data likelihood, Σ_j ln P(y^j | x^j, w).
- It doesn't waste effort learning P(X); it focuses on P(Y|X), which is all that matters for classification.

Expressing Conditional Log Likelihood
(The expansion of l(w) is sketched below.)
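For the binary case, with y^j ∈ {0, 1} and P(Y=1|x, w) = σ(w_0 + Σ_i w_i x_i) as above, the conditional log likelihood expands as follows (a reconstruction of the standard expression, not a verbatim copy of the slide):

```latex
l(w) = \sum_j \ln P(y^j \mid x^j, w)
     = \sum_j \Big[ y^j \ln P(Y=1 \mid x^j, w) + (1 - y^j) \ln P(Y=0 \mid x^j, w) \Big]
     = \sum_j \Big[ y^j \big(w_0 + \textstyle\sum_i w_i x_i^j\big)
            - \ln\big(1 + \exp(w_0 + \textstyle\sum_i w_i x_i^j)\big) \Big] .
```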

Maximizing Conditional Log Likelihood
- Good news: l(w) is a concave function of w, so there are no locally optimal (non-global) solutions.
- Bad news: there is no closed-form solution that maximizes l(w).
- Good news: concave functions are easy to optimize.

Optimizing a concave function: gradient ascent
- The conditional likelihood for logistic regression is concave! Find the optimum with gradient ascent.
- Gradient: ∇_w l(w) = [∂l(w)/∂w_0, ..., ∂l(w)/∂w_n]
- Update rule, with learning rate η > 0:  w^(t+1) = w^(t) + η ∇_w l(w^(t))
- Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Maximize Conditional Log Likelihood: gradient ascent
(Differentiating l(w) with respect to each weight gives the per-weight updates on the next slide.)

Gradient Ascent for LR
Gradient ascent algorithm: iterate until the change is < ε
- w_0 ← w_0 + η Σ_j [y^j - P(Y=1 | x^j, w)]
- for i = 1 ... n:  w_i ← w_i + η Σ_j x_i^j [y^j - P(Y=1 | x^j, w)]
- repeat
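A compact, runnable sketch of these updates, vectorized over the training set; the step size, tolerance, and array shapes are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression_ga(X, y, eta=0.1, eps=1e-6, max_iters=10_000):
    """Maximize the conditional log likelihood by gradient ascent.
    X: (m, n) features; y: (m,) labels in {0, 1}. Returns (w0, w)."""
    m, n = X.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)           # P(Y=1 | x^j, w) for every example j
        err = y - p                       # y^j - P(Y=1 | x^j, w)
        g0, g = err.sum(), X.T @ err      # gradient w.r.t. w0 and each w_i
        w0, w = w0 + eta * g0, w + eta * g
        if eta * max(abs(g0), np.abs(g).max()) < eps:   # change < epsilon
            break
    return w0, w
```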

That's all M(C)LE. How about MAP?
- One common approach is to define priors on w: a Normal distribution with zero mean and identity covariance.
- This pushes the parameters towards zero and corresponds to regularization.
- Helps avoid very large weights and overfitting. More on this later in the semester.
- MAP estimate: w* = arg max_w [ln P(w) + Σ_j ln P(y^j | x^j, w)]

M(C)AP as Regularization
- A zero-mean Gaussian prior adds a quadratic penalty proportional to Σ_i w_i² to the objective: it penalizes high weights, and the same idea is also applicable in linear regression.

Gradient of M(C)AP
(The regularized gradient is sketched below.)

MLE vs MAP
- Maximum conditional likelihood estimate: w* = arg max_w Σ_j ln P(y^j | x^j, w)
- Maximum conditional a posteriori estimate: w* = arg max_w [ln P(w) + Σ_j ln P(y^j | x^j, w)]
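The regularized gradient referred to above, written under the zero-mean Gaussian prior assumption, with λ denoting the prior precision; this is the standard form rather than a verbatim copy of the slide:

```latex
\frac{\partial}{\partial w_i}\Big[ \ln P(w) + \sum_j \ln P(y^j \mid x^j, w) \Big]
  = -\lambda\, w_i + \sum_j x_i^j \big[ y^j - P(Y=1 \mid x^j, w) \big] .
```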

Naïve Bayes vs Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.
- Number of parameters: NB: 4n + 1; LR: n + 1.
- Estimation method: NB parameter estimates are uncoupled; LR parameter estimates are coupled.

G. Naïve Bayes vs. Logistic Regression (1) [Ng & Jordan, 2002]
Generative and discriminative classifiers: asymptotic comparison (# training examples → infinity)
- When the model is correct, GNB and LR produce identical classifiers.
- When the model is incorrect, LR is less biased (it does not assume conditional independence), so LR is expected to outperform GNB.

G. Naïve Bayes vs. Logistic Regression (2) [Ng & Jordan, 2002]
Non-asymptotic analysis: convergence rate of the parameter estimates, with n = # of attributes in X.
Size of training data needed to get close to the infinite-data solution:
- GNB needs O(log n) samples
- LR needs O(n) samples
- GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.

Naïve Bayes vs. Logistic Regression: some experiments from UCI data sets
[Plots comparing Naïve Bayes and Logistic Regression on UCI data sets.]

What you should know about Logistic Regression (LR)
- Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
- In general, NB and LR make different assumptions. NB: features independent given the class, an assumption on P(X|Y). LR: a functional form for P(Y|X), with no assumption on P(X|Y).
- LR is a linear classifier: the decision rule is a hyperplane.
- LR is optimized by conditional likelihood: no closed-form solution, but the objective is concave, so gradient ascent reaches the global optimum.
- Maximum conditional a posteriori estimation corresponds to regularization.
- Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.