Machine Learning. Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University. September 20, 2012

Machine Learning 10-601
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
September 20, 2012

Today:
  Logistic regression
  Generative/Discriminative classifiers

Readings (see class website):
  Required: Mitchell, Naïve Bayes and Logistic Regression
  Optional: Ng & Jordan

Logistic Regression
Idea: Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y). Why not learn P(Y|X) directly?

Consider learning f: X → Y, where
  X is a vector of real-valued features <X_1 ... X_n>
  Y is boolean
  assume all X_i are conditionally independent given Y
  model P(X_i | Y = y_k) as Gaussian N(µ_ik, σ_i)
  model P(Y) as Bernoulli(π)
What does that imply about the form of P(Y|X)?
Derive the form of P(Y|X) for continuous X_i.

Very convenient! The derivation gives
  P(Y=1 | X=<X_1,...,X_n>) = 1 / (1 + exp(w_0 + Σ_i w_i X_i))
which implies
  P(Y=0 | X=<X_1,...,X_n>) = exp(w_0 + Σ_i w_i X_i) / (1 + exp(w_0 + Σ_i w_i X_i))
which implies
  P(Y=0|X) / P(Y=1|X) = exp(w_0 + Σ_i w_i X_i)
which implies
  ln [ P(Y=0|X) / P(Y=1|X) ] = w_0 + Σ_i w_i X_i
that is, a linear classification rule!
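To make this concrete, here is a small numerical sketch (not part of the slides): it fixes some made-up GNB parameters for two features, computes P(Y=1|x) once the generative way (Bayes rule over the Gaussian class-conditionals) and once through the derived logistic form with w_i = (µ_i0 - µ_i1)/σ_i² and w_0 = ln((1-π)/π) + Σ_i (µ_i1² - µ_i0²)/(2σ_i²), and checks that the two agree. The parameter values and the use of numpy are illustrative assumptions.

    import numpy as np

    # Made-up GNB parameters for n = 2 features (illustration only)
    pi = 0.3                                   # P(Y = 1)
    mu = {0: np.array([0.0, 1.0]),             # mu_i0
          1: np.array([2.0, -1.0])}            # mu_i1
    sigma = np.array([1.5, 0.8])               # sigma_i, shared by both classes (assumption 2)

    def gnb_posterior(x):
        """P(Y=1 | x) the generative way: Bayes rule over Gaussian class-conditionals."""
        def class_log_lik(k):
            # Per-feature Gaussian log densities; constants shared by both classes cancel.
            return np.sum(-0.5 * ((x - mu[k]) / sigma) ** 2)
        log_num = np.log(pi) + class_log_lik(1)
        log_den = np.logaddexp(log_num, np.log(1 - pi) + class_log_lik(0))
        return np.exp(log_num - log_den)

    # Weights implied by the derivation above
    w = (mu[0] - mu[1]) / sigma ** 2
    w0 = np.log((1 - pi) / pi) + np.sum((mu[1] ** 2 - mu[0] ** 2) / (2 * sigma ** 2))

    def logistic_posterior(x):
        """P(Y=1 | x) in the derived form 1 / (1 + exp(w0 + sum_i w_i x_i))."""
        return 1.0 / (1.0 + np.exp(w0 + w @ x))

    x = np.array([0.7, -0.2])
    print(gnb_posterior(x), logistic_posterior(x))   # the two numbers agree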

Logistic function (the slide plots σ(z) = 1/(1 + e^(-z)), which rises smoothly from 0 toward 1).

Logistic regression more generally
Logistic regression when Y is not boolean (but still discrete-valued). Now y ∈ {y_1 ... y_R}: learn R-1 sets of weights.
  for k < R:  P(Y = y_k | X) = exp(w_k0 + Σ_i w_ki X_i) / (1 + Σ_{j=1}^{R-1} exp(w_j0 + Σ_i w_ji X_i))
  for k = R:  P(Y = y_R | X) = 1 / (1 + Σ_{j=1}^{R-1} exp(w_j0 + Σ_i w_ji X_i))
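A small sketch of this multiclass form (not from the slides), assuming numpy and illustrative parameter names: W holds the R-1 weight vectors and b the corresponding intercepts w_k0.

    import numpy as np

    def multiclass_lr_probs(x, W, b):
        """P(Y = y_k | x) for k = 1..R, with class R as the reference class.

        W: (R-1, n) weight matrix, b: (R-1,) intercepts -- illustrative names.
        """
        scores = np.exp(W @ x + b)              # exp(w_k0 + sum_i w_ki x_i) for k < R
        denom = 1.0 + scores.sum()
        return np.append(scores, 1.0) / denom   # last entry is class R; entries sum to 1

    # Example with R = 3 classes and n = 2 features (made-up numbers)
    W = np.array([[0.5, -1.0],
                  [2.0,  0.3]])
    b = np.array([0.1, -0.4])
    print(multiclass_lr_probs(np.array([1.0, 2.0]), W, b))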

Training Logistic Regression: MCLE
We have L training examples: training data D = {<X^1, Y^1>, ..., <X^L, Y^L>}.
Choose parameters W = <w_0, ..., w_n> to maximize the conditional likelihood of the training data, where
  Data likelihood = Π_l P(X^l, Y^l | W)  (maximized by the maximum likelihood estimate for W)
  Data conditional likelihood = Π_l P(Y^l | X^l, W)  (maximized by the maximum conditional likelihood estimate)

Expressing Conditional Log Likelihood
  l(W) = ln Π_l P(Y^l | X^l, W) = Σ_l ln P(Y^l | X^l, W)
       = Σ_l [ Y^l ln P(Y^l = 1 | X^l, W) + (1 - Y^l) ln P(Y^l = 0 | X^l, W) ]
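As a small illustration (not on the slides), l(W) can be computed directly from this definition. The sketch below assumes numpy, binary labels in {0, 1}, and the common convention P(Y=1|x) = σ(w_0 + w·x); to match the form on the earlier slide, P(Y=1|x) = 1/(1 + exp(w_0 + w·x)), simply negate w_0 and w.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def conditional_log_likelihood(X, Y, w, w0):
        """l(W) = sum_l ln P(Y^l | X^l, W) for binary labels Y in {0, 1}."""
        p1 = sigmoid(w0 + X @ w)              # P(Y^l = 1 | X^l, W) for every example
        return np.sum(Y * np.log(p1) + (1 - Y) * np.log(1 - p1))

    # Tiny made-up example: 3 training examples, 2 features
    X = np.array([[0.5, 1.2], [-1.0, 0.3], [2.0, -0.7]])
    Y = np.array([1, 0, 1])
    print(conditional_log_likelihood(X, Y, w=np.array([0.4, -0.1]), w0=0.2))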

Maximizing Conditional Log Likelihood
Good news: l(W) is a concave function of W.
Bad news: there is no closed-form solution that maximizes l(W).

Gradient Descent
Batch gradient: use the error over the entire training set D.
  Do until satisfied:
    1. Compute the gradient over D
    2. Update the vector of parameters
Stochastic gradient: use the error over single examples.
  Do until satisfied:
    1. Choose (with replacement) a random training example
    2. Compute the gradient just for that example
    3. Update the vector of parameters
Stochastic approximates Batch arbitrarily closely as the step size η → 0.
Stochastic can be much faster when D is very large.
Intermediate approach: use the error over subsets (mini-batches) of D.
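A minimal sketch of both schemes applied to the logistic-regression conditional likelihood (not from the slides). It assumes numpy, labels Y in {0, 1}, the convention P(Y=1|x) = σ(w_0 + w·x), and made-up step sizes and iteration counts in place of a real convergence test.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def batch_gradient_ascent(X, Y, eta=0.1, n_iters=1000):
        """Batch MCLE: each step uses the gradient over the entire training set D."""
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])      # column of 1s for the intercept w_0
        w = np.zeros(d + 1)
        for _ in range(n_iters):
            err = Y - sigmoid(Xb @ w)             # (Y^l - estimated P(Y^l = 1 | X^l, W))
            w += eta * Xb.T @ err                 # w_i <- w_i + eta * sum_l X_i^l * err^l
        return w

    def stochastic_gradient_ascent(X, Y, eta=0.01, n_iters=10000, seed=0):
        """Stochastic MCLE: each step uses a single randomly chosen example."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])
        w = np.zeros(d + 1)
        for _ in range(n_iters):
            l = rng.integers(n)                   # choose (with replacement) one example
            err = Y[l] - sigmoid(Xb[l] @ w)
            w += eta * err * Xb[l]
        return w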

Maximize Conditional Log Likelihood: Gradient Ascent
Gradient ascent algorithm: iterate until the change is < ε.
  For all i, repeat:  w_i ← w_i + η Σ_l X_i^l ( Y^l - P(Y^l = 1 | X^l, W) ), where P(Y^l = 1 | X^l, W) is the model's current estimate.

That's all for M(C)LE. How about MAP?
  One common approach is to define priors on W: a Normal distribution with zero mean and identity covariance.
  This helps avoid very large weights and overfitting.
MAP estimate: let's assume a Gaussian prior W ~ N(0, σI).

MLE vs MAP
  Maximum conditional likelihood estimate:  W_MCLE = argmax_W Σ_l ln P(Y^l | X^l, W)
  Maximum a posteriori estimate with prior W ~ N(0, σI):  W_MAP = argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]

MAP estimates and Regularization
  With the prior W ~ N(0, σI), ln P(W) contributes -(1/(2σ²)) Σ_i w_i² (up to a constant), so
    W_MAP = argmax_W [ Σ_l ln P(Y^l | X^l, W) - (1/(2σ²)) Σ_i w_i² ]
  The quadratic penalty is called a regularization term:
    it helps reduce overfitting, especially when training data is sparse
    it keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or to whatever the prior suggests
    it is used very frequently in Logistic Regression
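A hedged sketch of the corresponding MAP training step, continuing the conventions of the earlier sketch; the prior strength lam (playing the role of 1/σ²), the step size, and the choice not to regularize the intercept are illustrative assumptions, not something the slides specify.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def map_gradient_ascent(X, Y, lam=0.1, eta=0.01, n_iters=1000):
        """MAP training with a zero-mean Gaussian prior on the weights.

        The prior adds -(lam/2) * sum_i w_i^2 to the conditional log likelihood,
        so each ascent step subtracts lam * w_i from the MCLE gradient.
        Convention: P(Y=1|x) = sigmoid(w0 + w.x); labels Y in {0, 1}.
        """
        n, d = X.shape
        Xb = np.hstack([np.ones((n, 1)), X])   # column of 1s for the intercept w_0
        w = np.zeros(d + 1)
        for _ in range(n_iters):
            err = Y - sigmoid(Xb @ w)
            grad = Xb.T @ err
            grad[1:] -= lam * w[1:]            # regularize the weights, not the intercept
            w += eta * grad
        return w

Setting lam = 0 recovers the plain MCLE update.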

The Bottom Line
Consider learning f: X → Y, where
  X is a vector of real-valued features <X_1 ... X_n>
  Y is boolean
  assume all X_i are conditionally independent given Y
  model P(X_i | Y = y_k) as Gaussian N(µ_ik, σ_i)
  model P(Y) as Bernoulli(π)
Then P(Y|X) is of the logistic form above, and we can directly estimate W.
Furthermore, the same holds if the X_i are boolean; try proving that to yourself.

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: X → Y, or P(Y|X).
Generative classifiers (e.g., Naïve Bayes):
  Assume some functional form for P(X|Y), P(Y)
  Estimate the parameters of P(X|Y), P(Y) directly from training data
  Use Bayes rule to calculate P(Y | X = x_i)
Discriminative classifiers (e.g., Logistic Regression):
  Assume some functional form for P(Y|X)
  Estimate the parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression?
Consider:
  restrictiveness of the modeling assumptions
  rate of convergence (in amount of training data) toward the asymptotic hypothesis

Naïve Bayes vs Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.
Number of parameters to estimate:
  NB: ?
  LR: ?

Naïve Bayes vs Logistic Regression
Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.
Number of parameters:
  NB: 4n + 1 (two class-conditional means and two variances per feature, plus the prior π)
  LR: n + 1 (one weight per feature, plus the intercept w_0)
Estimation method:
  NB parameter estimates are uncoupled
  LR parameter estimates are coupled

G. Naïve Bayes vs. Logistic Regression
Recall the two assumptions used to derive the form of LR from GNB:
  1. X_i conditionally independent of X_k given Y
  2. P(X_i | Y = y_k) = N(µ_ik, σ_i), not N(µ_ik, σ_ik)
Consider three learning methods:
  GNB (assumption 1 only)
  GNB2 (assumptions 1 and 2)
  LR
Which method works better if we have infinite training data, and
  both (1) and (2) are satisfied?
  neither (1) nor (2) is satisfied?
  (1) is satisfied, but not (2)?

G. Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
Recall the two assumptions used to derive the form of LR from GNB:
  1. X_i conditionally independent of X_k given Y
  2. P(X_i | Y = y_k) = N(µ_ik, σ_i), not N(µ_ik, σ_ik)
Consider three learning methods:
  GNB (assumption 1 only): decision surface can be non-linear
  GNB2 (assumptions 1 and 2): decision surface is linear
  LR: decision surface is linear, trained differently
Which method works better if we have infinite training data, and...
  both (1) and (2) are satisfied: LR = GNB2 = GNB
  neither (1) nor (2) is satisfied: LR > GNB2, GNB > GNB2
  (1) is satisfied, but not (2): GNB > LR, LR > GNB2

G. Naïve Bayes vs. Logistic Regression [Ng & Jordan, 2002]
What if we have only finite training data? The two converge at different rates to their asymptotic (infinite-data) error.
Let ε_{A,n} refer to the expected error of learning algorithm A after n training examples, and let d be the number of features <X_1 ... X_d>.
So GNB requires n = O(log d) examples to approach its asymptotic error, but LR requires n = O(d).

Some experiments from UCI data sets [Ng & Jordan, 2002] (plots of test error versus number of training examples).

Naïve Bayes vs. Logistic Regression
The bottom line:
  GNB2 and LR both use linear decision surfaces; GNB need not.
  Given infinite data, LR is better than or equal to GNB2 because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
  But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error.
  And GNB is both more biased (assumption 1) and less biased (no assumption 2, so its decision surface can be non-linear) than LR, so either might beat the other.

What you should know:
Logistic regression
  Functional form follows from the Naïve Bayes assumptions
    for Gaussian Naïve Bayes assuming variance σ_ik = σ_i
    for discrete-valued Naïve Bayes too
  But the training procedure picks parameters without making the conditional independence assumption
  MLE training: pick W to maximize P(Y | X, W)
  MAP training: pick W to maximize P(W | X, Y)
    regularization helps reduce overfitting
Gradient ascent/descent
  a general approach when closed-form solutions are unavailable
Generative vs. Discriminative classifiers
Bias vs. variance tradeoff