Machine Learning: Gaussian Naïve Bayes Big Picture


Machine Learning 10-701
Tom M. Mitchell, Machine Learning Department, Carnegie Mellon University
January 27, 2011

Today:
- Naïve Bayes big picture
- Logistic regression
- Gradient ascent
- Generative/discriminative classifiers

Readings:
- Required: Mitchell, "Naïve Bayes and Logistic Regression" (see class website)
- Optional: Ng and Jordan paper (class website)

Gaussian Naïve Bayes Big Picture

Consider boolean Y, continuous X_i. Assume P(Y=1) = 0.5.

What is the minimum possible error? Best case:
- the conditional independence assumption is satisfied
- we know P(Y), P(X|Y) perfectly (e.g., infinite training data)

Logistic Regression

Idea: Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y). Why not learn P(Y|X) directly?

Consider learning f: X → Y, where
- X is a vector of real-valued features, <X_1 ... X_n>
- Y is boolean
- assume all X_i are conditionally independent given Y
- model P(X_i | Y = y_k) as Gaussian N(µ_ik, σ_i)
- model P(Y) as Bernoulli(π)

What does that imply about the form of P(Y|X)? Derive the form of P(Y|X) for continuous X_i.

Very convenient! These assumptions imply that P(Y=1|X) is a logistic function of a linear combination of the X_i, which in turn implies a linear classification rule (derivation sketched below).
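A reconstruction of that derivation under the stated assumptions (Gaussian class-conditionals with class-independent variances σ_i, and P(Y=1) = π); the symbol names are notation for this sketch, not taken from the slides. By Bayes rule and conditional independence,

$$P(Y=1 \mid X) = \frac{1}{1 + \exp\!\Big(\ln\frac{P(Y=0)}{P(Y=1)} + \sum_{i=1}^{n} \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)}.$$

With $P(X_i \mid Y=y_k) = \mathcal{N}(\mu_{ik}, \sigma_i)$, each log-ratio is linear in $X_i$:

$$\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)} = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2},$$

so

$$P(Y=1 \mid X) = \frac{1}{1 + \exp\!\big(w_0 + \sum_{i=1}^{n} w_i X_i\big)}, \qquad w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}, \qquad w_0 = \ln\frac{1-\pi}{\pi} + \sum_{i=1}^{n}\frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2},$$

and choosing the more probable label reduces to checking the sign of $w_0 + \sum_i w_i X_i$: a linear classification rule.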

Logistic function
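For reference, the logistic (sigmoid) function that this slide plots is

$$\sigma(a) = \frac{1}{1 + e^{-a}},$$

which maps any real-valued score into (0, 1). Under the convention of the previous slide, $P(Y=1 \mid X) = \sigma\!\big(-(w_0 + \sum_i w_i X_i)\big)$; flipping the sign of the weights gives the more familiar form $\sigma(w_0 + \sum_i w_i X_i)$.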

Logistic regression more generally

Logistic regression when Y is not boolean (but still discrete-valued). Now y ∈ {y_1 ... y_R}: learn R−1 sets of weights. For k < R, P(Y=y_k|X) is proportional to the exponential of a linear function of X; for k = R, the probability is fixed by normalization (formulas below).

Training Logistic Regression: MCLE

We have L training examples. Rather than the full maximum likelihood estimate for the parameters W, logistic regression uses the maximum conditional likelihood estimate.
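The per-class formulas for the multiclass case above were not captured in the transcription; a standard form with R−1 weight vectors (the notation is illustrative) is, for k < R,

$$P(Y=y_k \mid X) = \frac{\exp\big(w_{k0} + \sum_{i=1}^{n} w_{ki} X_i\big)}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)},$$

and for k = R,

$$P(Y=y_R \mid X) = \frac{1}{1 + \sum_{j=1}^{R-1} \exp\big(w_{j0} + \sum_{i=1}^{n} w_{ji} X_i\big)},$$

so the R probabilities sum to one by construction.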

Training Logistic Regression: MCLE

Choose parameters W = <w_0, ..., w_n> to maximize the conditional likelihood of the training data, where the training data are D = {<X^1, Y^1>, ..., <X^L, Y^L>}. The data likelihood is the joint probability of the examples given W; the data conditional likelihood is the product over examples of P(Y^l | X^l, W).

Expressing the Conditional Log Likelihood
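In standard notation (a sketch, not the slide's exact symbols), the data conditional likelihood is

$$\prod_{l=1}^{L} P(Y^l \mid X^l, W),$$

so for boolean Y the conditional log likelihood is

$$l(W) = \sum_{l=1}^{L} \Big[\, Y^l \ln P(Y^l{=}1 \mid X^l, W) + (1 - Y^l)\ln P(Y^l{=}0 \mid X^l, W)\,\Big].$$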

Maximizing the Conditional Log Likelihood

Good news: l(W) is a concave function of W.
Bad news: no closed-form solution to maximize l(W).

Maximize Conditional Log Likelihood: Gradient Ascent

Gradient ascent algorithm: iterate until the change is < ε. For all i, repeat the gradient-ascent weight update (a sketch follows below).
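The update equation on this slide was not transcribed. Below is a minimal NumPy sketch of batch gradient ascent on l(W), assuming the common convention P(Y=1 | x, W) = σ(w_0 + Σ_i w_i x_i); the function name, learning rate eta, and stopping tolerance eps are illustrative choices, not from the lecture.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood l(W).

    X: (L, n) array of real-valued features; y: (L,) array of 0/1 labels.
    Assumes P(Y=1 | x, w) = sigmoid(w0 + sum_i w_i x_i).
    """
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend a 1 for the bias w0
    w = np.zeros(X.shape[1])
    for _ in range(max_iters):
        p = sigmoid(X @ w)                 # P(Y=1 | x^l, w) for every example
        grad = X.T @ (y - p)               # dl/dw_i = sum_l x_i^l (y^l - p^l)
        w_new = w + eta * grad / len(y)    # gradient *ascent* step (averaged)
        if np.max(np.abs(w_new - w)) < eps:  # iterate until change < eps
            return w_new
        w = w_new
    return w

Usage would look like w = train_logistic_regression(X, y), then predict Y=1 whenever sigmoid(w[0] + w[1:] @ x) > 0.5.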

That's all for M(C)LE. How about MAP?

One common approach is to define priors on W: a Normal distribution with zero mean and identity covariance. This helps avoid very large weights and overfitting.

MAP estimate: let's assume a Gaussian prior W ~ N(0, σ).

MLE vs MAP
- Maximum conditional likelihood estimate
- Maximum a posteriori estimate with prior W ~ N(0, σI)
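In equations (a standard formulation, assuming the prior W ~ N(0, σI) above), the MAP estimate maximizes the penalized conditional log likelihood

$$W_{MAP} = \arg\max_W \; \sum_{l=1}^{L} \ln P(Y^l \mid X^l, W) \;-\; \frac{1}{2\sigma^2}\sum_{i} w_i^2$$

up to constants independent of W, so each gradient ascent update simply gains an extra term $-\eta\,w_i/\sigma^2$ that shrinks the weights toward zero.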

MAP Estimates and Regularization

Maximum a posteriori estimate with prior W ~ N(0, σI). The prior term is called a regularization term. It:
- helps reduce overfitting, especially when training data is sparse
- keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
- is used very frequently in logistic regression

The Bottom Line

Consider learning f: X → Y, where
- X is a vector of real-valued features, <X_1 ... X_n>
- Y is boolean
- assume all X_i are conditionally independent given Y
- model P(X_i | Y = y_k) as Gaussian N(µ_ik, σ_i)
- model P(Y) as Bernoulli(π)

Then P(Y|X) is of this (logistic) form, and we can directly estimate W. Furthermore, the same holds if the X_i are boolean; try proving that to yourself.

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X).

Generative classifiers (e.g., Naïve Bayes):
- Assume some functional form for P(X|Y), P(Y)
- Estimate parameters of P(X|Y), P(Y) directly from training data
- Use Bayes rule to calculate P(Y | X = x_i)

Discriminative classifiers (e.g., logistic regression):
- Assume some functional form for P(Y|X)
- Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression? Consider:
- Restrictiveness of modeling assumptions
- Rate of convergence (in amount of training data) toward the asymptotic hypothesis, i.e., the learning curve

Naïve Bayes vs Logistic Regression

Consider Y boolean, X_i continuous, X = <X_1 ... X_n>.

Number of parameters to estimate:
- NB: 4n + 1
- LR: n + 1

Estimation method:
- NB parameter estimates are uncoupled
- LR parameter estimates are coupled
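One way to account for those counts (note that 4n + 1 corresponds to class-specific variances σ_ik; with the shared σ_i of the earlier slides the count would be 3n + 1):
- GNB: for each of the n features, a mean and a variance per class, µ_i0, µ_i1, σ²_i0, σ²_i1 (4n parameters), plus the class prior π, for 4n + 1 in total.
- LR: one weight per feature plus the bias, w_0, w_1, ..., w_n, for n + 1 in total.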

Gaussian Naïve Bayes vs. Logistic Regression (Generative and Discriminative Classifiers) [Ng & Jordan, 2002]

Asymptotic comparison (# training examples → infinity):
- when the conditional independence assumptions are correct, GNB and LR produce identical classifiers
- when the conditional independence assumptions are incorrect, LR is less biased (it does not assume conditional independence) and is therefore expected to outperform GNB when both are given infinite training data

Non-asymptotic analysis (see [Ng & Jordan, 2002]):
- convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
- GNB: order log n (where n = # of attributes in X)
- LR: order n
- GNB converges more quickly to its (perhaps less accurate) asymptotic estimates
- Informally: because LR's parameter estimates are coupled, but GNB's are not

Some experiments from UCI data sets [Ng & Jordan, 2002]

Summary: Naïve Bayes and Logistic Regression

Modeling assumptions:
- Naïve Bayes is more biased (conditional independence)
- Both learn linear decision surfaces

Convergence rate (n = # of attributes in X):
- Naïve Bayes: ~O(log n) training examples
- Logistic regression: ~O(n) training examples

Bottom line: Naïve Bayes converges faster to its (potentially too restricted) final hypothesis.

What you should know:

Logistic regression:
- Functional form follows from the Naïve Bayes assumptions
  - for Gaussian Naïve Bayes assuming variance σ_{i,k} = σ_i
  - for discrete-valued Naïve Bayes too
- But the training procedure picks parameters without the conditional independence assumption
- MLE training: pick W to maximize P(Y | X, W)
- MAP training: pick W to maximize P(W | X, Y)
  - regularization: e.g., P(W) ~ N(0, σ) helps reduce overfitting

Gradient ascent/descent:
- General approach when closed-form solutions for MLE/MAP are unavailable

Generative vs. discriminative classifiers
Bias vs. variance tradeoff