CSE 446 Gaussian Naïve Bayes & Logistic Regression


Transcription:

CSE 446: Gaussian Naïve Bayes & Logistic Regression
Winter 2012, Dan Weld
Last time: learning Gaussians, Naïve Bayes. Today: Gaussians, Naïve Bayes, Logistic Regression.
Some slides from Carlos Guestrin, Luke Zettlemoyer.

Text Classification: Bag of Words Representation
A document is represented by the count of each vocabulary word it contains (aardvark, about, all, Africa, apple, anxious, ..., gas, ..., oil, ..., Zaire).

Bayesian Learning
Use Bayes rule: P(Y|X) = P(X|Y) P(Y) / P(X), i.e., posterior = data likelihood × prior / normalization.
Or equivalently: P(Y|X) ∝ P(X|Y) P(Y).

Naïve Bayes
Naïve Bayes assumption: features are independent given the class, so more generally P(X_1, ..., X_n | Y) = ∏_i P(X_i | Y). How many parameters now? Suppose X is composed of n binary features: the full joint needs on the order of 2^n parameters per class, while the naïve Bayes model needs only n per class. (A small worked sketch of the Bayes-rule update under this assumption follows below.)

The Distributions We Love
  Binary {0,1}: single event: Bernoulli; a sequence of N = H + T trials (H heads, T tails): Binomial; conjugate prior: Beta.
  k discrete values: Multinomial; conjugate prior: Dirichlet.
  Continuous: Gaussian (the focus of today's lecture).
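To make the Bayes-rule update and the naïve Bayes factorization concrete, here is a minimal Python sketch. The class names, word probabilities, and priors are made-up illustrative values, not from the lecture.

```python
import math

# Hypothetical class priors and per-class word probabilities P(word | class).
# All numbers are illustrative only.
prior = {"sports": 0.5, "politics": 0.5}
likelihood = {
    "sports":   {"game": 0.20, "vote": 0.01, "oil": 0.02},
    "politics": {"game": 0.02, "vote": 0.15, "oil": 0.05},
}

def posterior(words, prior, likelihood):
    """P(Y | X) ∝ P(Y) * prod_i P(X_i | Y): the naive Bayes assumption."""
    scores = {}
    for y in prior:
        log_score = math.log(prior[y])       # work in log space for stability
        for w in words:
            log_score += math.log(likelihood[y][w])
        scores[y] = log_score
    # Normalize: P(Y | X) = P(X | Y) P(Y) / P(X).
    z = math.log(sum(math.exp(s) for s in scores.values()))
    return {y: math.exp(s - z) for y, s in scores.items()}

print(posterior(["game", "oil"], prior, likelihood))   # most mass on "sports"
print(posterior(["vote", "oil"], prior, likelihood))   # most mass on "politics"
```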

NB with Bag of Words for Text Classification
Learning phase:
  Prior P(Y_m): count how many documents are from topic m / total # of docs.
  P(X_i | Y_m): let B_m be the bag of words formed from all the docs in topic m, and let #(i, B) be the number of times word i appears in bag B. Then
    P(X_i | Y_m) = (#(i, B_m) + 1) / (W + Σ_j #(j, B_m)), where W = # of unique words.
Test phase: for each document, use the naïve Bayes decision rule. (A code sketch of both phases follows below.)

But... Easy to Implement
If you do, it probably won't work...

Probabilities: Important Detail!
P(spam | X_1 ... X_n) = ∏_i P(spam | X_i)
Any more potential problems here? We are multiplying lots of small numbers: danger of underflow!
  0.5^57 ≈ 7E-18
Solution? Use logs and add: p_1 * p_2 = e^(log(p_1) + log(p_2)). Always keep probabilities in log form.

Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (i.e., predicting the class with maximum posterior probability) are usually fairly accurate (?!?!?). However, due to the inadequacy of the conditional independence assumption, the actual posterior probability estimates are not accurate: the output probabilities are generally very close to 0 or 1.
[Slides also show classification results and a learning curve for the Twenty Newsgroups dataset.]
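Here is a minimal sketch of the learning and test phases above, with the +1 (Laplace) smoothing and the keep-everything-in-logs trick. The tiny corpus and topic names are invented for illustration only.

```python
import math
from collections import Counter

# Hypothetical toy corpus: (list of words, topic) pairs.
docs = [
    (["goal", "match", "goal"], "sports"),
    (["election", "vote"], "politics"),
    (["match", "score"], "sports"),
    (["vote", "policy", "vote"], "politics"),
]

topics = {y for _, y in docs}
vocab = {w for words, _ in docs for w in words}
W = len(vocab)  # number of unique words

# Learning phase: priors P(Y_m) and smoothed P(X_i | Y_m).
prior = {y: sum(1 for _, t in docs if t == y) / len(docs) for y in topics}
bag = {y: Counter() for y in topics}          # B_m: bag of words for topic m
for words, y in docs:
    bag[y].update(words)

def log_p_word(word, y):
    # P(X_i | Y_m) = (#(i, B_m) + 1) / (W + sum_j #(j, B_m))
    return math.log((bag[y][word] + 1) / (W + sum(bag[y].values())))

# Test phase: naive Bayes decision rule, summing logs instead of multiplying.
def classify(words):
    return max(topics,
               key=lambda y: math.log(prior[y]) + sum(log_p_word(w, y) for w in words))

print(classify(["goal", "score"]))   # expected: sports
print(classify(["vote", "policy"]))  # expected: politics
```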

Bayesian Learning: What if Features are Continuous?
E.g., character recognition: X_i is the i-th pixel.
Posterior: P(Y|X) ∝ P(X|Y) P(Y) (prior × data likelihood).
Model each feature with a class-conditional Gaussian:
  P(X_i = x | Y = y_k) = N(μ_ik, σ_ik), the Gaussian density with mean μ_ik and standard deviation σ_ik.

Gaussian Naïve Bayes
Sometimes assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Learning Gaussian Parameters
Maximum likelihood estimates, where x^j is the j-th training example and δ(x) = 1 if x is true, else 0:
  Mean:     μ_ik = Σ_j X_i^j δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
  Variance: σ_ik^2 = Σ_j (X_i^j - μ_ik)^2 δ(Y^j = y_k) / Σ_j δ(Y^j = y_k)
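A minimal NumPy sketch of those maximum-likelihood estimates for Gaussian naïve Bayes: per-class, per-feature means and variances plus class priors. The toy data, variable names, and the tiny variance floor are illustrative choices, not part of the lecture.

```python
import numpy as np

def fit_gnb(X, y):
    """MLE for Gaussian NB: per-class, per-feature means and variances plus class priors."""
    classes = np.unique(y)
    params = {}
    for k in classes:
        Xk = X[y == k]                        # rows with Y^j = y_k (the delta function)
        params[k] = {
            "prior": len(Xk) / len(X),
            "mu": Xk.mean(axis=0),            # mu_ik = sum_j X_i^j delta(Y^j=y_k) / count
            "var": Xk.var(axis=0) + 1e-9,     # sigma_ik^2 (MLE); tiny constant avoids /0
        }
    return params

def log_gaussian(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def predict(params, x):
    scores = {k: np.log(p["prior"]) + log_gaussian(x, p["mu"], p["var"]).sum()
              for k, p in params.items()}
    return max(scores, key=scores.get)

# Toy illustration: two Gaussian-ish classes in 2D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
params = fit_gnb(X, y)
print(predict(params, np.array([2.8, 3.1])))   # expected: class 1
```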

Example: GNB for Classifying Mental States [Mitchell et al.]
fMRI: ~1 mm resolution, ~2 images per sec., ~15,000 voxels/image; non-invasive and safe; measures the Blood Oxygen Level Dependent (BOLD) response (typical impulse response on the order of 10 sec). Brain scans can track activation with precision and sensitivity.
Gaussian Naïve Bayes: learned P(BrainActivity | WordCategory = {People, Animal}) for each (voxel, word) pair [Mitchell et al.]. Pairwise classification accuracy (People words vs. Animal words): 85%.

What You Need to Know about Naïve Bayes
  Optimal decision using the Bayes classifier.
  Naïve Bayes classifier: what the assumption is, why we use it, how we learn it.
  Text classification: bag-of-words model.
  Gaussian NB: features are still conditionally independent, and each feature has a Gaussian distribution given the class.

What is (Supervised) Learning, More Formally?
Given:
  Dataset: instances {<x_1; t(x_1)>, ..., <x_N; t(x_N)>}, e.g., <x_i; t(x_i)> = (GPA=3.9, IQ=120, MLscore=99); 150K.
  Hypothesis space H, e.g., polynomials of degree 8.
  Loss function: measures the quality of a hypothesis h ∈ H, e.g., squared error for regression.
Obtain:
  Learning algorithm: obtain the h ∈ H that minimizes the loss function, e.g., using a closed-form solution if available, or greedy search if not (see the sketch below).
We want to minimize prediction error, but can only minimize error on the dataset.
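As a concrete instance of "obtain the h in H that minimizes the loss function, using a closed-form solution if available", here is a minimal least-squares sketch. The data values are made up, and the hypothesis space is deliberately shrunk to degree-2 polynomials as a small stand-in for the slide's "degree 8" example; the loss is squared error, as in the slide.

```python
import numpy as np

# Toy dataset: instances <x_j; t(x_j)>, illustrative values only.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
t = np.array([1.1, 1.9, 4.2, 8.8, 17.1])

# Hypothesis space H: polynomials of degree 2 in x.
Phi = np.vander(x, N=3, increasing=True)      # design matrix with columns 1, x, x^2

# Loss function: squared error; the minimizer has a closed form (least squares).
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)

x_new = np.vander(np.array([5.0]), N=3, increasing=True)
print("learned coefficients:", w)
print("prediction at x = 5:", (x_new @ w)[0])
```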

Types of (Supervised) Learning Problems, Revisited
  Decision trees, e.g., dataset: <votes; party>. Hypothesis space? Loss function?
  NB classification, e.g., dataset: <brain image; {verb v. noun}>. Hypothesis space? Loss function?
  Density estimation, e.g., dataset: grades. Hypothesis space? Loss function?

Learning is (Simply) Function Approximation!
The general (supervised) learning problem: given some data (including features), a hypothesis space, and a loss function. Learning is no magic! We are simply trying to find a function that fits the data. Regression, density estimation, classification: (not surprisingly) seemingly different problems, very similar solutions. [Carlos Guestrin]
What you need to know about supervised learning: learning is function approximation; ask what functions are being optimized.

Generative vs. Discriminative Classifiers
Want to learn h: X → Y, where X are the features and Y the target classes. The Bayes optimal classifier uses P(Y|X).
Generative classifier, e.g., Naïve Bayes:
  Assume some functional form for P(X|Y) and P(Y).
  Estimate the parameters of P(X|Y) and P(Y) directly from the training data.
  Use Bayes rule to calculate P(Y|X = x).
  This is a generative model: an indirect computation of P(Y|X) through Bayes rule. As a result, it can also generate a sample of the data, since P(X) = Σ_y P(y) P(X|y).
Discriminative classifier, e.g., Logistic Regression:
  Assume some functional form for P(Y|X).
  Estimate the parameters of P(Y|X) directly from the training data.
  This is the discriminative model: it directly learns P(Y|X), but cannot generate a sample of the data, because P(X) is not available.

Learn P(Y|X) Directly! Logistic Regression
Assume a particular functional form. A hard threshold (step) function is not differentiable, so use the logistic function, aka the sigmoid, which varies smoothly between P(Y) = 0 and P(Y) = 1. (A sketch of this functional form follows below.)
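A minimal sketch of the logistic (sigmoid) functional form for P(Y|X). The sign convention, with Y = 1 getting the sigmoid of w_0 + w·x, is an assumed convention here, since the lecture's exact formula did not survive extraction; the weights and inputs are arbitrary illustrative values.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def p_y_given_x(x, w, w0):
    """Assumed form: P(Y=1 | X=x) = sigmoid(w0 + w . x); P(Y=0 | X=x) is its complement."""
    p1 = sigmoid(w0 + np.dot(w, x))
    return {0: 1.0 - p1, 1: p1}

# Features can be discrete or continuous; both simply enter the linear function.
x = np.array([1.0, 0.0, 2.5])
w = np.array([0.8, -1.2, 0.3])
print(p_y_given_x(x, w, w0=-0.5))
```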

Logistic Function in n Dimensions: Understanding Sigmoids
Sigmoid applied to a linear function of the data. [Plots of the 1-D sigmoid for several weight settings, e.g., (w_0 = -2, w_1 = -1), (w_0 = 0, w_1 = -1), (w_0 = 0, w_1 = -0.5), and of the 2-D logistic surface.] Features can be discrete or continuous!

Very Convenient!
The sigmoid form implies that the log odds ln[P(Y=1|X) / P(Y=0|X)] is a linear function of X, which implies a linear classification rule: the decision boundary is the hyperplane where the linear function equals zero.

Loss Functions: Likelihood vs. Conditional Likelihood
Generative (naïve Bayes) loss function: the data likelihood, ln P(D | w) = Σ_j ln P(x^j, y^j | w).
Discriminative models cannot compute P(x^j | w)! But the discriminative (logistic regression) loss function is the conditional data likelihood, ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w). It doesn't waste effort learning P(X); it focuses on P(Y|X), which is all that matters for classification.

Expressing and Maximizing the Conditional Log Likelihood
Good news: l(w) is a concave function of w, so there are no locally optimal solutions. Bad news: there is no closed-form solution that maximizes l(w). Good news: concave functions are easy to optimize. (A sketch of l(w) follows below.)
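A minimal sketch of the conditional data log-likelihood l(w) that logistic regression maximizes, under the same assumed convention P(Y=1|x) = sigmoid(w_0 + w·x); the 0/1 labels and the toy arrays are illustrative only. Because l(w) is concave in (w_0, w), any maximum gradient ascent finds is global.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conditional_log_likelihood(w, w0, X, y):
    """l(w) = sum_j ln P(y^j | x^j, w): the discriminative loss, with no P(X) term."""
    p1 = sigmoid(w0 + X @ w)                       # P(Y=1 | x^j, w) for each example
    return np.sum(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# Toy data, illustrative only.
X = np.array([[0.5, 1.0], [2.0, -1.0], [1.5, 0.5], [-1.0, -0.5]])
y = np.array([1, 0, 1, 0])
print(conditional_log_likelihood(np.array([1.0, 0.5]), 0.0, X, y))
```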

Optimizing a Concave Function: Gradient Ascent
The conditional likelihood for logistic regression is concave, so we can find the optimum with gradient ascent. Maximize the conditional log likelihood:
  Gradient: ∇_w l(w) = [∂l(w)/∂w_0, ..., ∂l(w)/∂w_n]
  Update rule: w^(t+1) = w^(t) + η ∇_w l(w^(t)), with learning rate η > 0.
Gradient ascent is the simplest of optimization approaches; e.g., conjugate gradient ascent is much better (see reading).

Gradient Ascent for LR
Gradient ascent algorithm: iterate until the change is < ε:
  for i = 0, ..., n:  w_i ← w_i + η ∂l(w)/∂w_i
  repeat. (A code sketch, including the regularized version below, follows after this section.)

That's All M(C)LE. How About MAP?
One common approach is to define a prior on w: a Normal distribution with zero mean and identity covariance. This pushes the parameters towards zero, corresponds to regularization, and helps avoid very large weights and overfitting. (More on this later in the semester.) The MAP estimate is w* = arg max_w [ ln P(w) + Σ_j ln P(y^j | x^j, w) ].

M(C)AP as Regularization
Large parameters lead to overfitting: if the data is linearly separable, the weights go to infinity. The log of the zero-mean Gaussian prior adds a term that penalizes high weights (the same penalty is also applicable in linear regression). Penalizing high weights can prevent overfitting; again, more on this later in the semester.
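A minimal sketch of the gradient-ascent update, with an optional zero-mean Gaussian prior term (L2 regularization) for the M(C)AP version. The gradient expression assumes the same P(Y=1|x) = sigmoid(w_0 + w·x) convention as above, and the step size, stopping threshold, and toy data are made-up values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, eta=0.1, lam=0.0, tol=1e-6, max_iter=10000):
    """Gradient ascent on the conditional log likelihood, minus (lam/2)*||w||^2 if lam > 0."""
    w, w0 = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iter):
        err = y - sigmoid(w0 + X @ w)          # y^j - P(Y=1 | x^j, w)
        grad_w = X.T @ err - lam * w           # the Gaussian prior pushes weights toward zero
        grad_w0 = err.sum()                    # bias left unregularized here
        w_new, w0_new = w + eta * grad_w, w0 + eta * grad_w0
        if max(np.abs(w_new - w).max(), abs(w0_new - w0)) < tol:   # iterate until change < tol
            return w_new, w0_new
        w, w0 = w_new, w0_new
    return w, w0

# Toy, linearly separable data: with lam=0 the weights keep growing across iterations,
# which is the overfitting behavior the M(C)AP slide warns about; lam > 0 keeps them bounded.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])
print("MCLE     :", fit_lr(X, y, lam=0.0, max_iter=2000))
print("MCAP (L2):", fit_lr(X, y, lam=1.0))
```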

Gradient of M(C)AP; MLE vs. MAP
Maximum conditional likelihood estimate: w* = arg max_w Σ_j ln P(y^j | x^j, w).
Maximum conditional a posteriori estimate: w* = arg max_w [ ln P(w) + Σ_j ln P(y^j | x^j, w) ].

Logistic Regression v. Naïve Bayes
Derive the form of P(Y|X) for continuous X_i. Consider learning f: X → Y, where X is a vector of real-valued features <X_1, ..., X_n> and Y is boolean. We could use a Gaussian Naïve Bayes classifier:
  assume all X_i are conditionally independent given Y;
  model P(X_i | Y = y_k) as a Gaussian N(μ_ik, σ_i), with variance independent of the class;
  model P(Y) as Bernoulli(π, 1 - π).
What does that imply about the form of P(Y|X)? Take the ratio of the class-conditional probabilities and derive the form of P(Y|X) for continuous X_i (a derivation sketch follows below). Cool!!!!
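A sketch of that derivation, assuming the setup above (naïve Bayes independence, class-independent variances σ_i, prior P(Y=1) = π) and following the textbook convention in which the sign is absorbed into the weights. The algebra shows that the implied P(Y|X) has exactly the logistic form.

```latex
\begin{align*}
P(Y=1 \mid X)
  &= \frac{P(Y=1)\,P(X \mid Y=1)}{P(Y=1)\,P(X \mid Y=1) + P(Y=0)\,P(X \mid Y=0)} \\
  &= \frac{1}{1 + \exp\!\Big(\ln\frac{1-\pi}{\pi}
        + \sum_i \ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}\Big)} \\
\intertext{With $P(X_i \mid Y=k) = N(\mu_{ik}, \sigma_i)$, each ratio of class-conditional
densities is linear in $X_i$, because the quadratic terms cancel when the variances match:}
\ln\frac{P(X_i \mid Y=0)}{P(X_i \mid Y=1)}
  &= \frac{(X_i-\mu_{i1})^2 - (X_i-\mu_{i0})^2}{2\sigma_i^2}
   = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2}\,X_i
     + \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2} \\
\intertext{so}
P(Y=1 \mid X) &= \frac{1}{1 + \exp\!\big(w_0 + \sum_i w_i X_i\big)},
\qquad
w_i = \frac{\mu_{i0}-\mu_{i1}}{\sigma_i^2},\quad
w_0 = \ln\frac{1-\pi}{\pi} + \sum_i \frac{\mu_{i1}^2-\mu_{i0}^2}{2\sigma_i^2}.
\end{align*}
```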

Gaussian Naïve Bayes v. Logistic Regression
The set of Gaussian Naïve Bayes parameters (with feature variance independent of the class label) corresponds to the set of Logistic Regression parameters.

Naïve Bayes vs. Logistic Regression
Consider Y boolean and X_i continuous, X = <X_1, ..., X_n>. Number of parameters: NB: 4n + 1; LR: n + 1.
Representation equivalence, but only in a special case (GNB with class-independent variances)! So what's the difference? LR makes no assumptions about P(X|Y) in learning. The loss function: they optimize different functions and obtain different solutions.
Estimation method: the NB parameter estimates are uncoupled; the LR parameter estimates are coupled.

G. Naïve Bayes vs. Logistic Regression: Generative and Discriminative Classifiers [Ng & Jordan, 2002]
Asymptotic comparison (# training examples → infinity):
  when the model is correct, GNB and LR produce identical classifiers;
  when the model is incorrect, LR is less biased (it does not assume conditional independence), so LR is expected to outperform GNB.

G. Naïve Bayes vs. Logistic Regression, 2 [Ng & Jordan, 2002]
Non-asymptotic analysis: convergence rate of the parameter estimates, with n = # of attributes in X. Size of training data needed to get close to the infinite-data solution: GNB needs O(log n) samples; LR needs O(n) samples. So GNB converges more quickly to its (perhaps less helpful) asymptotic estimates.
[Some experiments on UCI data sets: learning curves comparing naïve Bayes and logistic regression; a small comparison sketch follows after this section.]

What You Should Know about Logistic Regression (LR)
  Gaussian Naïve Bayes with class-independent variances is representationally equivalent to LR; the solutions differ because of the objective (loss) function.
  In general, NB and LR make different assumptions. NB: features independent given the class, an assumption on P(X|Y). LR: a functional form for P(Y|X), with no assumption on P(X|Y).
  LR is a linear classifier: the decision rule is a hyperplane.
  LR is optimized by conditional likelihood: no closed-form solution, but the objective is concave, so gradient ascent finds the global optimum.
  Maximum conditional a posteriori estimation corresponds to regularization.
  Convergence rates: GNB (usually) needs less data; LR (usually) gets to better solutions in the limit.
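A minimal sketch, assuming scikit-learn is available, of the kind of GNB-vs-LR comparison the Ng & Jordan discussion refers to: train both classifiers on increasing amounts of data and watch how test accuracy evolves. The synthetic data generator, sample sizes, and random seeds are arbitrary choices, and no particular outcome is claimed; this is not a reproduction of the UCI experiments.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic problem: 20 continuous features, boolean label (illustrative only).
X, y = make_classification(n_samples=5000, n_features=20, n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

for n in [20, 50, 100, 500, 2000]:          # growing training-set sizes
    gnb = GaussianNB().fit(X_train[:n], y_train[:n])
    lr = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    print(f"n={n:4d}  GNB acc={gnb.score(X_test, y_test):.3f}  "
          f"LR acc={lr.score(X_test, y_test):.3f}")
```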