Generative Classifiers: Part 1
CSC411/2515: Machine Learning and Data Mining, Winter 2018
Michael Guerzhoy and Lisa Zhang


This Week
- Discriminative vs. generative models
- Simple model: does the patient have cancer?
  - Discrete test
  - Continuous test
- Naïve Bayes: spam filtering example
  - Continuous features
- Naïve Bayes and logistic regression

Discriminative vs. Generative Models
- Discriminative: build a classifier that learns a decision boundary, and classify test data by seeing which side of the boundary it falls on. (What are some examples we've seen?)
- Generative: build a model of what the data in each class looks like (i.e., the distribution of the data), and classify test data by comparing it against the models for the data in the two classes.

Discriminative vs. Generative Models
- Discriminative: builds a classifier that learns a decision boundary; models p(y|x) directly.
- Generative: builds a model of what the data in each class looks like; models p(x|y) for each value of y, as well as p(y), and then uses Bayes' rule to find p(y|x).

Example of p(x|y)
- P(x_1, x_2 | y = blue) = N(x_1 | μ = 1, σ² = 1) · N(x_2 | μ = 1, σ² = 1)
- A product of normal densities for x_1 and x_2: highest density at (1, 1), lower density far away from (1, 1).
- This wouldn't work for the red crosses: their probability distribution doesn't look symmetric.
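As a rough illustration of this class-conditional density, here is a minimal Python sketch (not from the slides) that evaluates the product of the two normal densities with the stated parameters; the query points are made up:

```python
import math

def normal_pdf(x, mu, sigma2):
    """Density of N(mu, sigma2) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def p_x_given_blue(x1, x2):
    # Factorized class-conditional: product of two 1-D normals with mean 1, variance 1
    return normal_pdf(x1, mu=1.0, sigma2=1.0) * normal_pdf(x2, mu=1.0, sigma2=1.0)

print(p_x_given_blue(1.0, 1.0))   # highest density, at the class mean (1, 1)
print(p_x_given_blue(3.0, -1.0))  # much lower density far from (1, 1)
```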

Example of p(y)
- P(y = blue) = 12/(12 + 14), P(y = red) = 14/(12 + 14)

Does the patient have cancer?
- We perform a lab test and the result comes back positive.
- The test comes back positive in 98% of cases where the cancer is present.
- The test comes back negative in 97% of cases where there is no cancer.
- 0.8% (0.008) of the population has cancer.
- And the cancer screening was random. (Why is this important?)

Generative process for the data
- θ = {P(cancer) = 0.008, P(+ | cancer) = 0.98, P(− | ¬cancer) = 0.97}
- θ determines how the test results are generated:
  - Person i has cancer with probability 0.008.
  - The probability of a positive test for person i depends on whether they have cancer or not.

- P(cancer) = 0.008, P(+ | cancer) = 0.98, P(− | ¬cancer) = 0.97
- P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
  = P(+ | cancer) P(cancer) / [P(+ | cancer) P(cancer) + P(+ | ¬cancer) P(¬cancer)]
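As a quick check of the arithmetic, a minimal Python sketch (not from the slides) that plugs the numbers above into Bayes' rule, using P(+ | ¬cancer) = 1 − 0.97 = 0.03:

```python
# Bayes' rule with the numbers from the slide.
p_cancer = 0.008                  # prior P(cancer)
p_pos_given_cancer = 0.98         # P(+ | cancer)
p_neg_given_no_cancer = 0.97      # P(- | no cancer)
p_pos_given_no_cancer = 1 - p_neg_given_no_cancer   # P(+ | no cancer) = 0.03

# P(+) by the law of total probability
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(p_cancer_given_pos)         # ~0.21: a positive test is far from conclusive
```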

Learning a Generative Model
- For the cancer data, just count the number of points in the training set (of size N) belonging to each category:
  - P(cancer) ≈ count(cancer) / N
  - P(+ | cancer) ≈ count(cancer, +) / count(cancer)
- (We could get P(cancer | +) by counting as well.)

Gaussian Classifiers
- Suppose the test actually outputs a real number t.
- Model:
  - P(cancer) ≈ count(cancer) / N
  - P(t | cancer) = N(t | μ_cancer, σ²_cancer)
  - P(t | ¬cancer) = N(t | μ_¬cancer, σ²_¬cancer)
  - θ = {μ_cancer, μ_¬cancer, σ²_cancer, σ²_¬cancer}
- θ determines how the test results are generated:
  - Decide whether person i has cancer (with probability P(cancer)).
  - Now generate the test output t.
- What's the probability that the person has cancer, if we know θ?
  P_θ(cancer | t) = P_θ(t | cancer) P_θ(cancer) / [P_θ(t | cancer) P_θ(cancer) + P_θ(t | ¬cancer) P_θ(¬cancer)]
- Learn θ using maximum likelihood.

Learning a Gaussian with Maximum Likelihood
- We have all the t's for patients with cancer.
- Maximum likelihood:
  argmax over (μ_c, σ²_c) of Π_{i: cancer} N(t^(i) | μ_c, σ²_c)
- Solution (show with calculus!):
  - μ_c = (1 / #cancer) Σ_{i: cancer} t^(i)
  - σ²_c = (1 / #cancer) Σ_{i: cancer} (t^(i) − μ_c)²
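A hedged Python sketch of these maximum-likelihood formulas and the resulting posterior; the test values t below are made up purely for illustration and are not from the slides:

```python
import numpy as np

def normal_pdf(t, mu, var):
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Hypothetical training data: test outputs for the two groups (illustrative only)
t_cancer = np.array([5.1, 6.3, 5.8, 6.0])
t_no_cancer = np.array([1.2, 0.7, 2.1, 1.5, 0.9, 1.8])

# Maximum-likelihood estimates: sample mean, and variance with 1/N (numpy's default)
mu_c, var_c = t_cancer.mean(), t_cancer.var()
mu_nc, var_nc = t_no_cancer.mean(), t_no_cancer.var()
p_cancer = len(t_cancer) / (len(t_cancer) + len(t_no_cancer))   # count(cancer)/N

def p_cancer_given_t(t):
    # Posterior via Bayes' rule with the two fitted Gaussians
    num = normal_pdf(t, mu_c, var_c) * p_cancer
    den = num + normal_pdf(t, mu_nc, var_nc) * (1 - p_cancer)
    return num / den

print(p_cancer_given_t(5.5))   # close to 1
print(p_cancer_given_t(1.0))   # close to 0
```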

Classification of new instances
- Suppose we don't know θ.
- What's P(cancer | t, D)?
- It is not P_θMAP(cancer | t, D)!

Classification of new instances
- Suppose we are estimating θ from the data.
- What's P(cancer | t, D)?
  P(cancer | t, D) = Σ_{θ ∈ Θ} P(cancer | θ, t) P(θ | D) = Σ_{θ ∈ Θ} P_θ(cancer | t) P(θ | D)
- Intuition: consider every possible θ, compute the probability according to each of them, and weight each one by how much we believe it could be the true parameter.
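One way to picture this sum, as a rough sketch (not in the slides): restrict θ to a small hypothetical grid of candidate parameter settings, weight each candidate by a hypothetical posterior P(θ | D), and average the per-θ predictions. All numbers below are illustrative assumptions:

```python
import numpy as np

# Hypothetical candidates theta = (mu_cancer, mu_no_cancer), shared unit variance.
thetas = [(5.0, 1.0), (5.5, 1.2), (6.0, 0.8)]
# Hypothetical posterior weights P(theta | D); they must sum to 1.
p_theta_given_D = np.array([0.2, 0.5, 0.3])

def normal_pdf(t, mu, var=1.0):
    return np.exp(-(t - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def p_cancer_given_t_theta(t, theta, p_cancer=0.1):
    # Per-theta posterior P_theta(cancer | t); the prior 0.1 is also illustrative.
    mu_c, mu_nc = theta
    num = normal_pdf(t, mu_c) * p_cancer
    return num / (num + normal_pdf(t, mu_nc) * (1 - p_cancer))

t = 4.0
# Posterior predictive: sum over theta of P(cancer | theta, t) * P(theta | D)
p = sum(w * p_cancer_given_t_theta(t, th) for th, w in zip(thetas, p_theta_given_D))
print(p)
```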

- Suppose we are estimating θ from the data. What's P(cancer | t)?
- P_θMAP(cancer | t) is not a horrible estimate here.

Multidimensional features
- In the previous example, x was a scalar (one-dimensional).
- In general, the features x can be high-dimensional. We can do something similar:
  P(y = c | x_1, ..., x_p) = P(y = c) P(x_1, ..., x_p | y = c) / Σ_{c'} P(y = c') P(x_1, ..., x_p | y = c')
- But now P(x_1, ..., x_p | y = c) is harder to model. More on this later.

Naïve Bayes
- Assume the input features are conditionally independent given the class:
  P(x_1, ..., x_p | y = c) = Π_{i=1}^{p} P(x_i | y = c)
- Note: this is different from unconditional independence!
- A fairly strong assumption, but it works quite well in practice. A model that is not right can still be useful:
  "All models are wrong, but some are useful." (George E. P. Box)

Naïve Bayes classifier
- c* = argmax_c P(y = c) Π_i P(x_i | y = c)
- If the x_i are discrete, learn P(x | class) using counts:
  P(x_i = 1 | c) ≈ count(x_i = 1, c) / count(c)
- I.e., count how many times the attribute appears in emails of class c.
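A minimal Python sketch of this counting estimate and the argmax rule for binary features; the dataset and variable names are illustrative assumptions, not from the slides:

```python
import numpy as np

def fit_naive_bayes(X, y):
    """Estimate P(y=c) and P(x_i = 1 | y=c) by counting (no smoothing)."""
    classes = np.unique(y)
    priors = {c: np.mean(y == c) for c in classes}
    cond = {c: X[y == c].mean(axis=0) for c in classes}   # count(x_i=1, c)/count(c)
    return classes, priors, cond

def predict(x, classes, priors, cond):
    """Return argmax_c P(y=c) * prod_i P(x_i | y=c) for a binary feature vector x."""
    best_c, best_score = None, -np.inf
    for c in classes:
        p1 = cond[c]
        likelihood = np.prod(np.where(x == 1, p1, 1 - p1))
        score = priors[c] * likelihood
        if score > best_score:
            best_c, best_score = c, score
    return best_c

# Illustrative data: 3 binary features, labels 0/1
X = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]])
y = np.array([1, 1, 0, 0])
classes, priors, cond = fit_naive_bayes(X, y)
print(predict(np.array([1, 0, 0]), classes, priors, cond))
```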

Spam Filtering Examples
- Training emails:

  Email                   Spam?
  send us your password   Y
  send us your review     N
  review your password    N
  review us               Y
  send your password      Y
  send us your account    Y
  review us now           ????

- Observe attributes x_1, x_2, ..., x_n (e.g., whether keywords 1, 2, 3, ... are present in the email, respectively).
- Goal: classify the email as spam or non-spam.
- Bag-of-words features: features based on the appearance of keywords in the text, disregarding the order of the words.

Naïve Bayes Spam Filter
- From the training emails above:
  p(spam) = 4/6, p(notspam) = 2/6

  word       p(w | spam)   p(w | notspam)
  password   2/4           1/2
  review     1/4           2/2
  send       3/4           1/2
  us         3/4           1/2
  your       3/4           1/2
  account    1/4           0/2

Naïve Bayes Spam Filter
- Using the word probabilities from the previous slide, classify the new email "review us":
  p(review us | spam) = (1 − 2/4)(1/4)(1 − 3/4)(3/4)(1 − 3/4)(1 − 1/4)
  p(review us | notspam) = (1 − 1/2)(2/2)(1 − 1/2)(1/2)(1 − 1/2)(1 − 0/2)
  p(spam | review us) = p(review us | spam) p(spam) / [p(review us | spam) p(spam) + p(review us | notspam) p(notspam)]
  p(notspam | review us) = p(review us | notspam) p(notspam) / [p(review us | spam) p(spam) + p(review us | notspam) p(notspam)]
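As a check on the arithmetic above, a small Python sketch that plugs in the word probabilities from the table and the Bernoulli naïve Bayes likelihood for "review us" (keywords review and us present; password, send, your, account absent):

```python
# Word probabilities estimated from the table above
p_w_spam    = {"password": 2/4, "review": 1/4, "send": 3/4,
               "us": 3/4, "your": 3/4, "account": 1/4}
p_w_notspam = {"password": 1/2, "review": 2/2, "send": 1/2,
               "us": 1/2, "your": 1/2, "account": 0/2}
p_spam, p_notspam = 4/6, 2/6

present = {"review", "us"}   # bag-of-words features for "review us"

def likelihood(p_w):
    # Product over keywords: p if the word is present, (1 - p) if absent
    prob = 1.0
    for w, p in p_w.items():
        prob *= p if w in present else (1 - p)
    return prob

num_spam = likelihood(p_w_spam) * p_spam
num_notspam = likelihood(p_w_notspam) * p_notspam
print(num_spam / (num_spam + num_notspam))      # p(spam | review us)
print(num_notspam / (num_spam + num_notspam))   # p(notspam | review us)
```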

Naïve Bayes Classification
- Prediction:
  c_MAP = argmax_c P(c | x_1, x_2, ..., x_n)
        = argmax_c P(x_1, ..., x_n | c) P(c) / P(x_1, ..., x_n)
        = argmax_c P(x_1, ..., x_n | c) P(c)

Naïve Bayes classifier: Why?
- We can't estimate P(x_1, ..., x_n | c) directly using counts: most counts would be zero! (Curse of dimensionality.)
- What if count(x_i = 1, c) is 0? We would then never assign class c to examples with x_i = 1.
- So use
  P(x_i = 1 | c) ≈ (count(x_i = 1, c) + m·p) / (count(c) + m)
  - m is a parameter; interpretation: the number of "virtual" examples added to the training set.
  - p is a prior estimate for P(x_i = 1 | c).
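A minimal sketch of this smoothed estimate; the particular choices m = 1 and prior p = 0.5 are illustrative assumptions, not values from the slides:

```python
def smoothed_estimate(count_xi1_c, count_c, m=1.0, p=0.5):
    """m-estimate of P(x_i = 1 | c): (count(x_i=1, c) + m*p) / (count(c) + m)."""
    return (count_xi1_c + m * p) / (count_c + m)

# "account" never appears in a not-spam email in the toy table above:
print(smoothed_estimate(0, 2))   # ~0.17 rather than 0, so the posterior never collapses
print(smoothed_estimate(2, 4))   # 0.5, close to the raw count estimate 2/4
```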

Naïve Bayes classifier
- Pros:
  - Fast to train (a single pass through the data)
  - Fast to test
  - Less overfitting
  - Easy to add/remove classes
  - Handles partial data
- Cons:
  - When the naïve conditional independence assumption does not hold, it can perform much worse.

Naïve Bayes assumptions
- Conditional independence:
  - The appearances of "loan" and "diploma" need not be statistically independent in general. If both are indicative of the email being spam, they will not be statistically independent, since both are likely to appear in a spam email. This is useful! It's what we are counting on.
  - But if we know the email is spam, the fact that "dog" appears in it doesn't make it more or less likely that "homework" will appear. Does this sound like it would be true?
- Effect of conditional non-independence:
  - "homework" and "assignment" would count as two pieces of evidence for classifying the email, even though the second keyword doesn't add any information.

Naïve Bayes vs. Logistic Regression
- Compute the log-odds of y = c given the inputs using naïve Bayes:
  log [P(y = c | x_1, ..., x_p) / P(y = ¬c | x_1, ..., x_p)]
    = log [P(y = c) Π_j P(x_j | y = c)] / [P(y = ¬c) Π_j P(x_j | y = ¬c)]
    = log [P(y = c) / P(y = ¬c)] + Σ_j log [P(x_j | y = c) / P(x_j | y = ¬c)]
- We can write this as:
  log [P(y = c | x_1, ..., x_p) / P(y = ¬c | x_1, ..., x_p)] = β_0 + Σ_j β_j x_j
  with
  β_0 = log [P(y = c) / P(y = ¬c)] + Σ_j log [P(x_j = 0 | y = c) / P(x_j = 0 | y = ¬c)]
  β_j = log [P(x_j = 1 | y = c) / P(x_j = 1 | y = ¬c)] − log [P(x_j = 0 | y = c) / P(x_j = 0 | y = ¬c)]

Naïve Bayes vs. Logistic Regression
- Naïve Bayes model:
  log [P(y = c | x_1, ..., x_p) / P(y = ¬c | x_1, ..., x_p)] = β_0 + Σ_j β_j x_j
  for specific β_j's.
- Same form as the logistic regression model, but in LR the β_j are whatever maximizes the likelihood.
- Here we assumed x_j ∈ {0, 1}. What if the x_j are continuous?
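A short sketch (not from the slides) that turns hypothetical Bernoulli naïve Bayes estimates into the logistic-regression weights β_0 and β_j given by the formulas above, and checks that the sigmoid of the linear score matches the naïve Bayes posterior; the parameter values are made up:

```python
import numpy as np

# Hypothetical Bernoulli naive Bayes parameters for two classes c and not-c
p_c = 0.4                                   # P(y = c)
p_x1_c  = np.array([0.8, 0.3, 0.6])         # P(x_j = 1 | y = c)
p_x1_nc = np.array([0.2, 0.5, 0.4])         # P(x_j = 1 | y = not-c)

# Logistic-regression weights implied by the naive Bayes model
beta0 = (np.log(p_c / (1 - p_c))
         + np.sum(np.log((1 - p_x1_c) / (1 - p_x1_nc))))
beta = np.log(p_x1_c / p_x1_nc) - np.log((1 - p_x1_c) / (1 - p_x1_nc))

x = np.array([1, 0, 1])

# Posterior from the linear (logistic) form
sigmoid = lambda z: 1 / (1 + np.exp(-z))
p_lr = sigmoid(beta0 + beta @ x)

# Posterior computed directly from the naive Bayes model
lik_c  = np.prod(np.where(x == 1, p_x1_c,  1 - p_x1_c))
lik_nc = np.prod(np.where(x == 1, p_x1_nc, 1 - p_x1_nc))
p_nb = lik_c * p_c / (lik_c * p_c + lik_nc * (1 - p_c))

print(p_lr, p_nb)   # the two numbers agree
```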

Naïve Bayes vs. Logistic Regression
- Assume x_j ~ N(μ_{c,j}, σ²_j), with the variance σ²_j shared between the classes.
- Define β_j = (μ_{c,j} − μ_{¬c,j}) / σ²_j to again get:
  log [P(y = c | x_1, ..., x_p) / P(y = ¬c | x_1, ..., x_p)] = β_0 + Σ_j β_j x_j

Naïve Bayes vs. Logistic Regression
- log [P(y = c | x_1, ..., x_p) / P(y = ¬c | x_1, ..., x_p)] = β_0 + Σ_j β_j x_j
- Naïve Bayes: the ML estimate for this model if we assume that the features are independent given the class label.
  - Definitely better if the conditional independence assumption is true.
  - Possibly better if we want to constrain the model capacity and prevent overfitting.
- Logistic regression: the ML estimate for the model.
  - More capacity: arbitrary β_j's are allowed, so it might be able to fit the data better.
  - More likely to overfit.