Generative Learning. INFO-4604, Applied Machine Learning, University of Colorado Boulder. November 29, 2018. Prof. Michael Paul


Generative vs Discriminative. The classification algorithms we have seen so far are called discriminative algorithms: they learn to discriminate (i.e., distinguish/separate) between classes. Generative algorithms instead learn the characteristics of each class, then predict an instance's class based on which class it best matches. Generative models can also be used to randomly generate instances of a class.

Generative vs Discriminative. A high-level way to think about the difference: generative models use absolute descriptions of classes, while discriminative models use relative descriptions. Example: classifying cats vs dogs. Generative perspective: cats weigh 10 pounds on average; dogs weigh 50 pounds on average. Discriminative perspective: dogs weigh 40 pounds more than cats on average.

Generative vs Discriminative. The difference between the two is often defined probabilistically. Generative models: the algorithm learns P(X | Y), then converts it to P(Y | X) to make a prediction. Discriminative models: the algorithm learns P(Y | X), which can be used directly for prediction.

Generative vs Discriminative. While discriminative models are often not probabilistic (though they can be, as with logistic regression), generative models usually are.

Example. Classify cat vs dog based on weight. Cats have a mean weight of 10 pounds (stddev 2); dogs have a mean weight of 50 pounds (stddev 20). We could model the probability of the weight with a normal distribution: a Normal(10, 2) distribution for cats and a Normal(50, 20) distribution for dogs. Strictly this is a probability density rather than a probability, but we will refer to it as a probability in this lecture.

Example. Classify an animal that weighs 14 pounds. P(weight=14 | animal=cat) = 0.027; P(weight=14 | animal=dog) = 0.004.
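These two density values can be reproduced with a one-line calculation per class. A minimal sketch in Python, assuming SciPy is available and using the Normal(10, 2) and Normal(50, 20) models from the previous slide:

from scipy.stats import norm

weight = 14
p_weight_given_cat = norm.pdf(weight, loc=10, scale=2)    # density under the Normal(10, 2) cat model
p_weight_given_dog = norm.pdf(weight, loc=50, scale=20)   # density under the Normal(50, 20) dog model

print(round(p_weight_given_cat, 3))  # 0.027
print(round(p_weight_given_dog, 3))  # 0.004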

Example. Classify an animal that weighs 14 pounds. P(weight=14 | animal=cat) = 0.027; P(weight=14 | animal=dog) = 0.004. Choosing the Y that gives the highest P(X | Y) is reasonable, but not quite the right thing to do. What if dogs were 99 times more common than cats in your dataset? That would affect the probability of being a cat versus a dog.

Bayes Theorem. We have P(X | Y), but we really want P(Y | X). Bayes' theorem (or Bayes' rule): P(B | A) = P(A | B) P(B) / P(A).

Naïve Bayes. Naïve Bayes is a classification algorithm that classifies an instance based on P(Y | X), where P(Y | X) is calculated using Bayes' rule: P(Y | X) = P(X | Y) P(Y) / P(X). Why "naïve"? We'll come back to that.

Naïve Bayes. In Bayes' rule P(Y | X) = P(X | Y) P(Y) / P(X), the term P(Y) is called the prior probability of Y. It is usually just calculated as the percentage of training instances labeled as Y.

Naïve Bayes. In Bayes' rule P(Y | X) = P(X | Y) P(Y) / P(X), the left-hand side P(Y | X) is called the posterior probability of Y: the conditional probability of Y given an instance X.

Naïve Bayes. In Bayes' rule P(Y | X) = P(X | Y) P(Y) / P(X), the conditional probability P(X | Y) is what needs to be learned.

Naïve Bayes. In Bayes' rule P(Y | X) = P(X | Y) P(Y) / P(X), what about P(X), the probability of observing the data? It doesn't actually matter: P(X) is the same regardless of Y, so it doesn't change which Y has the highest probability.

Example. Classify an animal that weighs 14 pounds. Also: dogs are 99 times more common than cats in the data. P(weight=14 | animal=cat) = 0.027. P(animal=cat | weight=14) = ?

Example. Classify an animal that weighs 14 pounds. Also: dogs are 99 times more common than cats in the data. P(weight=14 | animal=cat) = 0.027, so P(animal=cat | weight=14) ∝ P(weight=14 | animal=cat) P(animal=cat) = 0.027 * 0.01 = 0.00027.

Example. Classify an animal that weighs 14 pounds. Also: dogs are 99 times more common than cats in the data. P(weight=14 | animal=dog) = 0.004, so P(animal=dog | weight=14) ∝ P(weight=14 | animal=dog) P(animal=dog) = 0.004 * 0.99 = 0.00396.

Example. Classify an animal that weighs 14 pounds, where dogs are 99 times more common than cats in the data. Since P(animal=dog | weight=14) > P(animal=cat | weight=14), you should classify the animal as a dog.
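To make the arithmetic on these slides concrete, here is a minimal sketch that multiplies each likelihood by its prior and picks the larger score (dividing both scores by P(weight=14) would rescale them equally and would not change the winner):

# Unnormalized posteriors: P(weight=14 | class) * P(class).
likelihood = {"cat": 0.027, "dog": 0.004}   # from the Normal(10, 2) and Normal(50, 20) models
prior      = {"cat": 0.01,  "dog": 0.99}    # dogs are 99 times more common than cats

score = {y: likelihood[y] * prior[y] for y in likelihood}
print(score)                      # cat: ~0.00027, dog: ~0.00396
print(max(score, key=score.get))  # dog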

Naïve Bayes. Learning: estimate P(X | Y) from the data; estimate P(Y) from the data. Prediction: choose the Y that maximizes P(X | Y) P(Y).

Naïve Bayes. Learning: estimate P(X | Y) from the data (how?); estimate P(Y) from the data, usually just calculated as the percentage of training instances labeled as Y.

Naïve Bayes. Learning: estimate P(X | Y) from the data, which requires some decisions (and some math); estimate P(Y) from the data, usually just calculated as the percentage of training instances labeled as Y.

Defining P(X | Y). With continuous features, a normal distribution is a common way to define P(X | Y), but keep in mind that this is only an approximation: the true probability might be something different. Other probability distributions exist that you can use instead (not discussed here). With discrete features, the observed distribution (i.e., the proportion of instances with each value) is usually used as-is.
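As an illustration of how these quantities might be estimated with a continuous feature, here is a minimal Python sketch; the tiny dataset and variable names are invented for illustration, not taken from the lecture:

import numpy as np

# Tiny made-up training set: one continuous feature (weight in pounds) and a class label.
weights = np.array([9.0, 11.0, 12.0, 45.0, 55.0, 60.0])
labels  = np.array(["cat", "cat", "cat", "dog", "dog", "dog"])

params = {}
for y in np.unique(labels):
    x = weights[labels == y]
    params[y] = {
        "prior": float(np.mean(labels == y)),  # P(Y=y): fraction of training instances labeled y
        "mean":  float(x.mean()),              # parameters of the Normal(mean, std) model of P(X | Y=y)
        "std":   float(x.std(ddof=1)),
    }
print(params)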

Defining P(X | Y). Another complication: instances are usually vectors of many features. How do you define the probability of an entire feature vector?

Joint Probability. The probability of multiple variables is called the joint probability. Example: if you roll two dice, what's the probability that they both land on 5?

Joint Probability. There are 36 possible outcomes:
1,1  2,1  3,1  4,1  5,1  6,1
1,2  2,2  3,2  4,2  5,2  6,2
1,3  2,3  3,3  4,3  5,3  6,3
1,4  2,4  3,4  4,4  5,4  6,4
1,5  2,5  3,5  4,5  5,5  6,5
1,6  2,6  3,6  4,6  5,6  6,6

Joint Probability. Probability of two 5s: 1/36.

Joint Probability. Probability the first die is a 5 and the second is anything but 5: 5/36.

Joint Probability. A quicker way to calculate this: the joint probability of two variables is the product of the probability of each individual variable. (This is only true if the two variables are independent, defined on the next slide.) Probability of one die landing on 5: 1/6. Joint probability of two dice landing on 5 and 5: 1/6 * 1/6 = 1/36.

Joint Probability. The same shortcut applies: probability of one die landing on anything but 5: 5/6. Joint probability of the first die landing on 5 and the second on anything but 5: 1/6 * 5/6 = 5/36.
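Both fractions can be checked by brute force, enumerating the 36 equally likely outcomes:

# Check the dice probabilities by enumerating all 36 equally likely outcomes.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
both_fives    = sum(1 for a, b in outcomes if a == 5 and b == 5)
five_then_not = sum(1 for a, b in outcomes if a == 5 and b != 5)
print(both_fives, "out of", len(outcomes))     # 1 out of 36
print(five_then_not, "out of", len(outcomes))  # 5 out of 36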

Independence. Multiple variables are independent if knowing the outcome of one does not change the probability of another. If I tell you that the first die landed on 5, it shouldn't change your belief about the outcome of the second (every side still has 1/6 probability). Dice rolls are independent.

Conditional Independence. Naïve Bayes treats the feature probabilities as independent (conditioned on Y): P(X1, X2, ..., XM | Y) = P(X1 | Y) * P(X2 | Y) * ... * P(XM | Y). Features are usually not actually independent! Treating them as if they are is what makes the method "naïve", but it's often a good enough approximation, and it makes the calculation much easier.
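In code, this assumption means a class score is a product of one term per feature (in practice, a sum of logs, to avoid numerical underflow). A minimal sketch with Gaussian per-feature models; the two features and all parameter values here are hypothetical:

import math

def log_gaussian(x, mean, std):
    # log of the Normal(mean, std) density evaluated at x
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

# Hypothetical per-class parameters for two features (weight, height), made up for illustration:
# each class has a prior and a (mean, std) pair per feature.
params = {
    "cat": {"prior": 0.01, "features": [(10, 2), (9, 1)]},
    "dog": {"prior": 0.99, "features": [(50, 20), (22, 5)]},
}

def score(x, y):
    # log P(Y=y) + sum over features of log P(X_i | Y=y), using the naive independence assumption
    s = math.log(params[y]["prior"])
    for value, (mean, std) in zip(x, params[y]["features"]):
        s += log_gaussian(value, mean, std)
    return s

x = [14, 10]  # weight=14, height=10
print({y: score(x, y) for y in params})        # per-class log scores
print(max(params, key=lambda y: score(x, y)))  # picks whichever class has the higher total log score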

Conditional Independence. Important distinction: the features have conditional independence; the independence assumption only applies to the conditional probabilities P(X | Y). Conditional independence: P(X1, X2 | Y) = P(X1 | Y) * P(X2 | Y). It is not necessarily true that P(X1, X2) = P(X1) * P(X2).
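This distinction can be verified numerically on a small made-up distribution in which two binary features are conditionally independent given Y but not marginally independent:

# Y is 0 or 1 with equal probability; given Y, each of two binary features is 1 with
# probability 0.9 (if Y=1) or 0.1 (if Y=0), independently of the other feature.
p_y = {0: 0.5, 1: 0.5}
p_x1_given_y = {0: 0.1, 1: 0.9}   # P(X_i = 1 | Y = y), same for both features

p_x1 = sum(p_y[y] * p_x1_given_y[y] for y in p_y)          # P(X1 = 1) = 0.5 (same for X2)
p_joint = sum(p_y[y] * p_x1_given_y[y] ** 2 for y in p_y)  # P(X1 = 1, X2 = 1)

print(p_x1 * p_x1)  # 0.25: product of the marginals
print(p_joint)      # ~0.41: the true joint, so X1 and X2 are not marginally independent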

Conditional Independence. Example: suppose you are classifying the category of a news article using word features. If you observe the word "baseball", this increases the likelihood that the word "homerun" will appear in the same article, so these two features are clearly not independent. But if you already know the article is about baseball (Y=baseball), then observing the word "baseball" doesn't change the probability of observing other baseball-related words.

Defining P(X | Y). Naïve Bayes is most often used with discrete features. With discrete features, the probability of a particular feature value is usually calculated as: (# of times the feature has that value) / (total # of occurrences of the feature).

Document Classification. Naïve Bayes is often used for document classification: given the document class, what is the probability of observing the words in the document?

Document Classification. Example with 3 documents: "the water is cold", "the pig went home", "the home is cold". Word probabilities: P("the") = 3/12, P("is") = 2/12, P("home") = 2/12, P("cold") = 2/12, P("water") = 1/12, P("went") = 1/12, P("pig") = 1/12. Then P("the water is cold") = P("the") P("water") P("is") P("cold").

Document Classification. Using the same 3 documents and word probabilities, now consider P("the water is very cold") = P("the") P("water") P("is") P("very") P("cold").

Document Classification. The word "very" never appears in the 3 training documents, so P("very") = 0/12, and therefore P("the water is very cold") = P("the") P("water") P("is") P("very") P("cold") = 0.

Document Classification. One trick: pretend every value occurred one more time than it actually did.

Document Classification. Adding one to each count gives P("the") = 4/12, P("is") = 3/12, P("home") = 3/12, P("cold") = 3/12, P("water") = 2/12, P("went") = 2/12, P("pig") = 2/12, P("very") = 1/12 — but these no longer sum to 1.

Document Classification. We need to adjust both the numerator and the denominator: P("the") = 4/20, P("is") = 3/20, P("home") = 3/20, P("cold") = 3/20, P("water") = 2/20, P("went") = 2/20, P("pig") = 2/20, P("very") = 1/20.
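The probabilities on the last few slides can be reproduced by counting words over the three example documents, with and without the add-one trick:

from collections import Counter

docs = ["the water is cold", "the pig went home", "the home is cold"]
counts = Counter(word for doc in docs for word in doc.split())
total = sum(counts.values())             # 12 word occurrences
vocab = sorted(set(counts) | {"very"})   # 8 word types, including "very" (never observed)

for w in vocab:
    unsmoothed = counts[w] / total                      # e.g. P("the") = 3/12, P("very") = 0/12
    smoothed = (counts[w] + 1) / (total + len(vocab))   # add-one: P("the") = 4/20, P("very") = 1/20
    print(w, round(unsmoothed, 3), round(smoothed, 3))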

Smoothing. Adding pseudocounts to the observed counts when estimating P(X | Y) is called smoothing. Smoothing makes the estimated probabilities less extreme, and it is one way to perform regularization in Naïve Bayes (i.e., to reduce overfitting).

Generative vs Discriminative. The conventional wisdom is that discriminative models generally perform better, because they directly model what you care about: P(Y | X). When should you use generative models? Generative models have been shown to need less training data to reach peak performance, and they are more conducive to unsupervised and semi-supervised learning. More on that point next week.