Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes. Nicholas Ruozzi, University of Texas at Dallas. Based on the slides of Vibhav Gogate.

Last Time: parameter learning (learning the parameter of a simple coin-flipping model), prior distributions, posterior distributions. Today: more parameter learning and naïve Bayes.

Maximum Likelihood Estimation (MLE). Data: an observed set of $\alpha_H$ heads and $\alpha_T$ tails. Hypothesis: coin flips follow a binomial distribution. Learning: find the best $\theta$. MLE: choose $\theta$ to maximize the likelihood (the probability of $D$ given $\theta$): $\theta_{MLE} = \arg\max_\theta p(D \mid \theta)$.
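As a minimal illustration (my addition, not from the slides), applying the standard closed-form MLE for this coin-flipping model, $\theta_{MLE} = \alpha_H / (\alpha_H + \alpha_T)$:

```python
# Minimal sketch: MLE for a coin-flipping (Bernoulli/binomial) model.
# Uses the standard closed-form result theta_MLE = alpha_H / (alpha_H + alpha_T).

def mle_coin(alpha_H: int, alpha_T: int) -> float:
    """Return the maximum likelihood estimate of P(heads)."""
    return alpha_H / (alpha_H + alpha_T)

print(mle_coin(alpha_H=7, alpha_T=3))  # 0.7
```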

MAP Estimation. Choosing $\theta$ to maximize the posterior distribution is called maximum a posteriori (MAP) estimation: $\theta_{MAP} = \arg\max_\theta p(\theta \mid D)$. The only difference between $\theta_{MLE}$ and $\theta_{MAP}$ is that one assumes a uniform prior (MLE) and the other allows an arbitrary prior.
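For contrast, a hypothetical sketch (my addition) assuming a Beta($a$, $b$) prior on $\theta$, whose posterior mode gives the standard closed form $(\alpha_H + a - 1)/(\alpha_H + \alpha_T + a + b - 2)$; with a uniform prior ($a = b = 1$) it reduces to the MLE:

```python
# Sketch: MAP estimate of P(heads) under a Beta(a, b) prior.
# With a uniform prior (a = b = 1) this reduces to the MLE.

def map_coin(alpha_H: int, alpha_T: int, a: float = 1.0, b: float = 1.0) -> float:
    """Posterior mode of theta under a Beta(a, b) prior."""
    return (alpha_H + a - 1) / (alpha_H + alpha_T + a + b - 2)

print(map_coin(7, 3))            # 0.7, same as the MLE (uniform prior)
print(map_coin(7, 3, a=5, b=5))  # ~0.611, pulled toward 0.5 by the prior
```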

MLE for Gaussian Distributions. A two-parameter distribution characterized by a mean and a variance.

Some properties of Gaussians. Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian: if $X \sim \mathcal{N}(\mu, \sigma^2)$ and $Y = aX + b$, then $Y \sim \mathcal{N}(a\mu + b, a^2\sigma^2)$. The sum of independent Gaussians is Gaussian: if $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, and $Z = X + Y$, then $Z \sim \mathcal{N}(\mu_X + \mu_Y, \sigma_X^2 + \sigma_Y^2)$. Easy to differentiate, as we will see soon!
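A quick empirical sanity check of the affine-transformation claim (my addition; the specific numbers are arbitrary):

```python
import numpy as np

# Sketch: check that Y = 5*X + 1 with X ~ N(2, 3^2) looks like N(11, 15^2).
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=3.0, size=1_000_000)
y = 5.0 * x + 1.0
print(y.mean(), y.std())  # close to 11 and 15
```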

Learning a Gaussian. Collect data (hopefully i.i.d. samples), e.g., exam scores. Learn the parameters: mean $\mu$ and variance $\sigma^2$. Example data ($i$: exam score): 0: 85, 1: 95, 2: 100, 3: 12, ..., 99: 89.

MLE for a Gaussian: probability of i.i.d. samples $D = x^{(1)}, \ldots, x^{(N)}$:
$$p(D \mid \mu, \sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right)$$
Log-likelihood of the data:
$$\ln p(D \mid \mu, \sigma) = -\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}$$

MLE for the Mean of a Gaussian:
$$\frac{\partial}{\partial \mu} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \mu}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = \sum_{i=1}^{N} \frac{x^{(i)} - \mu}{\sigma^2} = \frac{\sum_i x^{(i)} - N\mu}{\sigma^2} = 0$$
$$\Rightarrow \quad \mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$$

MLE for the Variance:
$$\frac{\partial}{\partial \sigma} \ln p(D \mid \mu, \sigma) = \frac{\partial}{\partial \sigma}\left[-\frac{N}{2}\ln(2\pi\sigma^2) - \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{2\sigma^2}\right] = -\frac{N}{\sigma} + \sum_{i=1}^{N} \frac{(x^{(i)} - \mu)^2}{\sigma^3} = 0$$
$$\Rightarrow \quad \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$

Learning Gaussian parameters:
$$\mu_{MLE} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}, \qquad \sigma^2_{MLE} = \frac{1}{N} \sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$
The MLE for the variance of a Gaussian is biased: the expected result of the estimation is not the true parameter! Unbiased variance estimator:
$$\sigma^2_{unbiased} = \frac{1}{N-1} \sum_{i=1}^{N} \left(x^{(i)} - \mu_{MLE}\right)^2$$
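To make the bias concrete, a minimal NumPy sketch (my addition) comparing the two variance estimates on simulated data:

```python
import numpy as np

# Sketch: MLE vs. unbiased estimates of Gaussian parameters from i.i.d. samples.
rng = np.random.default_rng(0)
x = rng.normal(loc=80.0, scale=10.0, size=100)  # hypothetical exam scores

mu_mle = x.mean()                                         # (1/N) * sum of x_i
var_mle = ((x - mu_mle) ** 2).mean()                      # biased: divides by N
var_unbiased = ((x - mu_mle) ** 2).sum() / (len(x) - 1)   # divides by N - 1

print(mu_mle, var_mle, var_unbiased)
```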

Bayesian Categorization/Classification. Given features $x = (x_1, \ldots, x_m)$, predict a label $y$. If we had a joint distribution over $x$ and $y$, then given $x$ we could predict the label using MAP inference: $\arg\max_y p(y \mid x_1, \ldots, x_m)$. We can compute this in exactly the same way that we did before, using Bayes' rule:
$$p(y \mid x_1, \ldots, x_m) = \frac{p(x_1, \ldots, x_m \mid y)\, p(y)}{p(x_1, \ldots, x_m)}$$

Article Classification. Given a collection of news articles labeled by topic, the goal is, given a new article, to predict its topic. One possible feature vector: one feature for each word in the document, in order, where $x_i$ corresponds to the $i$-th word and can take a different value for each word in the dictionary.

Text Classification. $x_1, x_2, \ldots$ is the sequence of words in the document. The set of all possible feature vectors (and hence $p(y \mid x)$) is huge: an article has at least 1,000 words, so $x = (x_1, \ldots, x_{1000})$, where $x_i$ represents the $i$-th word in the document and can be any word in a dictionary of at least 10,000 words. That gives $10{,}000^{1000} = 10^{4000}$ possible values. For comparison, there are about $10^{80}$ atoms in the universe.

Bag of Words Model. Typically assume that position in the document doesn't matter: $p(X_i = x_i \mid Y = y) = p(X_k = x_i \mid Y = y)$, i.e., all positions have the same distribution. This ignores the order of words. It sounds like a bad assumption, but it often works well! Features: the set of all possible words and their corresponding frequencies (the number of times each occurs in the document).

Bag of Words (example counts): aardvark 0, about 2, all 2, Africa 1, apple 0, anxious 0, ..., gas 1, ..., oil 1, Zaire 0.
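A minimal sketch (not from the slides) of building such a word-count feature vector in Python:

```python
from collections import Counter

# Sketch: bag-of-words counts for a document, ignoring word order.
def bag_of_words(document: str) -> Counter:
    """Map each word to the number of times it occurs in the document."""
    return Counter(document.lower().split())

print(bag_of_words("oil prices rose as gas demand in Zaire fell and oil analysts worried"))
```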

Need to Simplify Somehow. Even with the bag of words assumption, there are too many possible outcomes: too many probabilities $p(x_1, \ldots, x_m \mid y)$. Can we assume some are the same? $p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$. This is a conditional independence assumption.

Conditional Independence. $X$ is conditionally independent of $Y$ given $Z$ if the probability distribution of $X$ is independent of the value of $Y$, given the value of $Z$. Equivalently, $p(X \mid Y, Z) = p(X \mid Z)$, or $p(X, Y \mid Z) = p(X \mid Z)\, p(Y \mid Z)$.

Naïve Bayes. The naïve Bayes assumption: features are independent given the class label, $p(x_1, x_2 \mid y) = p(x_1 \mid y)\, p(x_2 \mid y)$, and more generally
$$p(x_1, \ldots, x_m \mid y) = \prod_{i=1}^{m} p(x_i \mid y)$$
How many parameters now? Suppose $x$ is composed of $m$ binary features. (Under the naïve Bayes assumption, each class needs only $m$ numbers $p(x_i = 1 \mid y)$, instead of $2^m - 1$ for the full joint over the features.)

The Naïve Bayes Classifier. Given: a prior $p(y)$; $m$ features $X_1, \ldots, X_m$ that are conditionally independent given the class $Y$ (as a graphical model, $Y$ is the parent of $X_1, X_2, \ldots, X_m$); and, for each $X_i$, a likelihood $p(X_i \mid Y)$. Classify via
$$y^* = h_{NB}(x) = \arg\max_y p(y)\, p(x_1, \ldots, x_m \mid y) = \arg\max_y p(y) \prod_{i=1}^{m} p(x_i \mid y)$$
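As an illustration (my addition), a sketch of this classification rule in log space, which avoids numerical underflow from multiplying many small probabilities; the `prior`/`likelihood` dictionary layout here is an assumption for the example:

```python
import math

# Sketch: naive Bayes prediction in log space.
# prior[y] = p(y); likelihood[y][i][v] = p(X_i = v | Y = y).

def predict_nb(x, prior, likelihood):
    best_y, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for i, v in enumerate(x):
            p = likelihood[y][i].get(v, 0.0)
            if p == 0.0:
                score = -math.inf  # an unseen (value, class) pair zeroes out the product
                break
            score += math.log(p)
        if score > best_score:
            best_y, best_score = y, score
    return best_y
```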

MLE for the Parameters of NB. Given a dataset, count occurrences for all pairs: $\mathrm{Count}(X_i = x_i, Y = y)$ is the number of samples in which $X_i = x_i$ and $Y = y$. MLE for discrete NB:
$$p(Y = y) = \frac{\mathrm{Count}(Y = y)}{\sum_{y'} \mathrm{Count}(Y = y')}, \qquad p(X_i = x_i \mid Y = y) = \frac{\mathrm{Count}(X_i = x_i, Y = y)}{\sum_{x_i'} \mathrm{Count}(X_i = x_i', Y = y)}$$
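A counting-based sketch of these MLE formulas (my addition), producing the same hypothetical prior/likelihood layout used in the prediction sketch above:

```python
from collections import Counter, defaultdict

# Sketch: MLE for discrete naive Bayes via counting (no smoothing yet).
# `data` is assumed to be a list of (features, label) pairs with discrete features.
# Returns prior[y] = p(y) and likelihood[y][i][v] = p(X_i = v | Y = y).

def fit_nb(data):
    label_counts = Counter(y for _, y in data)
    counts = defaultdict(lambda: defaultdict(Counter))  # y -> i -> value counts
    for x, y in data:
        for i, v in enumerate(x):
            counts[y][i][v] += 1

    n = len(data)
    prior = {y: c / n for y, c in label_counts.items()}
    likelihood = {
        y: {i: {v: c / label_counts[y] for v, c in vc.items()}
            for i, vc in feats.items()}
        for y, feats in counts.items()
    }
    return prior, likelihood

# Tiny hypothetical usage:
data = [(("sunny", "hot"), "play"), (("rainy", "cool"), "stay")]
prior, likelihood = fit_nb(data)
print(prior)  # {'play': 0.5, 'stay': 0.5}
```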

Naïve Bayes Calculations
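The worked example from this slide is not reproduced in the transcript. As a hypothetical stand-in with made-up numbers: suppose two classes, spam and ham, with $p(\text{spam}) = 0.4$, $p(\text{ham}) = 0.6$, and two word features with $p(\text{free} \mid \text{spam}) = 0.5$, $p(\text{free} \mid \text{ham}) = 0.1$, $p(\text{meeting} \mid \text{spam}) = 0.05$, $p(\text{meeting} \mid \text{ham}) = 0.3$. For a document containing both words,
$$p(\text{spam})\, p(\text{free} \mid \text{spam})\, p(\text{meeting} \mid \text{spam}) = 0.4 \times 0.5 \times 0.05 = 0.01$$
$$p(\text{ham})\, p(\text{free} \mid \text{ham})\, p(\text{meeting} \mid \text{ham}) = 0.6 \times 0.1 \times 0.3 = 0.018$$
so the classifier predicts ham (after normalizing, $p(\text{ham} \mid x) = 0.018 / 0.028 \approx 0.64$).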

Subtleties of the NB Classifier: #1. Usually, features are not conditionally independent: $p(x_1, \ldots, x_m \mid y) \neq \prod_{i=1}^{m} p(x_i \mid y)$. The naïve Bayes assumption is often violated, yet it performs surprisingly well in many cases. A plausible reason: we only need the probability of the correct class to be the largest! Example: in binary classification, we just need to figure out the correct side of 0.5, not the actual probability (0.51 is the same as 0.99).

Subtleties of the NB Classifier: #2. What if you never see a training instance with $(X_1 = a, Y = b)$? Example: you never saw the word "Enlargement" in spam. Then the MLE gives $p(X_1 = a \mid Y = b) = 0$, and thus, no matter what values $X_2, \ldots, X_m$ take, $p(X_1 = a, X_2 = x_2, \ldots, X_m = x_m \mid Y = b) = 0$. Why? Because the naïve Bayes likelihood is a product of per-feature terms, so a single zero factor zeroes out the whole product.

Subtleties of the NB Classifier: #2. To fix this, use a prior! We already saw how to do this in the coin-flipping example using the Beta distribution. For NB over discrete spaces, we can use the Dirichlet prior. The Dirichlet distribution is a distribution over $z_1, \ldots, z_k \in (0, 1)$ such that $z_1 + \cdots + z_k = 1$, characterized by $k$ parameters $\alpha_1, \ldots, \alpha_k$:
$$f(z_1, \ldots, z_k; \alpha_1, \ldots, \alpha_k) \propto \prod_{i=1}^{k} z_i^{\alpha_i - 1}$$
This is called smoothing. What are the MAP estimates under these kinds of priors?
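As one concrete instance (my addition): add-$\alpha$ smoothing, which can be derived from a symmetric Dirichlet prior (it is the posterior mean; $\alpha = 1$ gives classic Laplace smoothing):

```python
# Sketch: add-alpha (Laplace) smoothing for a discrete NB likelihood.
# count[v] is Count(X_i = v, Y = y); values is the set of possible values of X_i.
# alpha = 1 gives classic Laplace smoothing; alpha = 0 recovers the plain MLE.

def smoothed_likelihood(count: dict, values: set, class_count: int, alpha: float = 1.0) -> dict:
    denom = class_count + alpha * len(values)
    return {v: (count.get(v, 0) + alpha) / denom for v in values}

print(smoothed_likelihood({"free": 3}, {"free", "meeting"}, class_count=3))
# "meeting" now gets probability 0.2 instead of 0
```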