Naive Bayes Classifier. Danushka Bollegala


Bayes Rule (slide 2)
The probability of hypothesis H, given evidence E:
P(H|E) = P(E|H)P(H)/P(E)
Terminology:
P(E): marginal probability of the evidence E
P(H): prior probability of the hypothesis H
P(E|H): likelihood of the evidence given the hypothesis
P(H|E): posterior probability of the hypothesis H

Example (slide 3)
Meningitis causes a stiff neck 50% of the time. Meningitis occurs with probability 1/50000 and a stiff neck with probability 1/20. What is the probability of meningitis, given that the patient has a stiff neck?
Let H = meningitis and E = stiff neck, so P(H) = 1/50000, P(E) = 1/20, P(E|H) = 0.5.
From Bayes rule we have P(H|E) = P(E|H)P(H)/P(E) = 0.0002.
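
As a quick check, the arithmetic above can be reproduced in a few lines of Python. This is only a sketch of the calculation on this slide, with the three probabilities copied from it.

```python
# Bayes rule for the meningitis example: P(H|E) = P(E|H) P(H) / P(E)
p_h = 1 / 50000        # prior P(H): meningitis
p_e = 1 / 20           # marginal P(E): stiff neck
p_e_given_h = 0.5      # likelihood P(E|H): stiff neck given meningitis

p_h_given_e = p_e_given_h * p_h / p_e
print(p_h_given_e)     # 0.0002
```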

Intuition/Derivation (slide 4)
[Diagram: hypothesis H (with prior P(H)) linked to evidences E1 and E2, with marginals P(E1), P(E2) and conditionals P(H|E1), P(H|E2).]
H can be related to several evidences E1, E2, ..., En:
P(H) = P(E1)P(H|E1) + P(E2)P(H|E2) + ... + P(En)P(H|En)   (assuming E1, ..., En partition the space of outcomes)
By the definition of conditional probability we have
P(H|E) = P(H,E)/P(E) and P(E|H) = P(H,E)/P(H)
Dividing the first equation by the second and rearranging, we get P(H|E) = P(E|H)P(H)/P(E).

Bayes Rule: Proportional Form (slide 5)
Often the evidence is given (P(E) is fixed) and we need to select from a set of hypotheses h1, h2, ..., hk. In such cases we can simplify the formula to
P(H|E) ∝ P(E|H)P(H)
posterior ∝ likelihood × prior
At least remember this form!

Naive Bayes (slide 6)
Let us assume a particular hypothesis H depends on several evidences E1, E2, ..., En. From Bayes rule we have
P(H|E1,E2,...,En) ∝ P(E1,E2,...,En|H)P(H)
Let us further assume that, given the hypothesis H, the evidences are mutually independent. Then we can decompose the likelihood term:
P(H|E1,E2,...,En) ∝ P(E1|H) × ... × P(En|H) × P(H)
This independence assumption is what makes Naive Bayes so naive!
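
The factorized form is easy to express in code. The sketch below is purely illustrative: the helper name and the prior/likelihood numbers are invented, not part of the slides, and the score it returns is the unnormalized posterior P(H) × prod_i P(Ei|H).

```python
from math import prod

def naive_bayes_score(prior, likelihoods):
    """Unnormalized posterior score P(H) * prod_i P(E_i|H),
    relying on the naive (conditional independence) assumption."""
    return prior * prod(likelihoods)

# Hypothetical numbers, purely for illustration:
print(naive_bayes_score(prior=0.3, likelihoods=[0.2, 0.5, 0.9]))  # 0.027
```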

Independent Events (slide 7)
Joint probability: P(A,B) = P(A|B)P(B). This holds for ANY two random events A and B, irrespective of whether they are independent or not.
But if A is independent of B, then B's occurrence has no consequence on A: P(A|B) = P(A).
Therefore, when A and B are independent, P(A,B) = P(A)P(B).

Being naive makes life easy (slide 8)
Let H = engine-does-not-start, with evidences A = weak-battery and B = no-gas.
P(H|A,B) = P(A,B|H)P(H)/P(A,B)
We must estimate the likelihood P(A,B|H). If A and B are mutually independent given H, then P(A,B|H) = P(A|H)P(B|H).
P(A|H) can be estimated by counting how many cars had an engine that did not start because of a weak battery, and P(B|H) by counting how many had an engine that did not start because of no gas.
On the other hand, if we tried to estimate P(A,B|H) directly, we would need to find how many cars had engines that did not start due to a weak battery and no gas. Such cases could be rare, making our estimate of P(A,B|H) unreliable or zero (in the worst case). Making the independence assumption makes estimation possible in practice.

Predicting whether play=yes (slide 9)
Frequency counts and class-conditional probabilities from the weather dataset (14 instances):

Outlook       Yes  No   P(.|Yes)  P(.|No)
  Sunny        2    3     2/9       3/5
  Overcast     4    0     4/9       0/5
  Rainy        3    2     3/9       2/5

Temperature   Yes  No   P(.|Yes)  P(.|No)
  Hot          2    2     2/9       2/5
  Mild         4    2     4/9       2/5
  Cool         3    1     3/9       1/5

Humidity      Yes  No   P(.|Yes)  P(.|No)
  High         3    4     3/9       4/5
  Normal       6    1     6/9       1/5

Windy         Yes  No   P(.|Yes)  P(.|No)
  False        6    2     6/9       2/5
  True         3    3     3/9       3/5

Play          Yes  No
  Counts       9    5
  Prior       9/14 5/14

To play or not to play (slide 10)
Given a test instance x = (outlook=sunny, temp=cool, humidity=high, windy=true):
P(play=yes|x) ∝ P(x|play=yes)P(play=yes)
  = P(outlook=sunny|play=yes) × P(temp=cool|play=yes) × P(humidity=high|play=yes) × P(windy=true|play=yes) × P(play=yes)
  = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
P(play=no|x) ∝ P(x|play=no)P(play=no)
  = P(outlook=sunny|play=no) × P(temp=cool|play=no) × P(humidity=high|play=no) × P(windy=true|play=no) × P(play=no)
  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
Since 0.0206 > 0.0053, the predicted class is play=no.

Computing probabilities (slide 11)
Note that P(play=yes|x) ∝ 0.0053 and P(play=no|x) ∝ 0.0206. How can we compute the actual probabilities?
Note that P(play=yes|x) + P(play=no|x) = 1. Therefore,
P(play=yes|x) = 0.0053 / (0.0053 + 0.0206) ≈ 0.20, and P(play=no|x) ≈ 0.80.
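
A short Python sketch of slides 10-11: the class-conditional probabilities and priors are read off the frequency table on slide 9, the two unnormalized scores are multiplied out, and then normalized into posteriors.

```python
from math import prod

# Class-conditional probabilities for x = (sunny, cool, high, windy=true),
# read off the frequency table on slide 9.
likelihood_yes = [2/9, 3/9, 3/9, 3/9]   # P(sunny|yes), P(cool|yes), P(high|yes), P(true|yes)
likelihood_no  = [3/5, 1/5, 4/5, 3/5]   # the same conditionals for play=no
prior_yes, prior_no = 9/14, 5/14

score_yes = prod(likelihood_yes) * prior_yes   # ~0.0053
score_no  = prod(likelihood_no)  * prior_no    # ~0.0206

# Normalise the two unnormalized scores into posteriors.
p_yes = score_yes / (score_yes + score_no)
print(score_yes, score_no, p_yes)              # p_yes ~ 0.20, so play=no
```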

Sometimes it is too naive (slide 12)
The Naive Bayes assumption that the features are independent given the hypothesis is sometimes too naive to be true. For example, the probability of Liverpool winning a football match is not independent of the probability of each member of the team scoring a goal.
However, as we saw in a previous slide, it gives us a method to estimate the joint distribution of a set of random variables without running into data sparseness issues.
The linear classifiers we studied in the module so far, such as the perceptron, also make similar assumptions about feature independence (the activation score is a linearly weighted sum, after all):
log P(A,B|H) = log(P(A|H) × P(B|H)) = log P(A|H) + log P(B|H)
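
The identity above is easy to verify numerically, and it is also why Naive Bayes is usually implemented with sums of log-probabilities rather than products. The probabilities in this sketch are hypothetical, chosen only to illustrate the point.

```python
import math

# Hypothetical class-conditional probabilities for two features A and B.
p_a_given_h, p_b_given_h = 0.2, 0.7

lhs = math.log(p_a_given_h * p_b_given_h)             # log P(A,B|H) under independence
rhs = math.log(p_a_given_h) + math.log(p_b_given_h)   # sum of per-feature log terms
print(lhs, rhs)   # equal (up to floating point): the log-score is a sum over features

# Working with sums of logs also avoids numerical underflow with many features.
many = [1e-4] * 200
print(math.prod(many))                     # 0.0 (the product underflows)
print(sum(math.log(p) for p in many))      # about -1842.1, still usable
```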

Zero probabilities (slide 13)
Issue: if a feature value never co-occurs with a class value, the probability estimated for it will be 0. E.g., P(outlook=overcast|play=no) = 0/5.
The other features will then be ignored, as the final product is multiplied by 0. This is bad for our 4-feature dataset, but terrible for (say) a 1000-feature dataset.
In text classification we often encounter situations where a feature does not occur at all in a particular class.

Laplace Smoothing (slide 14)
We can borrow some probability mass from high-probability features and redistribute it among zero-probability features, so that no feature has zero probability. This is called smoothing.
There are numerous smoothing techniques based on different policies. As long as the total probability mass remains unchanged, any policy of probability reassignment is valid.
A popular method is Laplace smoothing: add 1 to all counts to avoid zeros.
P(w) = (count(w) + 1) / (N + |V|)
count(w): the actual count (before smoothing) of a word w
N: corpus size in words, i.e., N = Σ_{w∈V} count(w)
|V|: vocabulary size (how many different words we have)
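
A minimal sketch of add-one smoothing follows, with made-up word counts (deliberately not the ones in the quiz below); the function name and the counts are illustrative, not from the slides.

```python
from collections import Counter

def laplace_probs(counts, vocab):
    """Add-one (Laplace) smoothing: P(w) = (count(w) + 1) / (N + |V|)."""
    n = sum(counts.values())                     # corpus size N
    v = len(vocab)                               # vocabulary size |V|
    return {w: (counts[w] + 1) / (n + v) for w in vocab}

counts = Counter({"burger": 5, "awesome": 3})    # "ate" never observed
print(laplace_probs(counts, vocab={"burger", "awesome", "ate"}))
# burger: 6/11, awesome: 4/11, ate: 1/11 -- no zero probabilities remain
```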

Quiz (slide 15)
Let count(cat)=9, count(dog)=0, and count(rabbit)=1. Compute the probabilities P(cat), P(dog), and P(rabbit). Now smooth the above probabilities using Laplace smoothing.

Document Classification (slide 16)
A document can be represented using a bag of words (features such as unigrams and bigrams). We can represent a document by a vector in which each element is the total frequency of a feature in the document.
D = "the burger i ate was an awesome burger"
v(D) = {the:1, burger:2, i:1, ate:1, was:1, an:1, awesome:1}
Assuming the features to be independent (the naive assumption), we can compute the likelihood (probability) of this document D, p(D), as follows:
p(D) = p(the)^1 × p(burger)^2 × p(i)^1 × p(ate)^1 × p(was)^1 × p(an)^1 × p(awesome)^1
If a word w occurs n times in D, then the term corresponding to p(w) appears n times in the product. Therefore, we have p(w)^n in the likelihood computation above.
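
The bag-of-words likelihood above can be sketched as follows. The unigram probabilities p(w) here are invented placeholders; in practice they would be estimated from a corpus (with smoothing).

```python
from collections import Counter
from math import prod

D = "the burger i ate was an awesome burger"
bag = Counter(D.split())   # {'the': 1, 'burger': 2, 'i': 1, 'ate': 1, ...}

# Hypothetical unigram probabilities, for illustration only.
p = {"the": 0.05, "burger": 0.01, "i": 0.04, "ate": 0.01,
     "was": 0.03, "an": 0.03, "awesome": 0.005}

# Naive assumption: p(D) = product over words of p(w)^count(w)
p_D = prod(p[w] ** n for w, n in bag.items())
print(bag, p_D)
```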

Document Classification (slide 17)
The Bayesian model is often used to classify documents, as it deals well with a huge number of features simultaneously.
Moreover, we might also know how many times each word occurs in the document (the vector representation in the previous slide). This leads to the Multinomial Naive Bayes model.
Assumptions:
The probability of a word occurring in a document is independent of its location within the document.
The document length is not related to the class.

Multinomial Distribution (slide 18)
This is an extension of the Binomial distribution to more than two outcomes.
Binomial distribution: what is the probability that, when I flip a coin n times, I get k heads (H) and (n-k) tails (T)?
P(H=k, T=n-k) = [n! / (k!(n-k)!)] p^k (1-p)^(n-k)
Multinomial distribution: let's roll a die instead of flipping a coin. There are six outcomes.
P(1=a, 2=b, ..., 6=f) = [n! / (a!b!...f!)] p(1)^a p(2)^b ... p(6)^f, where n = a + b + c + d + e + f

Document Classification (slide 19)
Multinomial Naive Bayes likelihood of a document x under class y:
P(x|y) = N! × Π_{w∈D} [ p(w|y)^h(w,D) / h(w,D)! ]
N = number of words in the document x
p(w|y) = probability that the word w occurs in class y
h(w,D) = total number of occurrences of the word w in document D
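
A sketch of this likelihood in Python; the function and the example words and probabilities are hypothetical, but the same routine could be applied to the quiz on slide 22 once the class-conditional probabilities from slide 21 are plugged in.

```python
from collections import Counter
from math import factorial, prod

def multinomial_likelihood(doc_words, word_probs):
    """P(x|y) = N! * prod_w [ p(w|y)^h(w,D) / h(w,D)! ]  (slide 19)."""
    counts = Counter(doc_words)
    n = sum(counts.values())
    coeff = factorial(n) / prod(factorial(c) for c in counts.values())
    return coeff * prod(word_probs[w] ** c for w, c in counts.items())

# Hypothetical two-word vocabulary, for illustration only.
print(multinomial_likelihood(["good", "good", "bad"],
                             {"good": 0.6, "bad": 0.4}))   # 3 * 0.36 * 0.4 = 0.432
```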

Classifying burger sentiment (slide 20)
Let us assume that we would like to classify whether the following document is positive (+1) or negative (-1) in sentiment.
D = "the burger i ate was an awesome burger"
Further assume we have seen these words in the positive and negative classes as follows. To make our computations easier, let us assume that we removed "the", "i", "was", "an" as stop words.

word      +1   -1
burger     3    2
ate        3    2
awesome    4    1

Computing class conditional probabilities (slide 21)

word      +1   -1   p(w|+1)   p(w|-1)
burger     3    2     3/10      2/5
ate        3    2     3/10      2/5
awesome    4    1     4/10      1/5
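
The probabilities in the table are simply the word counts normalized within each class; a small sketch follows (the dictionary layout is mine, the counts are from slide 20).

```python
# Word counts per class, from the table on slide 20.
counts = {
    "+1": {"burger": 3, "ate": 3, "awesome": 4},
    "-1": {"burger": 2, "ate": 2, "awesome": 1},
}

# p(w|y) = count(w, y) / total count of words in class y
cond_probs = {
    y: {w: c / sum(class_counts.values()) for w, c in class_counts.items()}
    for y, class_counts in counts.items()
}
print(cond_probs)   # matches the table: 3/10, 3/10, 4/10 and 2/5, 2/5, 1/5
```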

Quiz (slide 22)
Using the probabilities in the table in slide 21 and the multinomial Naive Bayes formula in slide 19, compute P(D|+1) and P(D|-1).

Quiz (slide 23)
Assuming that both positive and negative classes have equal prior probability, p(+1) = p(-1) = 0.5, compute P(+1|D) and P(-1|D) for the document D in slide 20. Is this document positive or negative?