Bayesian Inference

The purpose of this document is to review belief networks and naive Bayes classifiers. It is organized into the following sections: Definitions from Probability, Belief Networks, Naive Bayes Classifiers, and Advantages and Disadvantages of Naive Bayes Classifiers.

At the end of lesson 9, Charles introduces the Bayes optimal classifier. Although this is the best-performing classification model for a given hypothesis space, data set, and a priori knowledge, the Bayes optimal classifier is computationally very costly. This is because the posterior probability P(h | D) must be computed for each hypothesis h ∈ H and combined with the prediction P(v | h) before the most probable classification can be computed. In lesson 10, Michael discusses Bayesian inference. The end goal of this lesson is to introduce an alternative classification model to the Bayes optimal classifier: the naive Bayes classifier. This model is much more computationally efficient than Bayes optimal classification, and under certain conditions it has performance comparable to neural networks and decision trees [1]. Naive Bayes classifiers represent a special case of classifiers derived from belief networks, graphical models that represent a set of random variables and their conditional dependencies [2]. In these notes we review belief networks and the special case of naive Bayes classifiers, along with some definitions from probability.

Definitions from Probability:

In this section we recall a few definitions from probability that we will need moving forward. Feel free to skip this section if you are familiar with conditional probability and Bayes' theorem.

[1] Mitchell, Tom M. Machine Learning. Burr Ridge, IL: McGraw-Hill, 1997.
[2] "Bayesian network." Wikipedia, the Free Encyclopedia. Retrieved 9 May 2014. <http://en.wikipedia.org/wiki/bayesian_network>

We say that X is conditionally independent of Y given Z if for all values (x_i, y_j, z_k) we have

P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k).

Writing out the definitions, we see that this is equivalent to saying that for all values (x_i, y_j, z_k) we have

P(X = x_i, Y = y_j | Z = z_k) = P(X = x_i | Z = z_k) P(Y = y_j | Z = z_k).

We will also recall the following inference rules.

The product rule (a.k.a. the chain rule):

P(X, Y) = P(X | Y) P(Y) = P(Y | X) P(X)

It is helpful to note that this rule also has the more general form:

P(X_1, ..., X_n) = P(X_1 | X_2, ..., X_n) P(X_2 | X_3, ..., X_n) ... P(X_{n-1} | X_n) P(X_n).

Bayes' theorem, for a hypothesis h and data set D:

P(h | D) = P(D | h) P(h) / P(D)

It is useful to note that Bayes' theorem makes sense heuristically. If we increase the prior probability P(h) of a certain hypothesis, then we would naturally expect an increase in P(h | D). Similarly, if some data D is more likely to occur in a world where hypothesis h is true (this probability is P(D | h)), then observing the data should make us more confident in h (this confidence is P(h | D)). Last, if we increase P(D), the probability of observing the data D regardless of which hypothesis holds, then this decreases the support that D provides for any specific hypothesis h; hence P(D) and P(h | D) are inversely proportional.

Marginalization (a.k.a. the theorem of total probability): if A_1, ..., A_n are mutually exclusive events with Σ_{i=1}^{n} P(A_i) = 1, then

P(Y) = Σ_{i=1}^{n} P(Y | A_i) P(A_i).
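
To make these rules concrete, here is a small numerical sketch in Python. It is not part of the lecture material, and the prior and likelihood values for the two hypothetical hypotheses h1 and h2 are made up purely for illustration; the sketch marginalizes over the hypotheses to get P(D) and then applies Bayes' theorem to each.

```python
# A minimal numerical sketch of Bayes' theorem and marginalization.
# The hypotheses h1, h2 and all probability values are hypothetical.

priors = {"h1": 0.3, "h2": 0.7}        # P(h); mutually exclusive, sums to 1
likelihoods = {"h1": 0.9, "h2": 0.2}   # P(D | h)

# Marginalization (theorem of total probability): P(D) = sum_h P(D | h) P(h)
p_D = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes' theorem: P(h | D) = P(D | h) P(h) / P(D)
posteriors = {h: likelihoods[h] * priors[h] / p_D for h in priors}

print(p_D)         # 0.9*0.3 + 0.2*0.7 = 0.41
print(posteriors)  # {'h1': 0.658..., 'h2': 0.341...}
```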

Belief networks:

As described in the introduction, belief networks (a.k.a. Bayes nets, Bayesian networks, or probabilistic directed acyclic graphical models) are graphical models that describe the probability distribution of a set of variables in terms of the variables' conditional dependencies. In particular, given a set of random variables Y_1, Y_2, ..., Y_n, a network is called a belief network if the joint probability distribution of the n-tuple (Y_1, ..., Y_n) can be written as

P(Y_1, ..., Y_n) = Π_{i=1}^{n} P(Y_i | Parents(Y_i)).

We will explain which variables are included in the set Parents(Y_i) below. In fact, let's start with an example illustrating this idea, and we can nail down the definitions as we go. Suppose we are given the following variables: "you are hungry!", "you own a robot that can cook and clean", "you will be eating dinner at home", and "you will have dishes to do". These variables are arranged in a belief network (drawn as a figure in the original notes and not reproduced here) in which each node represents a variable and the arrows into a node indicate which other variables it directly depends on; the arrows that are absent encode the conditional independence assumptions. More on this below. For brevity, we will refer to these variables as Robot = Y_1, HomeDinner = Y_2, Dishes = Y_3, Hungry! = Y_4. In this particular network, Robot and Hungry! have no parents, HomeDinner has parents Robot and Hungry!, and Dishes has parents Robot, HomeDinner, and Hungry!.

Here, we say Y_i is a descendant of Y_j if there exists a directed path from Y_j to Y_i. For belief networks, we define the Parents of a variable to be the variable's immediate predecessors in the network. So, for example,

Descendants(Robot) = (Dishes, HomeDinner)
Parents(Dishes) = (Robot, HomeDinner, Hungry!)

For each node in the network, we are given a conditional probability table describing the probability distribution of the corresponding variable given the variable's parents. The belief network shows that for an assignment of values (Y_1 = v_1, ..., Y_4 = v_4), the joint probability can be written as

P(v_1, ..., v_4) = Π_{i=1}^{4} P(v_i | Parents(Y_i)).

Note here that P(v_i | Parents(Y_i)) is exactly the information provided by the table for each node. Thus, in our case, given the belief network described above, we could determine the probability that you are hungry!, will eat dinner at home, will not have to do dishes, and own an awesome dish-cleaning, dinner-making robot,

P(Robot = 1, HomeDinner = 1, Dishes = 0, Hungry! = 1),

with the following four values from the belief network tables:

P(Robot = 1),
P(HomeDinner = 1 | Robot = 1, Hungry! = 1),
P(Dishes = 0 | Robot = 1, HomeDinner = 1, Hungry! = 1),
P(Hungry! = 1).
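
As an illustrative aside (not from the lectures), the following Python sketch carries out exactly this computation. The network structure matches the example above, but every number in the conditional probability tables is invented.

```python
# Hypothetical conditional probability tables for the Robot / HomeDinner /
# Dishes / Hungry! network described above.  All numbers are invented.

p_robot = 0.1     # P(Robot = 1); Robot has no parents
p_hungry = 0.8    # P(Hungry! = 1); Hungry! has no parents

# P(HomeDinner = 1 | Robot, Hungry!)
p_homedinner = {(1, 1): 0.9, (1, 0): 0.5, (0, 1): 0.6, (0, 0): 0.2}

# P(Dishes = 1 | Robot, HomeDinner, Hungry!)
p_dishes = {(1, 1, 1): 0.1, (1, 1, 0): 0.1, (1, 0, 1): 0.05, (1, 0, 0): 0.05,
            (0, 1, 1): 0.9, (0, 1, 0): 0.8, (0, 0, 1): 0.3, (0, 0, 0): 0.2}

# P(Robot=1, HomeDinner=1, Dishes=0, Hungry!=1)
#   = P(Robot=1) * P(Hungry!=1) * P(HomeDinner=1 | Robot=1, Hungry!=1)
#     * P(Dishes=0 | Robot=1, HomeDinner=1, Hungry!=1)
joint = p_robot * p_hungry * p_homedinner[(1, 1)] * (1 - p_dishes[(1, 1, 1)])
print(joint)  # 0.1 * 0.8 * 0.9 * 0.9 = 0.0648
```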

Ok, great! We have shown how you can use a belief net to compute joint probabilities, but what if you are given a set of random variables? Can you create a belief net that corresponds to these variables? It turns out you can. We demonstrate how to do this with the variables X_1, X_2, X_3, and X_4. Using the chain rule we can write:

P(X_1, X_2, X_3, X_4) = P(X_4 | X_3, X_2, X_1) P(X_3 | X_2, X_1) P(X_2 | X_1) P(X_1).

For each conditional probability P(X_i | X_{i-1}, X_{i-2}, ..., X_1) on the right side of the equation, we choose the smallest subset of the conditioning variables, which will be Parents(X_i), such that

P(X_i | X_{i-1}, X_{i-2}, ..., X_1) = P(X_i | Parents(X_i)).

For example, it may be the case that the smallest such subset for P(X_4 | X_3, X_2, X_1) gives

P(X_4 | X_3, X_2, X_1) = P(X_4 | X_3, X_2),

in which case we would write Parents(X_4) = (X_3, X_2). Then, to create the belief net, we just need to create a graph with an incoming edge from each variable in Parents(X_i) to the variable X_i, together with a conditional probability table for X_i given Parents(X_i).

Naive Bayes Classifiers:

Naive Bayes classifiers represent a special case of the belief nets covered above, but with stronger independence assumptions. In particular, suppose we are given a classification variable V and some attribute variables a_1, ..., a_n. The example seen in the lectures is the email spam-filtering example, where V = spam, a_1 = viagra, a_2 = prince, a_3 = Udacity. For our classifier to be a naive Bayes classifier, we make the (naive) assumption that, given the classification variable V, every attribute variable is conditionally independent of every other attribute variable. Graphically, in terms of the belief nets described above, the classification variable V is the sole parent of each attribute node a_1, ..., a_n; a small sketch of this structure is given below.
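
Continuing the illustrative sketches, the structure of the naive Bayes net for the spam example can be written down as a simple parents map (the variable names are just the ones from the lecture example); the point is only that every attribute has the single parent V, so the joint distribution factorizes as P(V) Π_i P(a_i | V).

```python
# Naive Bayes as a belief net for the (hypothetical) spam example:
# every attribute node has exactly one parent, the class variable V = spam.
parents = {
    "spam": [],             # V has no parents
    "viagra": ["spam"],     # a_1
    "prince": ["spam"],     # a_2
    "udacity": ["spam"],    # a_3
}

# The factorization implied by this structure:
#   P(spam, viagra, prince, udacity)
#     = P(spam) * P(viagra | spam) * P(prince | spam) * P(udacity | spam)
```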

For the classification variable V, as always, we would like to find the most probable target value v_MAP given the values of our attributes. We can write the expression for v_MAP and then use Bayes' theorem to manipulate the expression as follows:

v_MAP = argmax_{v_j ∈ V} P(v_j | a_1, a_2, ..., a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j) / P(a_1, a_2, ..., a_n)
      = argmax_{v_j ∈ V} P(a_1, a_2, ..., a_n | v_j) P(v_j).

(The denominator P(a_1, ..., a_n) does not depend on v_j, so it can be dropped from the argmax.) Next we would like to simplify the last expression. In what follows, we use the general product rule for the first step and our naive conditional independence assumption for the second step:

P(a_1, a_2, ..., a_n | v_j) = P(a_1 | a_2, ..., a_n, v_j) P(a_2 | a_3, ..., a_n, v_j) ... P(a_n | v_j)
                            = P(a_1 | v_j) P(a_2 | v_j) ... P(a_n | v_j).

Substituting this equality into the formula for v_MAP, and writing the product more compactly, we have the following expression:

v_MAP = argmax_{v_j ∈ V} P(v_j) Π_{i=1}^{n} P(a_i | v_j).
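
To see the final formula in action, here is a minimal Python sketch for the spam example. All of the class priors and conditional probabilities below are invented; in practice they would be estimated from training data, for example as relative frequencies.

```python
from math import prod

# Hypothetical estimates for the spam example.
priors = {"spam": 0.4, "not_spam": 0.6}                  # P(v_j)
cond = {                                                 # P(a_i = 1 | v_j)
    "spam":     {"viagra": 0.30,  "prince": 0.20,  "udacity": 0.01},
    "not_spam": {"viagra": 0.001, "prince": 0.005, "udacity": 0.10},
}

def v_map(observed):
    """Return argmax_{v_j} P(v_j) * prod_i P(a_i | v_j) for binary attributes."""
    def score(v):
        # Use P(a_i = 1 | v) if the word is present, else P(a_i = 0 | v).
        return priors[v] * prod(
            cond[v][a] if present else 1 - cond[v][a]
            for a, present in observed.items()
        )
    return max(priors, key=score)

# An email containing "viagra" and "prince" but not "Udacity":
print(v_map({"viagra": True, "prince": True, "udacity": False}))  # -> spam
```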

One great computational advantage of this formula is the small number of terms that must be estimated in order to compute v_MAP. More precisely, for each classification category v_j that V can take, we must estimate the n values P(a_i | v_j). Thus, the total number of terms to be estimated is just the number of attributes n multiplied by the number of distinct values v_j that V can take. A second computational advantage of the formula is that each of the terms to be estimated is a one-dimensional probability, which can be estimated from a smaller data set than the joint probability. By contrast, a direct estimate of the joint probability P(a_1, a_2, ..., a_n | v_j) suffers from the curse of dimensionality [3]. Recall from lesson four of the lectures that the curse of dimensionality occurs when the amount of data needed to develop an acceptable (i.e., non-overfitted) classifier grows exponentially with the number of features.

[3] Mitchell, Tom M. Machine Learning. Burr Ridge, IL: McGraw-Hill, 1997.

Advantages and Disadvantages of Naive Bayes Classifiers:

As mentioned above, the naive Bayes classifier is very efficient to train, in terms of the total number of computations needed. Also, as mentioned in the introduction, naive Bayes performs well on many different learning tasks. However, because of the strong conditional independence assumption placed on the attributes in the model, there are situations where naive Bayes is not appropriate. To get an idea of why this might be the case, consider the task of using naive Bayes to learn XOR. Here, our attributes will be the inputs X_1 and X_2, and the classification variable is V = (X_1 XOR X_2). Naive Bayes considers each of the attributes independently and will be unable to accurately predict v_MAP given input values X_1 = x_1 and X_2 = x_2; a small demonstration follows.
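
As a closing illustration (again, not from the lecture notes), the following sketch pushes the four XOR examples through relative-frequency naive Bayes estimates. Every estimated conditional probability comes out to 0.5, so both classes receive identical scores on every input and the classifier can do no better than guessing.

```python
# The four XOR training examples: ((X_1, X_2), V = X_1 XOR X_2).
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
values = [0, 1]

# P(V = v), estimated by relative frequency.
prior = {v: sum(1 for _, y in data if y == v) / len(data) for v in values}

def cond(i, xi, v):
    """Relative-frequency estimate of P(X_i = xi | V = v)."""
    rows = [x for x, y in data if y == v]
    return sum(1 for x in rows if x[i] == xi) / len(rows)

# Naive Bayes score P(v) * P(X_1 = x1 | v) * P(X_2 = x2 | v) for each input.
for (x1, x2), label in data:
    scores = {v: prior[v] * cond(0, x1, v) * cond(1, x2, v) for v in values}
    print((x1, x2), "true label:", label, "scores:", scores)

# Every input receives the same score (0.125) for both classes, so the
# classifier cannot distinguish V = 0 from V = 1 on XOR.
```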