Probabilistic Classification



Probabilistic Classification
Goal:
Gather labeled training data.
Build/learn a probability model.
Use the model to infer class labels for unlabeled data points.
Example: spam filtering...

(Simplistic) Spam Filtering
SPAM
viagra discount cs444 | count
T      T        T     | 0
T      T        F     | 180
T      F        T     | 0
T      F        F     | 1200
F      T        T     | 8
F      T        F     | 600
F      F        T     | 12
F      F        F     | 6000
Total: 8000
NON SPAM
viagra discount cs444 | count
T      T        T     | 0
T      T        F     | 0
T      F        T     | 1
T      F        F     | 3
F      T        T     | 6
F      T        F     | 20
F      F        T     | 70
F      F        F     | 700
Total: 800
Quiz! Estimate:
P(spam) = ??
P(¬viagra, discount, cs444 | spam) = ??
P(¬spam) = ??
P(¬viagra, discount, cs444 | ¬spam) = ??

(Simplistic) Spam Filtering
[Same SPAM and NON SPAM count tables as above.]
Quiz! Estimate:
P(spam) = 8000/8800 ≈ .91
P(¬viagra, discount, cs444 | spam) = 8/8000 = .001
P(¬spam) ≈ .09
P(¬viagra, discount, cs444 | ¬spam) = 6/800 = .0075
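Purely as an illustration, here is a short Python sketch (the dictionary layout and variable names are my own; the counts are copied from the slide) that reproduces these estimates from the raw counts:
# Counts from the slide, keyed by (viagra, discount, cs444).
spam_counts = {
    (True, True, True): 0,    (True, True, False): 180,
    (True, False, True): 0,   (True, False, False): 1200,
    (False, True, True): 8,   (False, True, False): 600,
    (False, False, True): 12, (False, False, False): 6000,
}
non_spam_counts = {
    (True, True, True): 0,    (True, True, False): 0,
    (True, False, True): 1,   (True, False, False): 3,
    (False, True, True): 6,   (False, True, False): 20,
    (False, False, True): 70, (False, False, False): 700,
}
n_spam = sum(spam_counts.values())          # 8000
n_non_spam = sum(non_spam_counts.values())  # 800
p_spam = n_spam / (n_spam + n_non_spam)     # 8000/8800 ~ 0.91
p_non_spam = 1 - p_spam                     # ~ 0.09
# P(not viagra, discount, cs444 | spam) and the same conditioned on non-spam:
print(spam_counts[(False, True, True)] / n_spam)          # 8/8000 = 0.001
print(non_spam_counts[(False, True, True)] / n_non_spam)  # 6/800  = 0.0075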

Aside: Maximum Likelihood Learning
This approach makes sense, but what is the justification?
It is the maximum likelihood estimate:
φ̂ = argmax_{φ ∈ Φ} P(observed data | φ)
where φ represents the parameters we are trying to learn.
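As an illustration of this definition (my own example, not from the slides): for the spam prior, the relative frequency 8000/8800 is exactly the value of φ that maximizes the likelihood of the observed labels. A quick numerical check:
import math
n_spam, n_total = 8000, 8800
def log_likelihood(phi):
    # Log-probability of seeing 8000 spam labels out of 8800 under a
    # Bernoulli parameter phi (the constant binomial coefficient is dropped).
    return n_spam * math.log(phi) + (n_total - n_spam) * math.log(1 - phi)
candidates = [i / 1000 for i in range(1, 1000)]   # crude grid over (0, 1)
phi_hat = max(candidates, key=log_likelihood)
print(phi_hat)           # 0.909, the grid point closest to the maximizer
print(n_spam / n_total)  # 0.9090..., the relative-frequency estimate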

Bayes' Optimal Classifier I (No Independence Assumptions)
Assume a multivalued random variable C that can take on the values c_i for i = 1 to K.
Assume M input attributes X_j for j = 1 to M.
Learn P(X_1, X_2, ..., X_M | C = c_i) for each i. Treat this as K different joint probability distributions.
Given a set of input values X_1 = u_1, X_2 = u_2, ..., X_M = u_M, classification is easy (?):
C_predict = argmax_{c_i} P(C = c_i | X_1 = u_1, ..., X_M = u_M)

Bayes' Optimal Classifier II (No Independence Assumptions)
C_predict = argmax_{c_i} P(C = c_i | X_1 = u_1, ..., X_M = u_M)
Apply Bayes' rule:
C_predict = argmax_{c_i} P(X_1 = u_1, ..., X_M = u_M | C = c_i) P(C = c_i) / P(X_1 = u_1, ..., X_M = u_M)
Conditioning (applying the law of total probability) to expand the denominator:
C_predict = argmax_{c_i} P(X_1 = u_1, ..., X_M = u_M | C = c_i) P(C = c_i) / Σ_{k=1}^{K} P(X_1 = u_1, ..., X_M = u_M | C = c_k) P(C = c_k)

An Aside: MAP vs. ML
This is a maximum a posteriori (MAP) classifier:
C_predict = argmax_{c_i} P(C = c_i | X_1 = u_1, ..., X_M = u_M)
We could also consider a maximum likelihood (ML) classifier:
C_predict = argmax_{c_i} P(X_1 = u_1, ..., X_M = u_M | C = c_i)

Bayes' Optimal Classifier III (No Independence Assumptions)
C_predict = argmax_{c_i} P(X_1 = u_1, ..., X_M = u_M | C = c_i) P(C = c_i) / Σ_{k=1}^{K} P(X_1 = u_1, ..., X_M = u_M | C = c_k) P(C = c_k)
Notice that the denominator is the same for all classes, so we can simplify this to:
C_predict = argmax_{c_i} P(X_1 = u_1, X_2 = u_2, ..., X_M = u_M | C = c_i) P(C = c_i)
If you have the true distributions, this is the best possible choice.

Bayes' Optimal Classifier Example
C_predict = argmax_{c_i} P(X_1 = u_1, X_2 = u_2, ..., X_M = u_M | C = c_i) P(C = c_i)
Email contains "discount" and "cs444", but not "viagra". Is it spam?
Recall:
P(spam) ≈ .91
P(¬spam) ≈ .09
P(¬viagra, discount, cs444 | spam) = .001
P(¬viagra, discount, cs444 | ¬spam) = .0075

Bayes' Optimal Classifier Example
C_predict = argmax_{c_i} P(X_1 = u_1, X_2 = u_2, ..., X_M = u_M | C = c_i) P(C = c_i)
Email contains "discount" and "cs444", but not "viagra". Is it spam?
P(¬viagra, discount, cs444 | spam) P(spam) = .001 × .91 = .00091
P(¬viagra, discount, cs444 | ¬spam) P(¬spam) = .0075 × .09 = .000675
The spam term is larger, so the email is classified as spam.
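A minimal Python sketch of this full-joint MAP rule (dictionary layout and function names are my own; counts copied from the earlier slide):
# Count tables keyed by (viagra, discount, cs444).
counts = {
    "spam": {
        (True, True, True): 0,    (True, True, False): 180,
        (True, False, True): 0,   (True, False, False): 1200,
        (False, True, True): 8,   (False, True, False): 600,
        (False, False, True): 12, (False, False, False): 6000,
    },
    "non-spam": {
        (True, True, True): 0,    (True, True, False): 0,
        (True, False, True): 1,   (True, False, False): 3,
        (False, True, True): 6,   (False, True, False): 20,
        (False, False, True): 70, (False, False, False): 700,
    },
}
n_total = sum(sum(t.values()) for t in counts.values())   # 8800
def map_classify(x):
    # argmax over classes of P(x | class) * P(class), both estimated from counts.
    def score(c):
        n_c = sum(counts[c].values())
        return (counts[c][x] / n_c) * (n_c / n_total)
    return max(counts, key=score), {c: score(c) for c in counts}
label, scores = map_classify((False, True, True))   # discount and cs444, no viagra
print(label)    # 'spam'
print(scores)   # spam ~ 0.00091, non-spam ~ 0.00068 (the slide's .000675 reflects rounding)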

The Problem...
Assume we want to use hundreds of words. What is the problem here?
C_predict = argmax_{c_i} P(X_1 = u_1, X_2 = u_2, ..., X_M = u_M | C = c_i) P(C = c_i)

The Problem...
Assume we want to use hundreds of words. What is the problem here?
C_predict = argmax_{c_i} P(X_1 = u_1, X_2 = u_2, ..., X_M = u_M | C = c_i) P(C = c_i)
If M is large, it is impossible to learn P(X_1, X_2, ..., X_M | C = c_i): for binary attributes the joint table has 2^M entries per class, far more than we could ever estimate from data.

(Changing Gears) Review
Joint probability distributions are the key to probabilistic inference.
If all N variables are completely independent, we can represent the full joint distribution using N numbers.
If every variable depends on every other, we need to store on the order of 2^N values (for N binary variables).
There must be something in between...

Conditional Independence
Saying that X is conditionally independent of Y given Z means:
P(X | Y, Z) = P(X | Z)
or, equivalently:
P(X, Y | Z) = P(X | Z) P(Y | Z)
This graph encodes the same thing: [figure: a three-node graph over X, Y, and Z in which Z separates X from Y].
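A small numerical illustration (my own construction, not from the slides): build a joint over X, Y, Z from the factorization P(z) P(x | z) P(y | z) and confirm that conditioning on Y changes nothing once Z is known:
from itertools import product
# Hand-picked numbers, illustrative only.
p_z = {True: 0.3, False: 0.7}            # P(Z = z)
p_x_given_z = {True: 0.9, False: 0.2}    # P(X = True | Z = z)
p_y_given_z = {True: 0.6, False: 0.1}    # P(Y = True | Z = z)
def bern(p_true, val):
    return p_true if val else 1 - p_true
# Joint built from the factorization P(x, y, z) = P(z) P(x | z) P(y | z).
joint = {
    (x, y, z): p_z[z] * bern(p_x_given_z[z], x) * bern(p_y_given_z[z], y)
    for x, y, z in product([True, False], repeat=3)
}
def p_x_given(x, y=None, z=None):
    # P(X = x | whichever of Y, Z is given), computed by summing the joint.
    def matches(k):
        return (y is None or k[1] == y) and (z is None or k[2] == z)
    num = sum(p for k, p in joint.items() if k[0] == x and matches(k))
    den = sum(p for k, p in joint.items() if matches(k))
    return num / den
print(p_x_given(True, z=True))          # P(X | Z)    -> 0.9
print(p_x_given(True, y=True, z=True))  # P(X | Y, Z) -> 0.9, unchanged: X is independent of Y given Z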

Bayesian Networks
Directed acyclic graphs with the following interpretation:
Each variable is conditionally independent of all its non-descendants, given the value of its parents.
[Figure: an example DAG over nodes A, B, C, D, E, F.]

Bayesian Networks
A Bayesian network is a directed acyclic graph that represents (ideally causal) relationships between random variables.
[Figure: an example network over Burglary, Earthquake, Weather, Health, Alarm, Mood, and Phone Call.]

Specifying a Bayes' Net
We need to specify:
The topology of the network.
The conditional probabilities.
Burglary: P(b) = .001
Earthquake: P(e) = .002
Alarm (parents B, E):
  B E | P(a)
  T T | .95
  T F | .94
  F T | .29
  F F | .001
Phone Call (parent A):
  A | P(c)
  T | .9
  F | .05

Bayesian Networks
We can then reconstruct any entry of the joint distribution using:
P(X_1, X_2, ..., X_N) = Π_{i=1}^{N} P(X_i | parents(X_i))
In other words, the complete joint probability distribution can be reconstructed from the N conditional distributions.
For N binary valued variables with M parents each: 2^N values vs. N × 2^M.
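A minimal sketch of this factorization for the burglary network above, with Python dictionaries standing in for the CPTs (only the numbers shown on the slide are used; function names are my own):
# CPTs copied from the slide; each gives P(variable = True | parents).
p_b = 0.001                                         # P(Burglary)
p_e = 0.002                                         # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,     # P(Alarm | B, E)
       (False, True): 0.29, (False, False): 0.001}
p_c = {True: 0.9, False: 0.05}                      # P(Call | Alarm)
def bern(p_true, val):
    # P(X = val) for a binary variable with P(X = True) = p_true.
    return p_true if val else 1 - p_true
def joint(b, e, a, c):
    # P(B=b, E=e, A=a, C=c) = P(b) P(e) P(a | b, e) P(c | a)
    return bern(p_b, b) * bern(p_e, e) * bern(p_a[(b, e)], a) * bern(p_c[a], c)
# One entry of the full joint: burglary, no earthquake, alarm rings, phone call.
print(joint(True, False, True, True))   # 0.001 * 0.998 * 0.94 * 0.9 ~ 0.000844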

Simple Application: Naive Bayes Classifier
Back to spam filtering...
Assumption: spamminess impacts the probability of different words. Words are conditionally independent of each other given spam/non-spam status.
Consider four Boolean random variables:
Spam (message is spam)
Viagra (message contains the word "viagra")
Discount (message contains the word "discount")
CS444 (message contains the word "CS444")
What will the graphical model look like?

Belief Network
[Figure: Spam is the parent of Viagra, Discount, and CS444.]
Now we need to specify the probabilities.
How does this help with inference...

Naïve Bayes' Classifier
Remember: if M is large, it is impossible to learn P(X_1, X_2, ..., X_M | C).
The solution (?): assume that the X_j are independent given C:
P(X_1, ..., X_M | C) = Π_{j=1}^{M} P(X_j | C)
The naïve Bayes' classifier:
C_predict = argmax_{c_i} P(C = c_i) Π_{j=1}^{M} P(X_j | C = c_i)
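A generic sketch of this decision rule (my own structure: class priors and per-feature conditionals are passed in as dictionaries; nothing here is specific to the spam example):
def naive_bayes_classify(x, priors, conditionals):
    # priors:       {class: P(C = class)}
    # conditionals: {class: [P(X_j = True | C = class) for each feature j]}
    # x:            tuple of booleans, one per feature
    def score(c):
        s = priors[c]
        for p_true, x_j in zip(conditionals[c], x):
            s *= p_true if x_j else 1 - p_true   # P(X_j = x_j | C = c)
        return s
    return max(priors, key=score)   # argmax_c P(C=c) * prod_j P(X_j = x_j | C=c)
The counts on the next slide provide exactly the per-class priors and per-word conditionals this sketch expects.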

Naïve Probability Estimation
[Same SPAM and NON SPAM count tables as above.]
Quiz! Estimate:
P(¬viagra, discount, cs444 | spam) = ??

Naïve Probability Estimation
[Same SPAM and NON SPAM count tables as above.]
Quiz! Estimate:
P(¬viagra, discount, cs444 | spam) ≈ (6620/8000) × (788/8000) × (20/8000) ≈ .0002
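The three factors in that product are just marginal column sums of the SPAM count table. A short sketch (same hypothetical dictionary layout as the earlier snippets):
spam_counts = {
    (True, True, True): 0,    (True, True, False): 180,
    (True, False, True): 0,   (True, False, False): 1200,
    (False, True, True): 8,   (False, True, False): 600,
    (False, False, True): 12, (False, False, False): 6000,
}
n_spam = sum(spam_counts.values())   # 8000
def p_word_given_spam(j):
    # P(word j present | spam): total count of rows where column j is True.
    return sum(c for k, c in spam_counts.items() if k[j]) / n_spam
p_viagra, p_discount, p_cs444 = (p_word_given_spam(j) for j in range(3))
# Naive estimate of P(not viagra, discount, cs444 | spam):
print((1 - p_viagra) * p_discount * p_cs444)   # (6620/8000)(788/8000)(20/8000) ~ 0.0002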

Why is that Naïve?
The attributes probably aren't really independent given the class (symptoms aren't independent given a disease; words aren't independent given spam status).
Assuming they are allows us to classify based on thousands of attributes, and this seems to work pretty well in practice.
Do you see any advantage of this relative to the classifiers we saw earlier?

A Note on Implementation
If M is largish, this product can get really small. Too small.
C_predict = argmax_{c_i} P(C = c_i) Π_{j=1}^{M} P(X_j | C = c_i)
Solution:
C_predict = argmax_{c_i} ( log P(C = c_i) + Σ_{j=1}^{M} log P(X_j | C = c_i) )
Remember that log(ab) = log(a) + log(b).
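A log-space version of the earlier naive Bayes sketch (same hypothetical argument layout; note that zero-probability entries would need smoothing before taking logs, which this slide does not cover):
import math
def naive_bayes_classify_log(x, priors, conditionals):
    # argmax_c [ log P(C=c) + sum_j log P(X_j = x_j | C=c) ]
    # Sums of logs avoid the underflow that the raw product suffers when M is large.
    def log_score(c):
        s = math.log(priors[c])
        for p_true, x_j in zip(conditionals[c], x):
            s += math.log(p_true if x_j else 1 - p_true)
        return s
    return max(priors, key=log_score)
Since log is monotonic, the argmax is unchanged; working in log space like this is the standard remedy for underflow.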