CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models
Stephen Scott


1 / 28  Bayesian Learning
(Adapted from Ethem Alpaydin and Tom Mitchell)
sscott@cse.unl.edu

2 / 28  Introduction
Might have reasons (domain information) to favor some hypotheses/predictions over others a priori.
Bayesian methods work with probabilities and have two main roles:
1. Provide practical learning algorithms:
   - Naïve Bayes learning
   - Bayesian belief network learning
   - Combine prior knowledge (prior probabilities) with observed data
   - Requires prior probabilities
2. Provide a useful conceptual framework:
   - Provides a gold standard for evaluating other learning algorithms
   - Additional insight into Occam's razor

3 / 28  Outline
- Bayes theorem
- Example
- Bayes optimal classifier
- Naïve Bayes classifier
- Example: Learning over text data
- Bayesian belief networks

4 / 28  Bayes Theorem
We want to know the probability that a particular label r is correct given that we have seen data D.
Conditional probability: P(r | D) = P(r ∧ D) / P(D)
Bayes theorem: P(r | D) = P(D | r) P(r) / P(D)
- P(r) = prior probability of label r (might include domain information)
- P(D) = probability of data D
- P(r | D) = posterior probability of r given D
- P(D | r) = probability (aka likelihood) of D given r
Note: P(r | D) increases with P(D | r) and P(r), and decreases with P(D).

5 / 28  Example
Does a patient have cancer or not? A patient takes a lab test and the result is positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.
P(cancer) =              P(¬cancer) =
P(+ | cancer) =          P(− | cancer) =
P(+ | ¬cancer) =         P(− | ¬cancer) =
Now consider a new patient for whom the test is positive. What is our diagnosis?
P(+ | cancer) P(cancer) =
P(+ | ¬cancer) P(¬cancer) =
So the diagnosis is ...
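
A quick numeric check of this example, using only the figures stated above (0.8% prevalence, 98% true-positive rate, 97% true-negative rate); this sketch is not part of the original slides.

    # Bayes-theorem arithmetic for the cancer example (values from the slide)
    p_cancer = 0.008                      # P(cancer)
    p_not_cancer = 1 - p_cancer           # P(no cancer) = 0.992
    p_pos_given_cancer = 0.98             # P(+ | cancer)
    p_pos_given_not_cancer = 1 - 0.97     # P(+ | no cancer) = 0.03

    # Unnormalized posteriors P(+ | r) P(r)
    score_cancer = p_pos_given_cancer * p_cancer               # 0.98 * 0.008 = 0.00784
    score_not_cancer = p_pos_given_not_cancer * p_not_cancer   # 0.03 * 0.992 = 0.02976

    # Normalizing by P(+) gives the posterior; it is well below 0.5,
    # so the diagnosis is "no cancer" even though the test came back positive.
    print(score_cancer / (score_cancer + score_not_cancer))    # P(cancer | +) ~ 0.21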

6 / 28  Basic Formulas for Probabilities
Product rule: probability P(A ∧ B) of a conjunction of two events A and B:
  P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
Sum rule: probability of a disjunction of two events A and B:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^{n} P(A_i) = 1, then
  P(B) = Σ_{i=1}^{n} P(B | A_i) P(A_i)
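
A tiny numeric illustration of the three rules; the specific probabilities below are made up for the example and are not from the slides.

    # Product rule: P(A ^ B) = P(A | B) P(B)
    p_b, p_a_given_b = 0.4, 0.5
    p_a_and_b = p_a_given_b * p_b                            # 0.2

    # Sum rule: P(A v B) = P(A) + P(B) - P(A ^ B)
    p_a = 0.3
    p_a_or_b = p_a + p_b - p_a_and_b                         # 0.5

    # Total probability: A1, A2 mutually exclusive with P(A1) + P(A2) = 1
    p_a1, p_a2 = 0.6, 0.4
    p_b_given_a1, p_b_given_a2 = 0.1, 0.7
    p_b_total = p_b_given_a1 * p_a1 + p_b_given_a2 * p_a2    # 0.06 + 0.28 = 0.34

    print(p_a_and_b, p_a_or_b, p_b_total)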

7 / 28  Bayes Optimal Classifier
Bayes rule lets us get a handle on the most probable label for an instance.
Bayes optimal classification of instance x:
  argmax_{r_j ∈ R} P(r_j | x)
where R is the set of possible labels (e.g., R = {+, −}).
On average, no other classifier using the same prior information and the same hypothesis space can outperform Bayes optimal!
Gold standard for classification.

8 / 28  Bayes Optimal Classifier: Applying Bayes Rule
Let instance x be described by attributes x_1, x_2, ..., x_n. Then the most probable label of x is:
  r* = argmax_{r_j ∈ R} P(r_j | x_1, x_2, ..., x_n)
     = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j) / P(x_1, x_2, ..., x_n)
     = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j)
In other words, if we can estimate P(r_j) and P(x_1, x_2, ..., x_n | r_j) for all possibilities, then we can give a Bayes optimal prediction of the label of x for all x.
How do we estimate P(r_j) from training data? What about P(x_1, x_2, ..., x_n | r_j)?
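
The denominator P(x_1, ..., x_n) is the same for every candidate label, so dropping it in the last step above cannot change which label attains the maximum; a small illustration with made-up scores:

    # P(x | r) P(r) for each candidate label r (made-up numbers)
    likelihood_times_prior = {"+": 0.02, "-": 0.06}
    p_x = sum(likelihood_times_prior.values())       # P(x) by total probability

    posteriors = {r: s / p_x for r, s in likelihood_times_prior.items()}
    print(max(likelihood_times_prior, key=likelihood_times_prior.get))   # '-'
    print(max(posteriors, key=posteriors.get))                           # '-' (same argmax)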

9 / 28  Naïve Bayes Classifier
Problem: Estimating P(r_j) is easily done, but there are exponentially many combinations of values of x_1, ..., x_n.
E.g., if we want to estimate P(Sunny, Hot, High, Weak | PlayTennis = No) from the data, we need to count, among the No-labeled instances, how many exactly match x (few or none).
Naïve Bayes assumption:
  P(x_1, x_2, ..., x_n | r_j) = Π_i P(x_i | r_j)
so the naïve Bayes classifier is:
  r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j)
Now we have only a polynomial number of probabilities to estimate.

10 / 28  Naïve Bayes Classifier (cont'd)
Along with decision trees, neural networks, nearest neighbor, SVMs, and boosting, one of the most practical learning methods.
When to use:
- A moderate or large training set is available
- The attributes that describe instances are conditionally independent given the classification
Successful applications:
- Diagnosis
- Classifying text documents

11 / 28  Naïve Bayes Algorithm
Naive_Bayes_Learn(examples):
  For each target value r_j:
    1. P̂(r_j) ← estimate P(r_j) = fraction of examples with label r_j
    2. For each value v_ik of each attribute x_i ∈ x:
       P̂(v_ik | r_j) ← estimate P(v_ik | r_j) = fraction of r_j-labeled instances with x_i = v_ik
Classify_New_Instance(x):
  r_NB = argmax_{r_j ∈ R} P̂(r_j) Π_{x_i ∈ x} P̂(x_i | r_j)
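
A minimal Python sketch of the two procedures above; the function and variable names are mine, not from the slides, and no smoothing is applied yet (see the m-estimate later).

    from collections import Counter, defaultdict

    def naive_bayes_learn(examples):
        """examples: list of (attribute_tuple, label) pairs."""
        n = len(examples)
        label_counts = Counter(label for _, label in examples)
        priors = {r: c / n for r, c in label_counts.items()}        # estimates of P(r_j)

        # cond[r][i][v] = estimate of P(x_i = v | r_j), as a relative frequency
        cond = {r: defaultdict(Counter) for r in label_counts}
        for attrs, label in examples:
            for i, v in enumerate(attrs):
                cond[label][i][v] += 1
        for r in cond:
            for i in cond[r]:
                cond[r][i] = {v: c / label_counts[r] for v, c in cond[r][i].items()}
        return priors, cond

    def classify_new_instance(x, priors, cond):
        """Return argmax_r P(r) * prod_i P(x_i | r); unseen attribute values get probability 0."""
        scores = {}
        for r, prior in priors.items():
            score = prior
            for i, v in enumerate(x):
                score *= cond[r][i].get(v, 0.0)
            scores[r] = score
        return max(scores, key=scores.get), scores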

12 / 28  Naïve Bayes Example
Training examples (Mitchell, Table 3.2):

Day  Outlook   Temperature  Humidity  Wind    PlayTennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Instance x to classify: Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong

13 / 28  Naïve Bayes Example (cont'd)
Assign label r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j):
  P(Yes) P(Sunny | Yes) P(Cool | Yes) P(High | Yes) P(Strong | Yes) = (9/14) (2/9) (3/9) (3/9) (3/9) = 0.0053
  P(No) P(Sunny | No) P(Cool | No) P(High | No) P(Strong | No) = (5/14) (3/5) (1/5) (4/5) (3/5) = 0.0206
So r_NB = No.
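
A direct arithmetic check of the two products above (the same relative-frequency estimates the sketch after the algorithm slide would compute on this training set):

    p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)   # P(Y) P(sun|Y) P(cool|Y) P(high|Y) P(strong|Y)
    p_no  = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)   # P(N) P(sun|N) P(cool|N) P(high|N) P(strong|N)
    print(round(p_yes, 4), round(p_no, 4))           # 0.0053 0.0206  ->  predict No

    # Normalizing the two scores gives the estimated posterior for the winner:
    print(p_no / (p_yes + p_no))                     # ~0.795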

14 / 28  Naïve Bayes Subtleties
The conditional independence assumption is often violated, i.e.,
  P(x_1, x_2, ..., x_n | r_j) ≠ Π_i P(x_i | r_j),
but it works surprisingly well anyway.
Note that we don't need the estimated posteriors P̂(r_j | x) to be correct; we need only that
  argmax_{r_j ∈ R} P̂(r_j) Π_i P̂(x_i | r_j) = argmax_{r_j ∈ R} P(r_j) P(x_1, ..., x_n | r_j)
Sufficient conditions are given in Domingos & Pazzani [1996].

15 / 28  Naïve Bayes Subtleties (cont'd)
What if none of the training instances with target value r_j have attribute value v_ik? Then P̂(v_ik | r_j) = 0, and so P̂(r_j) Π_i P̂(x_i | r_j) = 0.
Typical solution is to use the m-estimate:
  P̂(v_ik | r_j) ← (n_c + m p) / (n + m)
where
- n is the number of training examples for which r = r_j,
- n_c is the number of examples for which r = r_j and x_i = v_ik,
- p is a prior estimate for P̂(v_ik | r_j),
- m is the weight given to the prior (i.e., the number of virtual examples).
Sometimes called pseudocounts.
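
A one-line helper for the m-estimate above; the example call uses made-up numbers (an attribute value never seen with this label among n = 5 examples, a uniform prior p = 1/3 over three values, and m = 3 virtual examples).

    def m_estimate(n_c, n, p, m):
        """Smoothed estimate (n_c + m*p) / (n + m) for P(v_ik | r_j)."""
        return (n_c + m * p) / (n + m)

    print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125 rather than a hard 0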

16 / 28  Naïve Bayes Application: Text Classification
Target concept Spam?: Document → {+, −}.
Each document is a vector of words (one attribute per word position), e.g., x_1 = "each", x_2 = "document", etc.
Naïve Bayes is very effective despite the obvious violation of the conditional independence assumption (the assumption that words in an email are independent of those around them).
Set P(+) = fraction of training emails that are spam, and P(−) = 1 − P(+).
To simplify matters, we will assume position independence, i.e., we only model which words appear in spam/not spam, not their positions.
For every word w in our vocabulary, P(w | +) = probability that w appears in any position of +-labeled training emails (factoring in the prior via an m-estimate).

17 / 28  Naïve Bayes Application: Text Classification Pseudocode [Mitchell]
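
The pseudocode figure referenced here did not survive transcription; below is a rough position-independent (bag-of-words) sketch along the lines the previous slide describes, using add-one smoothing as the m-estimate (uniform prior p = 1/|Vocabulary|, m = |Vocabulary|). The function names and details are mine, not Mitchell's exact pseudocode.

    import math
    from collections import Counter

    def learn_text_nb(docs):
        """docs: list of (list_of_words, label) pairs with label in {'+', '-'}."""
        vocab = {w for words, _ in docs for w in words}
        priors, word_probs = {}, {}
        for label in ('+', '-'):
            labeled = [words for words, l in docs if l == label]
            priors[label] = len(labeled) / len(docs)                  # P(label)
            counts = Counter(w for words in labeled for w in words)
            total = sum(counts.values())
            # m-estimate with p = 1/|vocab| and m = |vocab| ("add one" per word)
            word_probs[label] = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
        return priors, word_probs, vocab

    def classify_text_nb(words, priors, word_probs, vocab):
        scores = {}
        for label, prior in priors.items():
            log_score = math.log(prior)          # log-space avoids underflow on long documents
            for w in words:
                if w in vocab:                   # skip words never seen in training
                    log_score += math.log(word_probs[label][w])
            scores[label] = log_score
        return max(scores, key=scores.get)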

18 / 28  Bayesian Belief Networks
Sometimes the naïve Bayes assumption of conditional independence is too restrictive.
But inferring probabilities is intractable without some such assumptions.
Bayesian belief networks (also called Bayes nets) describe conditional independence among subsets of variables.
This allows combining prior knowledge about dependencies among variables with observed training data.

19 / 28  Bayesian Belief Networks: Conditional Independence
Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if
  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)
More compactly, we write P(X | Y, Z) = P(X | Z).
Example: Thunder is conditionally independent of Rain, given Lightning:
  P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Naïve Bayes uses conditional independence and the product rule to justify
  P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z)

20 / 28  Bayesian Belief Networks: Definition
The network (a directed acyclic graph over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire) represents a set of conditional independence assertions:
- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- E.g., given Storm and BusTourGroup, Campfire is conditionally independent of Lightning and Thunder.
Conditional probability table for Campfire:
           C     ¬C
  S, B     0.4   0.6
  S, ¬B    0.1   0.9
  ¬S, B    0.8   0.2
  ¬S, ¬B   0.2   0.8

21 / 28  Bayesian Belief Networks: Special Case
Since each node is conditionally independent of its nondescendants given its immediate predecessors, what model does this represent, given that C is the class and the x_i's are attributes?

22 / 28  Bayesian Belief Networks: Causality
Can think of the edges in a Bayes net as representing a causal relationship between nodes.
E.g., rain causes wet grass: the probability of wet grass depends on whether there is rain.

23 / 28  Bayesian Belief Networks: Generative Model
The network over Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire (with the Campfire CPT shown earlier) represents the joint probability distribution over the variables Y_1, ..., Y_n, e.g., P(Storm, BusTourGroup, ..., ForestFire).
In general, for y_i = value of Y_i,
  P(y_1, ..., y_n) = Π_{i=1}^{n} P(y_i | Parents(Y_i))
where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph.
E.g., P(S, B, C, ¬L, ¬T, ¬F) = P(S) P(B) P(C | B, S) P(¬L | S) P(¬T | ¬L) P(¬F | S, ¬L, C), where P(C | B, S) = 0.4 from the CPT.

24 / 28  Bayesian Belief Networks: Predicting the Most Likely Label
We sometimes call Bayes nets generative (vs. discriminative) models since they can be used to generate instances Y_1, ..., Y_n according to the joint distribution.
Can use them for classification:
- The label r to predict is one of the variables, represented by a node.
- If we can determine the most likely value of r given the rest of the nodes, we can predict the label.
One idea: Go through all possible values of r, compute the joint distribution (previous slide) with that value and the other attribute values, then return the value that maximizes it.

25 / 28  Bayesian Belief Networks: Predicting the Most Likely Label (cont'd)
E.g., if Storm (S) is the label to predict and we are given the values of B, C, L, T, and F, we can use the formula to compute P(S, B, C, L, T, F) and P(¬S, B, C, L, T, F), then predict the more likely one.
This easily handles unspecified attribute values.
Issue: Takes time exponential in the number of values of the unspecified attributes.
More efficient approach: Pearl's message passing algorithm for chains, trees, and polytrees (at most one path between any pair of nodes).
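
A rough sketch of the enumeration idea for this network: compute the joint via the factorization on the Generative Model slide, then compare the two settings of Storm. The Campfire CPT is the one shown on the slides; the other numbers below (priors for S and B, and the tables for L, T, F) are placeholders I made up purely for illustration.

    def joint(s, b, c, l, t, f, cpt):
        """P(S,B,C,L,T,F) = P(S) P(B) P(C|B,S) P(L|S) P(T|L) P(F|S,L,C)."""
        def p(prob_true, value):                 # P(V = value) from P(V = True)
            return prob_true if value else 1 - prob_true
        return (p(cpt['S'], s) * p(cpt['B'], b)
                * p(cpt['C'][(s, b)], c)
                * p(cpt['L'][s], l)
                * p(cpt['T'][l], t)
                * p(cpt['F'][(s, l, c)], f))

    cpt = {
        'S': 0.2, 'B': 0.5,                                   # placeholder priors
        'C': {(True, True): 0.4, (True, False): 0.1,          # Campfire CPT from the slides
              (False, True): 0.8, (False, False): 0.2},
        'L': {True: 0.3, False: 0.05},                        # placeholder P(L | S)
        'T': {True: 0.9, False: 0.1},                         # placeholder P(T | L)
        'F': {(s, l, c): 0.5                                  # placeholder P(F | S, L, C)
              for s in (True, False) for l in (True, False) for c in (True, False)},
    }

    # Predict Storm given observed values of the other variables:
    observed = dict(b=True, c=True, l=False, t=False, f=False)
    scores = {s: joint(s=s, cpt=cpt, **observed) for s in (True, False)}
    print(max(scores, key=scores.get), scores)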

26 / 28  Learning of Bayesian Belief Networks
Several variants of this learning task:
- The network structure might be known or unknown.
- The training examples might provide values of all network variables, or just some.
If the structure is known and all variables are observed, then it is as easy as training a naïve Bayes classifier:
- Initialize the CPTs with pseudocounts.
- If, e.g., a training instance has S, B, and C set, then increment that count in C's table.
- Probability estimates come from normalizing the counts.
Example counts for Campfire:
         S, B   S, ¬B   ¬S, B   ¬S, ¬B
  C       4      1       8       2
  ¬C      6     10       2       8
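
A minimal sketch of this fully observed case for Campfire's table: the counts are the ones shown above, and normalizing each row reproduces the CPT values used earlier (0.4, 0.1, 0.8, 0.2, up to rounding).

    # counts[(S, B)] holds how many training instances had Campfire True/False
    # for that setting of the parents; these are the counts in the table above.
    counts = {(True, True):   {True: 4, False: 6},
              (True, False):  {True: 1, False: 10},
              (False, True):  {True: 8, False: 2},
              (False, False): {True: 2, False: 8}}

    # Normalize each row to get the estimates of P(C | S, B)
    cpt_c = {parents: {c: n / sum(row.values()) for c, n in row.items()}
             for parents, row in counts.items()}

    print(cpt_c[(True, True)][True])    # 4 / 10 = 0.4
    print(cpt_c[(False, True)][True])   # 8 / 10 = 0.8
    print(cpt_c[(True, False)][True])   # 1 / 11 ~ 0.09 (the earlier CPT shows 0.1)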

27 / 28  Learning of Bayesian Belief Networks (cont'd)
Suppose the structure is known and the variables are partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning and Campfire.
This is similar to training a neural network with hidden units; in fact, we can learn the network's conditional probability tables using gradient ascent.
Converge to the network h that (locally) maximizes P(D | h), i.e., the maximum likelihood hypothesis.
Can also use the EM (expectation maximization) algorithm:
- Use observations of the variables to predict their values in the cases when they're not observed.
- EM has many other applications, e.g., hidden Markov models (HMMs).

28 / 28  Bayesian Belief Networks: Summary
Combine prior knowledge with observed data.
The impact of prior knowledge (when correct!) is to lower the sample complexity.
Active research area:
- Extend from boolean to real-valued variables
- Parameterized distributions instead of tables
- More effective inference methods