CSCE 478/878 Lecture 6: Bayesian Learning and Graphical Models. Stephen Scott.


(Adapted from Ethem Alpaydin and Tom Mitchell; sscott@cse.unl.edu)

Introduction

We might have reasons (domain information) to favor some hypotheses/predictions over others a priori. Bayesian methods work with probabilities and have two main roles:

1. Provide practical learning algorithms:
   - Naïve Bayes learning
   - Bayesian belief network learning
   - Combine prior knowledge (prior probabilities) with observed data
   - Require prior probabilities

2. Provide a useful conceptual framework:
   - A "gold standard" for evaluating other learning algorithms
   - Additional insight into Occam's razor

Outline

- Bayes theorem
- Bayes optimal classifier
- Naïve Bayes classifier
- Example: Learning over text data
- Bayesian belief networks

Bayes Theorem

We want to know the probability that a particular label r is correct given that we have seen data D.

Conditional probability: P(r | D) = P(r ∧ D) / P(D)

Bayes theorem:

  P(r | D) = P(D | r) P(r) / P(D)

- P(r) = prior probability of label r (might include domain information)
- P(D) = probability of data D
- P(r | D) = probability of r given D
- P(D | r) = probability of D given r

Note: P(r | D) increases with P(D | r) and P(r), and decreases with P(D).

Example: Does a patient have cancer or not?

A patient takes a lab test and the result is positive. The test returns a correct positive result in 98% of the cases in which the disease is actually present, and a correct negative result in 97% of the cases in which the disease is not present. Furthermore, 0.008 of the entire population have this cancer.

  P(cancer) = 0.008        P(¬cancer) = 0.992
  P(+ | cancer) = 0.98     P(− | cancer) = 0.02
  P(+ | ¬cancer) = 0.03    P(− | ¬cancer) = 0.97

Now consider a new patient for whom the test is positive. What is our diagnosis?

  P(+ | cancer) P(cancer) = 0.98 × 0.008 ≈ 0.0078
  P(+ | ¬cancer) P(¬cancer) = 0.03 × 0.992 ≈ 0.0298

So the diagnosis is ¬cancer.

Basic Formulas for Probabilities

- Product rule: the probability P(A ∧ B) of a conjunction of two events A and B:
    P(A ∧ B) = P(A | B) P(B) = P(B | A) P(A)
- Sum rule: the probability of a disjunction of two events A and B:
    P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^n P(A_i) = 1, then
    P(B) = Σ_{i=1}^n P(B | A_i) P(A_i)
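To make the arithmetic concrete, here is a minimal Python sketch of the cancer-test computation above, using the theorem of total probability for P(+) and Bayes theorem for the posterior; the variable names are ours, and the numbers are the ones stated in the example.

```python
# Bayes-theorem diagnosis for the cancer-test example above.
# Numbers come from the stated problem: prior 0.008, sensitivity 0.98, specificity 0.97.

p_cancer = 0.008
p_not_cancer = 1 - p_cancer            # 0.992
p_pos_given_cancer = 0.98              # correct positive rate
p_pos_given_not_cancer = 1 - 0.97      # 0.03 false-positive rate

# Theorem of total probability: P(+) = P(+|cancer)P(cancer) + P(+|no cancer)P(no cancer)
p_pos = p_pos_given_cancer * p_cancer + p_pos_given_not_cancer * p_not_cancer

# Bayes theorem: P(cancer | +) = P(+ | cancer) P(cancer) / P(+)
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos

print(f"P(+)          = {p_pos:.4f}")               # ~0.0376
print(f"P(cancer | +) = {p_cancer_given_pos:.2f}")  # ~0.21, so the diagnosis is "no cancer"
```

Even with a positive test, the posterior is only about 0.21, which is why the most probable diagnosis above comes out as ¬cancer.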

Applying Bayes Rule

Bayes rule lets us get a handle on the most probable label for an instance. The Bayes optimal classification of instance x is

  argmax_{r_j ∈ R} P(r_j | x),

where R is the set of possible labels (e.g., {+, −}).

Let instance x be described by attributes ⟨x_1, x_2, ..., x_n⟩. Then the most probable label of x is

  r* = argmax_{r_j ∈ R} P(r_j | x_1, x_2, ..., x_n)
     = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j) / P(x_1, x_2, ..., x_n)
     = argmax_{r_j ∈ R} P(x_1, x_2, ..., x_n | r_j) P(r_j)

On average, no other classifier using the same prior information and the same hypothesis space can outperform Bayes optimal ⇒ the gold standard for classification.

In other words, if we can estimate P(r_j) and P(x_1, x_2, ..., x_n | r_j) for all possibilities, then we can give a Bayes-optimal prediction of the label of x for all x. How do we estimate P(r_j) from training data? What about P(x_1, x_2, ..., x_n | r_j)?

Naïve Bayes Classifier

Problem: Estimating P(r_j) is easily done, but there are exponentially many combinations of values of x_1, ..., x_n. E.g., if we want to estimate P(Sunny, Hot, High, Weak | PlayTennis = No) from the data, we need to count, among the No-labeled instances, how many exactly match x (few or none).

Naïve Bayes assumption:

  P(x_1, x_2, ..., x_n | r_j) = Π_i P(x_i | r_j),

so the naïve Bayes classifier is

  r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j).

Now we have only a polynomial number of probabilities to estimate.

Along with decision trees, neural networks, nearest neighbor, SVMs, and boosting, this is one of the most practical learning methods. When to use it:
- A moderate or large training set is available
- The attributes that describe instances are conditionally independent given the classification

Successful applications: diagnosis, classifying text documents.

Naïve Bayes Algorithm

Naive_Bayes_Learn(examples):
  For each target value r_j:
    1. P̂(r_j) ← estimate P(r_j) = fraction of examples with label r_j
    2. For each value v_ik of each attribute x_i ∈ x:
         P̂(v_ik | r_j) ← estimate P(v_ik | r_j) = fraction of r_j-labeled instances with v_ik

Classify_New_Instance(x):
  r_NB = argmax_{r_j ∈ R} P̂(r_j) Π_{x_i ∈ x} P̂(x_i | r_j)

Training examples (Mitchell, Table 3.2):

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No

Instance x to classify: ⟨Outlook = Sunny, Temp = Cool, Humid = High, Wind = Strong⟩.
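Below is a minimal Python sketch of the Naive_Bayes_Learn / Classify_New_Instance procedure above, run on the PlayTennis table with raw frequency estimates (no smoothing); the function and variable names are ours.

```python
from collections import Counter, defaultdict

# PlayTennis training data from the table above.
# Attribute order: Outlook, Temperature, Humidity, Wind; the last field is the label.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),     ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

label_counts = Counter(row[-1] for row in data)   # for the P(r_j) estimates
value_counts = defaultdict(Counter)               # for the P(v_ik | r_j) estimates
for *attrs, label in data:
    for i, v in enumerate(attrs):
        value_counts[label][(i, v)] += 1

def classify(x):
    """Return argmax_r P(r) * prod_i P(x_i | r), using raw frequency estimates."""
    n = len(data)
    scores = {r: nr / n for r, nr in label_counts.items()}
    for r, nr in label_counts.items():
        for i, v in enumerate(x):
            scores[r] *= value_counts[r][(i, v)] / nr
    return max(scores, key=scores.get), scores

print(classify(("Sunny", "Cool", "High", "Strong")))
# ('No', ...) with scores No ~ 0.0206 and Yes ~ 0.0053, matching the worked example below
```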

Naïve Bayes Example

Assign label r_NB = argmax_{r_j ∈ R} P(r_j) Π_i P(x_i | r_j):

  P(y) P(sun | y) P(cool | y) P(high | y) P(strong | y) = (9/14)(2/9)(3/9)(3/9)(3/9) = 0.0053
  P(n) P(sun | n) P(cool | n) P(high | n) P(strong | n) = (5/14)(3/5)(1/5)(4/5)(3/5) = 0.0206

So r_NB = n.

Naïve Bayes: Subtleties

The conditional independence assumption is often violated, i.e.,

  P(x_1, x_2, ..., x_n | r_j) ≠ Π_i P(x_i | r_j),

but naïve Bayes works surprisingly well anyway. Note that we don't need the estimated posteriors P̂(r_j | x) to be correct; we need only that

  argmax_{r_j ∈ R} P̂(r_j) Π_i P̂(x_i | r_j) = argmax_{r_j ∈ R} P(r_j) P(x_1, ..., x_n | r_j).

Sufficient conditions are given in Domingos & Pazzani [1996].

What if none of the training instances with target value r_j have attribute value v_ik? Then P̂(v_ik | r_j) = 0, and so P̂(r_j) Π_i P̂(v_ik | r_j) = 0. The typical solution is to use the m-estimate:

  P̂(v_ik | r_j) ← (n_c + m p) / (n + m),

where
- n is the number of training examples for which r = r_j,
- n_c is the number of examples for which r = r_j and x_i = v_ik,
- p is a prior estimate for P̂(v_ik | r_j),
- m is the weight given to the prior (i.e., the number of "virtual" examples).

These virtual examples are sometimes called pseudocounts.

Naïve Bayes Example: Text Classification

Target concept Spam? : Document → {+, −}. Each document is a vector of words (one attribute per word position), e.g., x_1 = "each", x_2 = "document", etc. Naïve Bayes is very effective despite the obvious violation of the conditional independence assumption (are the words in an email really independent of those around them?).

Set P(+) = fraction of training emails that are spam, and P(−) = 1 − P(+). To simplify matters, we will assume position independence, i.e., we model only which words appear in spam/not spam, not their positions. ⇒ For every word w in our vocabulary, P(w | +) = probability that w appears in any position of +-labeled training emails (factoring in the prior via the m-estimate).

Naïve Bayes Example: Text Classification Pseudocode [Mitchell]

(Pseudocode figure from Mitchell; not reproduced in this transcription.)

Bayesian Belief Networks

Sometimes the naïve Bayes assumption of conditional independence is too restrictive, but inferring probabilities is intractable without some such assumption. Bayesian belief networks (also called Bayes nets) describe conditional independence among subsets of variables. This allows combining prior knowledge about dependencies among variables with observed training data.
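A small Python sketch of the m-estimate described above; the particular choices p = 1/3 and m = 3 are ours, chosen only for illustration.

```python
def m_estimate(n_c, n, p, m):
    """m-estimate of P(v_ik | r_j): (n_c + m*p) / (n + m).

    n_c: number of r_j-labeled examples with x_i = v_ik
    n:   number of r_j-labeled examples
    p:   prior estimate of P(v_ik | r_j), e.g. 1/k for k possible attribute values
    m:   weight given to the prior (number of "virtual" examples / pseudocounts)
    """
    return (n_c + m * p) / (n + m)

# Outlook = Overcast never occurs among the 5 "No" examples in the PlayTennis table,
# so the raw estimate 0/5 would zero out the whole product for the "No" label.
# With a uniform prior p = 1/3 over {Sunny, Overcast, Rain} and m = 3 virtual examples:
print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125 instead of 0.0
```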

Bayesian Belief Networks: Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

  (∀ x_i, y_j, z_k)  P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k).

More compactly, we write P(X | Y, Z) = P(X | Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

  P(Thunder | Rain, Lightning) = P(Thunder | Lightning).

Naïve Bayes uses conditional independence and the product rule to justify

  P(X, Y | Z) = P(X | Y, Z) P(Y | Z) = P(X | Z) P(Y | Z).

Bayesian Belief Networks

(Figure: example network over Storm (S), BusTourGroup (B), Lightning (L), Campfire (C), Thunder (T), and ForestFire (F), with the conditional probability table for Campfire given Storm and BusTourGroup: columns S,B / S,¬B / ¬S,B / ¬S,¬B.)

A network represents a set of conditional independence assertions: each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors. The graph is a directed acyclic graph.

Bayesian Belief Networks: Special Case

Since each node is conditionally independent of its nondescendants given its immediate predecessors, what model does this represent when one node is the class and the x_i's are the attributes?

Bayesian Belief Networks: Causality

We can think of the edges in a Bayes net as representing a causal relationship between nodes. E.g., rain causes wet grass: the probability of wet grass depends on whether there is rain.

Bayesian Belief Networks: Generative Models

We sometimes call Bayes nets generative (vs. discriminative) models, since they can be used to generate instances ⟨Y_1, ..., Y_n⟩ according to the joint distribution. A Bayes net represents the joint probability distribution over the variables ⟨Y_1, ..., Y_n⟩, e.g., P(Storm, BusTourGroup, ..., ForestFire). In general, for y_i = value of Y_i,

  P(y_1, ..., y_n) = Π_{i=1}^n P(y_i | Parents(Y_i)),

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. E.g.,

  P(S, B, C, L, T, F) = P(S) P(B) P(C | B, S) P(L | S) P(T | L) P(F | S, L, C).

Bayesian Belief Networks: Predicting the Most Likely Label

We can use a Bayes net for classification. The label r to predict is one of the variables, represented by a node. If we can determine the most likely value of r given the rest of the nodes, we can predict the label. One idea: go through all possible values of r, compute the joint distribution (previous slide) with that value and the other attribute values, then return the value that maximizes it.
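As a sketch of the factorization and the "enumerate the label values" idea above: the network structure (S, B → C; S → L; L → T; S, L, C → F) follows the slides, but every conditional probability number below is invented purely for illustration, as are the helper names.

```python
# Hypothetical CPTs for the Storm (S), BusTourGroup (B), Campfire (C),
# Lightning (L), Thunder (T), ForestFire (F) network; all numbers are made up.
P_C = {  # P(C=True | B=b, S=s)
    (True, True): 0.80, (True, False): 0.50,
    (False, True): 0.10, (False, False): 0.05,
}
P_F = {  # P(F=True | S=s, L=l, C=c)
    (True, True, True): 0.90,  (True, True, False): 0.60,
    (True, False, True): 0.30, (True, False, False): 0.05,
    (False, True, True): 0.70, (False, True, False): 0.40,
    (False, False, True): 0.20, (False, False, False): 0.01,
}

def bern(p_true, value):
    """Probability of a boolean value under a distribution with P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(s, b, c, l, t, f):
    """P(s,b,c,l,t,f) = P(s) P(b) P(c|b,s) P(l|s) P(t|l) P(f|s,l,c)."""
    return (bern(0.10, s)                      # P(S)
            * bern(0.30, b)                    # P(B)
            * bern(P_C[(b, s)], c)             # P(C | B, S)
            * bern(0.70 if s else 0.01, l)     # P(L | S)
            * bern(0.90 if l else 0.05, t)     # P(T | L)
            * bern(P_F[(s, l, c)], f))         # P(F | S, L, C)

def most_likely_storm(b, c, l, t, f):
    """Enumerate the two values of the label S and return the one maximizing the joint."""
    return max([True, False], key=lambda s: joint(s, b, c, l, t, f))

print(most_likely_storm(b=False, c=False, l=True, t=True, f=True))   # True under these CPTs
```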

Bayesian Belief Networks: Predicting the Most Likely Label (continued)

E.g., if Storm (S) is the label to predict and we are given the values of B, C, L, T, and F, we can use the formula to compute P(S, B, C, L, T, F) and P(¬S, B, C, L, T, F), then predict the more likely one. This easily handles unspecified attribute values. Issue: it takes time exponential in the number of values of the unspecified attributes. A more efficient approach is Pearl's message passing algorithm for chains, trees, and polytrees (at most one path between any pair of nodes).

Learning of Bayesian Belief Networks

There are several variants of this learning task:
- The network structure might be known or unknown.
- The training examples might provide values of all network variables, or just some.

If the structure is known and all variables are observed, then learning is as easy as training a naïve Bayes classifier: initialize the conditional probability tables (CPTs) with pseudocounts; if, e.g., a training instance has S, B, and C set, then increment that count in C's table; probability estimates come from normalizing the counts. E.g., counts for C's table:

        S,B   S,¬B   ¬S,B   ¬S,¬B
   C     4     1      8      2
  ¬C     6    10      2      8

Now suppose the structure is known but the variables are only partially observable, e.g., we observe ForestFire, Storm, BusTourGroup, and Thunder, but not Lightning and Campfire. This is similar to training a neural network with hidden units; in fact, we can learn the network's conditional probability tables using gradient ascent, converging to a network h that (locally) maximizes P(D | h), i.e., a maximum likelihood hypothesis. We can also use the EM (expectation maximization) algorithm: use the observations of the variables to predict their values in the cases when they are not observed. EM has many other applications, e.g., hidden Markov models (HMMs).

Summary

- Bayesian methods combine prior knowledge with observed data.
- The impact of prior knowledge (when correct!) is to lower the sample complexity.
- Active research area:
  - Extending from boolean to real-valued variables
  - Parameterized distributions instead of tables
  - More effective inference methods
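Finally, a minimal Python sketch of the fully observed counting procedure from the learning slides above (estimate Campfire's table from counts initialized with pseudocounts); the observations and the pseudocount value are made up for illustration.

```python
from collections import defaultdict

pseudocount = 1.0
counts = defaultdict(lambda: pseudocount)     # keys: (s, b, c); values start at the pseudocount

# Hypothetical fully observed training instances: (Storm, BusTourGroup, Campfire).
observations = [
    (True, True, True), (True, True, True), (True, False, False),
    (False, True, True), (False, True, False), (False, False, False),
]
for s, b, c in observations:
    counts[(s, b, c)] += 1                    # increment the matching cell of C's table

def p_campfire(c, s, b):
    """P(C=c | S=s, B=b), obtained by normalizing the counts over the two values of C."""
    total = counts[(s, b, True)] + counts[(s, b, False)]
    return counts[(s, b, c)] / total

print(p_campfire(True, s=True, b=True))       # (1 + 2) / ((1 + 2) + (1 + 0)) = 0.75
```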