Bayesian Learning: Examples, Conditional Probability, Two Roles for Bayesian Methods, Prior Probability and Random Variables, the Chain Rule

Bayesian Learning

Bayesian learning algorithms use probability theory as an approach to concept classification. Bayesian approaches typically do not generate an explicit hypothesis; statistical data is used to represent implicit hypotheses. Bayesian classifiers generally produce probabilities for (possibly multiple) class assignments, rather than a single definite classification. To use Bayesian techniques, we often need to make independence assumptions that often aren't valid (but we make them anyway!).

Two Roles for Bayesian Methods

1. Provides practical learning algorithms:
   - Naive Bayes learning
   - Bayesian belief network learning
   - Combine prior knowledge (prior probabilities) with observed data
   - Requires prior probabilities
2. Provides a useful conceptual framework:
   - Provides a "gold standard" for evaluating other learning algorithms
   - Additional insight into Occam's razor

Prior Probability and Random Variables

P(A) represents the prior or unconditional probability that statement A is true, in the absence of other information. A random variable represents the different outcomes of an event. Each random variable has a domain of possible values x_1, ..., x_n, each of which has a probability. The probabilities for all possible outcomes must sum to 1. For example:

P(Disease=CAVITY) = 0.5
P(Disease=GUM DISEASE) = 0.3
P(Disease=IMPACTED TOOTH) = 0.1
P(Disease=ROOT INFECTION) = 0.1

Examples

My mood can take 2 possible values: happy, sad. The weather can take 3 possible values: sunny, rainy, cloudy. My friends know me pretty well and say that:

P(Mood=happy | Weather=rainy) = 0.25
P(Mood=happy | Weather=sunny) = 0.4
P(Mood=happy | Weather=cloudy) = 0.05

Can you compute: P(Mood=happy)? P(Mood=sad)? P(Weather=rainy)?

Conditional Probability

P(A|B) represents the probability of A given that B is known to be true. We call this a conditional or posterior probability. P(A|B) = 1 is equivalent to B ⇒ A. For example, suppose Rover rarely howls: P(Rover howls) = 0.01. But when there is a full moon, he always howls! Then P(Rover howls | full moon) = 1.0.

The Chain Rule

P(A|B) = P(A ∧ B) / P(B)

We can rewrite this as:

P(A ∧ B) = P(A|B) P(B)

which is called the chain rule because we can chain together probabilities to compute the likelihood of conjunctions. Example:

P(A) = 0.5
P(B|A) = 0.6
P(C|A ∧ B) = 0.8

What is P(A ∧ B ∧ C)?
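The chain-rule question above can be answered by multiplying the given factors. A minimal sketch in Python, using only the numbers from the example:

```python
# Chain rule: P(A ∧ B ∧ C) = P(A) * P(B | A) * P(C | A ∧ B)
p_a = 0.5           # P(A)
p_b_given_a = 0.6   # P(B | A)
p_c_given_ab = 0.8  # P(C | A ∧ B)

p_abc = p_a * p_b_given_a * p_c_given_ab
print(p_abc)  # ≈ 0.24
```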

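The mood/weather exercise above cannot be answered from the conditionals alone: P(Mood=happy) also requires a prior distribution over the weather (and P(Weather=rainy) cannot be recovered from these numbers at all). The sketch below applies the theorem of total probability under an assumed, purely illustrative weather prior; the conditionals are the ones from the slide.

```python
# Theorem of total probability:
#   P(Mood=happy) = sum_w P(Mood=happy | Weather=w) * P(Weather=w)
# The conditionals come from the example; the weather prior below is an
# ASSUMED, illustrative distribution -- the slide deliberately omits it.
p_happy_given = {"rainy": 0.25, "sunny": 0.4, "cloudy": 0.05}
p_weather = {"rainy": 0.3, "sunny": 0.5, "cloudy": 0.2}  # hypothetical prior

p_happy = sum(p_happy_given[w] * p_weather[w] for w in p_weather)
p_sad = 1.0 - p_happy  # Mood only takes the two values happy/sad
print(p_happy, p_sad)  # 0.285 0.715 under the assumed prior
```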
Bayes Theorem Applied to Machine Learning

P(h|D) = P(D|h) P(h) / P(D)

- P(h) = prior probability of hypothesis h
- P(D) = prior probability of training data D
- P(h|D) = probability of h given D
- P(D|h) = probability of D given h

Bayes Rule

Simple form:

P(A|B) = P(A) P(B|A) / P(B)

General form:

P(A|B,X) = P(A|X) P(B|A,X) / P(B|X)

Choosing Hypotheses

Generally we want the most probable hypothesis given the training data. The maximum a posteriori hypothesis h_MAP:

h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h) / P(D) = argmax_{h ∈ H} P(D|h) P(h)

If we assume P(h_i) = P(h_j) for all i, j, then we can simplify further. The maximum likelihood hypothesis:

h_ML = argmax_{h_i ∈ H} P(D|h_i)

Basic Formulas for Probabilities

- Product (or chain) rule: probability of a conjunction of two events:
  P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
- Sum rule: probability of a disjunction of two events:
  P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
- Theorem of total probability: if events A_1, ..., A_n are mutually exclusive with Σ_{i=1}^n P(A_i) = 1, then
  P(B) = Σ_{i=1}^n P(B|A_i) P(A_i)

Increasing Probability with More Data

[Figure: three panels (a), (b), (c) plotting P(h), P(h|D1), and P(h|D1,D2) over the space of hypotheses, showing the posterior concentrating as more data is observed.]

Bayes Rule Example

Let S = patient has a stiff neck, M = disease is meningitis. Assume that:

P(S|M) = 0.5
P(M) = 1/5000
P(S) = 1/20

P(M|S) = ?
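The stiff-neck question follows directly from Bayes rule. A minimal sketch with the numbers given above:

```python
# Bayes rule: P(M | S) = P(S | M) * P(M) / P(S)
p_s_given_m = 0.5   # P(stiff neck | meningitis)
p_m = 1 / 5000      # P(meningitis)
p_s = 1 / 20        # P(stiff neck)

p_m_given_s = p_s_given_m * p_m / p_s
print(p_m_given_s)  # ≈ 0.002: a stiff neck is still very unlikely to be meningitis
```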

Brute Force MAP Hypothesis Learner

1. For each hypothesis h in H, calculate the posterior probability:
   P(h|D) = P(D|h) P(h) / P(D)
2. Output the hypothesis h_MAP with the highest posterior probability P(h|D).

Relation to Concept Learning

Consider our usual concept learning task: instance space X, hypothesis space H, training examples D. Consider the FIND-S learning algorithm (outputs the most specific hypothesis from the version space VS_{H,D}). What would Bayes rule produce as the MAP hypothesis? Does FIND-S output a MAP hypothesis?

Relation to Concept Learning, II

Assume a fixed set of instances x_1, ..., x_n, and assume D is the set of classifications c(x_1), ..., c(x_n). Choose P(D|h):

P(D|h) = 1 if h is consistent with D
P(D|h) = 0 otherwise

Choose P(h) to be the uniform distribution: P(h) = 1/|H| for all h in H. Then:

P(h|D) = 1/|VS_{H,D}| if h is consistent with D
P(h|D) = 0 otherwise

Most Probable Classification of New Instances

So far we've sought the most probable hypothesis given the data D (i.e., h_MAP). Given a new instance x, what is its most probable classification? h_MAP(x) is not necessarily the most probable classification! Consider three possible hypotheses:

P(h_1|D) = 0.4, P(h_2|D) = 0.3, P(h_3|D) = 0.3

Given a new instance x:

h_1(x) = +, h_2(x) = −, h_3(x) = −

What is the most probable classification of x?

Bayes Optimal Classifier

argmax_{c_j ∈ C} Σ_{h_i ∈ H} P(c_j|h_i) P(h_i|D)

Example:

P(h_1|D) = 0.4, P(−|h_1) = 0, P(+|h_1) = 1
P(h_2|D) = 0.3, P(−|h_2) = 1, P(+|h_2) = 0
P(h_3|D) = 0.3, P(−|h_3) = 1, P(+|h_3) = 0

therefore

Σ_i P(h_i|D) P(+|h_i) = 0.4
Σ_i P(h_i|D) P(−|h_i) = 0.6

and

argmax_{c_j ∈ C} Σ_i P(c_j|h_i) P(h_i|D) = −

Naive Bayes Classifier

One of the most practical learning methods, along with decision trees, neural networks, nearest neighbor methods, etc. Requires:

1. A moderate or large training set.
2. Attributes that describe instances should be conditionally independent given the classification.

Successful applications include diagnosis and text classification.
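Returning to the three-hypothesis example above, a minimal sketch of the Bayes optimal classification, showing how it can disagree with the single MAP hypothesis:

```python
# Bayes optimal classification weights each hypothesis's vote by its
# posterior, so the answer can differ from the MAP hypothesis's prediction.
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}    # P(h | D)
predictions = {"h1": "+", "h2": "-", "h3": "-"}   # h(x) for the new instance x

def bayes_optimal(posteriors, predictions, classes=("+", "-")):
    # argmax over classes of sum_h P(c | h) * P(h | D); here P(c | h) is 1
    # when h predicts c and 0 otherwise.
    score = {c: sum(p for h, p in posteriors.items() if predictions[h] == c)
             for c in classes}
    return max(score, key=score.get), score

label, score = bayes_optimal(posteriors, predictions)
print(score)  # {'+': 0.4, '-': 0.6}
print(label)  # '-' even though the MAP hypothesis h1 predicts '+'
```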

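The choices in "Relation to Concept Learning, II" (uniform prior, 0/1 likelihood) make the posterior uniform over the version space. A minimal brute-force MAP sketch over a toy, assumed hypothesis space (the predicates and data below are illustrative, not from the slides):

```python
# Brute-force MAP learning with a uniform prior P(h) = 1/|H| and
# P(D | h) = 1 iff h is consistent with D.  Each hypothesis here is just
# a Python predicate over integer instances (a made-up hypothesis space).
H = {
    "always_pos": lambda x: True,
    "gt2":        lambda x: x > 2,
    "even":       lambda x: x % 2 == 0,
}
D = [(4, True), (3, True), (1, False)]  # (instance, label) pairs, assumed data

def consistent(h, D):
    return all(h(x) == y for x, y in D)

def posterior(h, H, D):
    prior = 1.0 / len(H)
    likelihood = 1.0 if consistent(h, D) else 0.0
    evidence = sum((1.0 / len(H)) * (1.0 if consistent(g, D) else 0.0)
                   for g in H.values())
    return likelihood * prior / evidence

for name, h in H.items():
    print(name, posterior(h, H, D))
# Only "gt2" is consistent with D, so the version space has one member and
# that hypothesis gets posterior 1.0; the others get 0.
```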
Naive Bayes Classifier

Assume the target function f : X → C, where each instance x is described by attributes a_1, a_2, ..., a_n. The most probable value of f(x) is:

c_MAP = argmax_{c_j ∈ C} P(c_j | a_1, a_2, ..., a_n)
      = argmax_{c_j ∈ C} P(a_1, a_2, ..., a_n | c_j) P(c_j) / P(a_1, a_2, ..., a_n)
      = argmax_{c_j ∈ C} P(a_1, a_2, ..., a_n | c_j) P(c_j)

Naive Bayes Assumption

To make the problem tractable, we often need to make the following independence assumption:

P(a_1, a_2, ..., a_n | c_j) = Π_i P(a_i | c_j)

which allows us to define the Naive Bayes classifier:

c_NB = argmax_{c_j ∈ C} P(c_j) Π_i P(a_i | c_j)

Naive Bayes Algorithm

Naive_Bayes_Learn(examples):
  For each possible class c_j:
    P̂(c_j) ← estimate P(c_j)
    For each attribute value a_i of each attribute a:
      P̂(a_i|c_j) ← estimate P(a_i|c_j)

Classify_New_Instance(x):
  c_NB = argmax_{c_j ∈ C} P̂(c_j) Π_{a_i ∈ x} P̂(a_i|c_j)

Sparse Data

What if none of the training instances with class c_j have attribute value a_i? Then P̂(a_i|c_j) = 0, and so P̂(c_j) Π_i P̂(a_i|c_j) = 0!

The typical solution is the m-estimate for P̂(a_i|c_j):

P̂(a_i|c_j) ← (n_c + m p) / (n + m)

where:
- n is the number of training examples for which c = c_j
- n_c is the number of examples for which c = c_j and a = a_i
- p is a prior estimate for P̂(a_i|c_j)
- m is the weight given to the prior (i.e., the number of "virtual" examples)

Naive Bayes Example

Consider a medical diagnosis problem with 3 possible diagnoses (Well, Cold, Allergy) and 3 symptoms (sneeze, cough, fever).

                Well    Cold    Allergy
P(d)            0.9     0.05    0.05
P(sneeze|d)     0.1     0.9     0.9
P(cough|d)      0.1     0.8     0.7
P(fever|d)      0.01    0.7     0.4

P(Well | sneeze ∧ cough ∧ fever) = ?

Example, II

Using the same table:

P(Cold | sneeze ∧ cough ∧ fever) = ?
P(Allergy | sneeze ∧ cough ∧ fever) = ?
P(sneeze ∧ cough ∧ fever) = ?
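A minimal sketch of the Naive Bayes computation on the Well/Cold/Allergy table above, treating all three symptoms as observed true (as the queries are written); the normalization constant is the naive-assumption estimate of P(sneeze ∧ cough ∧ fever). The table values are from the slides; the code structure itself is just one illustrative way to organize them.

```python
# Naive Bayes on the Well/Cold/Allergy example.
priors = {"Well": 0.9, "Cold": 0.05, "Allergy": 0.05}
cond = {  # P(symptom = true | diagnosis), from the table above
    "sneeze": {"Well": 0.1,  "Cold": 0.9, "Allergy": 0.9},
    "cough":  {"Well": 0.1,  "Cold": 0.8, "Allergy": 0.7},
    "fever":  {"Well": 0.01, "Cold": 0.7, "Allergy": 0.4},
}
instance = {"sneeze": True, "cough": True, "fever": True}

def naive_bayes(instance):
    # unnormalized score for each class: P(c) * prod_i P(a_i | c)
    scores = {}
    for c, p_c in priors.items():
        s = p_c
        for attr, value in instance.items():
            p = cond[attr][c]
            s *= p if value else (1.0 - p)
        scores[c] = s
    z = sum(scores.values())  # estimate of P(a_1, ..., a_n)
    return {c: s / z for c, s in scores.items()}, z

posterior, evidence = naive_bayes(instance)
print(posterior)  # Cold ≈ 0.67 is most probable, despite Well's 0.9 prior
print(evidence)   # ≈ 0.038 for P(sneeze ∧ cough ∧ fever)
```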

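The m-estimate itself is a one-line formula; a small sketch with made-up counts shows how it keeps a zero-count probability strictly positive:

```python
# m-estimate of a conditional probability, guarding against zero counts:
#   P_hat(a_i | c_j) = (n_c + m * p) / (n + m)
def m_estimate(n_c, n, p, m):
    """n_c: examples with class c_j AND attribute value a_i
    n:   examples with class c_j
    p:   prior estimate for P(a_i | c_j)
    m:   equivalent sample size (number of 'virtual' examples)"""
    return (n_c + m * p) / (n + m)

# If no training example of some class happened to have the value, the raw
# estimate would be 0/10 = 0; with p = 0.5 and m = 2 virtual examples the
# smoothed estimate stays positive.  (The counts here are invented.)
print(m_estimate(n_c=0, n=10, p=0.5, m=2))  # ≈ 0.083
```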
Comments on Naive Bayes

- Generally works quite well despite the blanket assumption of independence.
- Experiments show it can be quite competitive with other methods (e.g., C4.5) on standard data sets.
- Although it does not estimate probabilities accurately when independence is violated, it still often picks the correct category.
- Does not perform search! Has a fairly strong bias.
- The hypothesis is not guaranteed to fit the training data.
- Handles noise well.

Minimum Description Length Principle

Occam's razor: prefer the shortest hypothesis. MDL: prefer the hypothesis that minimizes:

h_MDL = argmin_{h ∈ H} L_C1(h) + L_C2(D|h)

where L_C(x) is the description length of x under encoding C. For example: H = decision trees, D = labels from the training data.

- L_C1(h) is the number of bits to describe tree h
- L_C2(D|h) is the number of bits to describe D given h

Note: L_C2(D|h) = 0 if the examples are classified perfectly by h; we need only describe the exceptions. Hence h_MDL trades off tree size for training errors.

MDL, Continued

h_MAP = argmax_{h ∈ H} log2 P(D|h) + log2 P(h)
      = argmin_{h ∈ H} −log2 P(D|h) − log2 P(h)    (1)

Interesting fact from information theory: the optimal (shortest expected coding length) code for an event with probability p is −log2 p bits. So interpret (1):

- −log2 P(h) is the length of h under the optimal code
- −log2 P(D|h) is the length of D given h under the optimal code

By Occam's razor, prefer the hypothesis that minimizes length(h) + length(misclassifications).

Bayesian Belief Networks

- The Naive Bayes conditional independence assumption is sometimes still too restrictive.
- But it's impossible to estimate all parameters without some such assumptions.
- Bayesian belief networks describe conditional independence among subsets of variables.
- This allows combining prior knowledge about (in)dependencies with observed training data.

Conditional Independence

Definition: X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y given the value of Z; that is, if

(∀ x_i, y_j, z_k) P(X = x_i | Y = y_j, Z = z_k) = P(X = x_i | Z = z_k)

Abbreviated: P(X|Y,Z) = P(X|Z).

Example: Thunder is conditionally independent of Rain, given Lightning:

P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Naive Bayes uses this to justify:

P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

Bayesian Belief Network

[Figure: a directed acyclic graph with nodes Storm, BusTourGroup, Lightning, Campfire, Thunder, and ForestFire, together with the conditional probability table for Campfire (C) given Storm (S) and BusTourGroup (B):]

           S,B    S,¬B   ¬S,B   ¬S,¬B
C          0.4    0.1    0.8    0.2
¬C         0.6    0.9    0.2    0.8

The network represents a set of conditional independence assertions:

- Each node is asserted to be conditionally independent of its nondescendants, given its immediate predecessors.
- Directed acyclic graph.
- Each node has an associated conditional probability table.
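The MDL reading of MAP above can be made concrete by turning probabilities into code lengths with −log2. The prior and likelihood values below are invented numbers for two hypothetical hypotheses, just to show the trade-off:

```python
# Information-theoretic view of MAP: with optimal codes, maximizing
# P(D|h)P(h) is the same as minimizing -log2 P(D|h) - log2 P(h),
# i.e. bits(D given h) + bits(h).
import math

hypotheses = {  # made-up numbers for illustration
    "small_tree": {"prior": 0.10,  "likelihood": 0.02},  # short h, more errors
    "big_tree":   {"prior": 0.001, "likelihood": 0.50},  # long h, fewer errors
}

def description_length(prior, likelihood):
    return -math.log2(likelihood) - math.log2(prior)

for name, h in hypotheses.items():
    print(name, round(description_length(h["prior"], h["likelihood"]), 2))
# The MDL choice is the hypothesis with the smaller total code length,
# which is exactly the MAP hypothesis under these numbers (small_tree).
```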

Interpretation

The network and its associated tables represent the joint probability distribution over all variables and their values. In general,

P(y_1, ..., y_n) = Π_{i=1}^n P(y_i | Parents(Y_i))

where Parents(Y_i) denotes the immediate predecessors of Y_i in the graph. Therefore, the joint distribution is fully defined by the graph plus the conditional probabilities. Example: P(S, L, T, C, F, B) = ?

Learning of Bayesian Networks

There are several variants of this learning task:

- The network structure might be known or unknown.
- The training examples might provide values of all network variables, or just some.

If the structure is known and all variables are observed, then the problem reduces to training as with a Naive Bayes classifier.

Learning Bayes Nets

Suppose the structure is known and the variables are only partially observable. This is similar to training a neural network with hidden units. In fact, we can learn Bayes net conditional probability tables using gradient ascent, converging to the network h that (locally) maximizes P(D|h).

Summary: Bayesian Belief Networks

- Combine prior knowledge with observed data.
- The impact of prior knowledge (when correct!) is to lower the sample complexity.
- Active research areas: extending from boolean to real-valued variables, parameterized distributions instead of tables, extending to first-order instead of propositional systems, more effective inference methods, ...

Statistical Speech Recognition Model

What is the most likely sequence of words that the speech signal represents?

P(words | signal) = P(words) P(signal | words) / P(signal)

- P(words) is the language model. For example, "fat cat" is more likely than "hat cat".
- P(signal | words) is the acoustic model. For example, "cat" is likely to be pronounced as "kæt".
- P(signal) is the likelihood of the speech signal, but this is constant across all interpretations of the signal, so we can ignore it.

The Language Model (P(words))

Ideally, we'd like a complete model of the English language to tell us the exact probability of a sentence. But there is no complete model of English (or any other natural language): you'd need to know the probability of every sentence that could ever be uttered! A grammar should be helpful, but there is no complete grammar for English, and spoken language is notoriously ungrammatical anyway. Statistical speech recognition systems approximate the likelihood of sentences by collecting n-gram statistics of word usage from large spoken language corpora.
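The factorization in the Interpretation slide can be evaluated directly once every node's conditional probability table is known. A minimal sketch for the Storm/Campfire network above: the Campfire table is the one given earlier, while every other probability below is an assumed placeholder, included only to make the factorization runnable.

```python
# Joint probability from the network factorization
#   P(S, B, L, C, T, F) = P(S) P(B) P(L|S) P(C|S,B) P(T|L) P(F|S,L,C)
# Only the Campfire table comes from the slides; all other numbers are
# ASSUMED placeholders.
p_storm = 0.2
p_bus = 0.5
p_lightning_given_storm = {True: 0.7, False: 0.05}
p_campfire_given = {  # P(Campfire=True | Storm, BusTourGroup), from the table
    (True, True): 0.4, (True, False): 0.1,
    (False, True): 0.8, (False, False): 0.2,
}
p_thunder_given_lightning = {True: 0.9, False: 0.01}
p_forestfire_given = {  # P(ForestFire=True | Storm, Lightning, Campfire), assumed
    (True, True, True): 0.5,  (True, True, False): 0.3,
    (True, False, True): 0.2, (True, False, False): 0.05,
    (False, True, True): 0.3, (False, True, False): 0.1,
    (False, False, True): 0.1, (False, False, False): 0.01,
}

def bern(p_true, value):
    # probability of a boolean value given P(value = True)
    return p_true if value else 1.0 - p_true

def joint(S, B, L, C, T, F):
    return (bern(p_storm, S) * bern(p_bus, B)
            * bern(p_lightning_given_storm[S], L)
            * bern(p_campfire_given[(S, B)], C)
            * bern(p_thunder_given_lightning[L], T)
            * bern(p_forestfire_given[(S, L, C)], F))

# e.g. all six variables true
print(joint(True, True, True, True, True, True))  # ≈ 0.0126 under these CPTs
```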

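The speech-model equation above is a noisy-channel ranking: since P(signal) is constant across candidates, word sequences can be compared by P(words) × P(signal | words) alone. A tiny sketch with invented, purely illustrative scores:

```python
# Noisy-channel scoring: P(words | signal) ∝ P(words) * P(signal | words).
# All scores below are invented numbers, just to show the ranking step.
candidates = {
    "fat cat": {"language_model": 1e-5, "acoustic_model": 0.30},
    "hat cat": {"language_model": 1e-7, "acoustic_model": 0.35},
}

def rank(candidates):
    scored = {w: m["language_model"] * m["acoustic_model"]
              for w, m in candidates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

print(rank(candidates))
# "fat cat" wins: its slightly worse acoustic score is outweighed by the
# far more probable word sequence.
```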
N-grams

- A unigram represents a single word. We estimate: P(w) = frequency of w in the corpus / number of words in the corpus.
- A bigram represents a pair of sequential words. We estimate: P(w_2|w_1) = frequency of w_2 following w_1 / frequency of w_1.
- A trigram represents a triple of sequential words. We estimate: P(w_3|w_1,w_2) = frequency of w_3 following w_1, w_2 / frequency of the pair w_1, w_2.

Unigram/Bigram Statistics

Unigram count for each word, and the number of times it appears immediately after each of the previous words OF, IN, IS, ON, TO, FROM, THAT, WITH, LINE, VISION:

Word     Unigram   OF    IN    IS    ON    TO    FROM   THAT   WITH   LINE   VISION
THE      367       179   143   44    44    65    35     30     17     0      0
ON       69        0     0     1     0     0     0      0      0      0      0
OF       281       0     0     2     0     1     0      3      0      4      0
TO       212       0     0     19    0     0     0      0      0      0      1
IS       175       0     0     0     0     0     0      13     0      1      3
A        153       36    36    33    23    21    14     3      15     0      0
THAT     124       0     3     18    0     1     0      0      0      0      0
WE       105       0     0     0     1     0     0      12     0      0      0
LINE     17        1     0     0     0     1     0      0      0      0      0
VISION   13        3     0     0     1     0     1      0      0      0      0

The N-gram Language Model

Ideally, we'd like to compute:

P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1 w_2) ... P(w_n|w_1 ... w_{n-1})

But we rarely have the data necessary to compute those statistics reliably. So we estimate by making independence assumptions and using bigrams (or trigrams). The bigram model is:

P(w_1 ... w_n) = P(w_1) P(w_2|w_1) P(w_3|w_2) ... P(w_n|w_{n-1}) = P(w_1) Π_{i=2}^n P(w_i|w_{i-1})

It does amazingly well at predicting the next word!

Summary

- Probabilistic inference lends power and flexibility, and can incorporate prior probabilistic knowledge.
- We can make statistically invalid assumptions and still get reasonably good results (Naive Bayes).
- It provides insights into algorithms that don't even use probabilities (explicitly).
- Bayesian belief networks let us incorporate even more types of prior knowledge than standard Bayes techniques.
- Many of these techniques have proven their success in many application areas.
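As a small coda to the n-gram slides, a minimal bigram model estimated by counting, as in the N-grams slide; the corpus is an invented toy example:

```python
# Tiny bigram language model: P(w2 | w1) = count(w1 w2) / count(w1).
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()  # invented toy corpus

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

def sentence_prob(words):
    # bigram model: P(w1 ... wn) = P(w1) * prod_i P(w_i | w_{i-1})
    p = unigrams[words[0]] / len(corpus)
    for w1, w2 in zip(words, words[1:]):
        p *= p_bigram(w1, w2)
    return p

print(p_bigram("the", "cat"))              # 2/3: "the" is followed by "cat" twice
print(sentence_prob("the cat sat".split()))  # (3/9) * (2/3) * (1/2) ≈ 0.111
```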