Probability Based Learning


Probability Based Learning Lecture 7, DD2431 Machine Learning J. Sullivan, A. Maki September 2013

Advantages of Probability-Based Methods
Work with sparse training data: more powerful than deterministic methods, such as decision trees, when training data is sparse.
Results are interpretable: more transparent and mathematically rigorous than methods such as ANNs or evolutionary methods.
Tool for interpreting other methods: a framework for formalizing other methods, e.g. concept learning and least squares.

Outline Probability Theory Basics Bayes rule MAP and ML estimation Minimum Description Length principle Naïve Bayes Classifier EM Algorithm

Probability Theory Basics

Random Variables
A random variable x denotes a quantity that is uncertain, e.g. the result of flipping a coin or the result of measuring the temperature. The probability distribution P(x) of a random variable (r.v.) captures the fact that the r.v. will take different values when observed, and that some values occur more often than others.

Random Variables (cont.)
A discrete random variable takes values from a predefined set. For a Boolean discrete random variable this predefined set has two members, e.g. {0, 1} or {yes, no}. A continuous random variable takes values that are real numbers. Note that a probability density can exceed one, but the area under the curve must always be one (for a discrete distribution, the bar heights must sum to one).
(Figures: a bar graph of the probabilities that a biased 6-sided die lands on each face, a Hinton diagram of the probability of observing different weather types in England, and a continuous pdf for the time taken to complete a test, from Computer Vision: models, learning and inference by Simon Prince.)

Joint Probabilities
Consider two random variables x and y, and observe multiple paired instances of them. Some paired outcomes will occur more frequently than others; this information is encoded in the joint probability distribution P(x, y). For a vector x = (x_1, ..., x_K), P(x) denotes the joint probability of its components.
(Figure: a discrete joint pdf, from Computer Vision: models, learning and inference by Simon Prince.)

Joint Probabilities (cont.)
(Figure: examples a)-f) of joint distributions over discrete and continuous variables, from Computer Vision: models, learning and inference by Simon Prince.)

Marginalization
The probability distribution of any single variable can be recovered from a joint distribution by summing in the discrete case,

    P(x) = Σ_y P(x, y)

and by integrating in the continuous case,

    P(x) = ∫ P(x, y) dy
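For the discrete case, the sum above can be sketched directly on a small joint table; the joint probability values below are made up for illustration.

```python
# Marginalization for the discrete case: recover P(x) from a joint table
# P(x, y) by summing over y. The joint values here are illustrative.
joint = {                      # P(x, y) over x in {0, 1}, y in {0, 1}
    (0, 0): 0.1, (0, 1): 0.3,
    (1, 0): 0.4, (1, 1): 0.2,
}

p_x = {}
for (x, y), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p   # P(x) = sum over y of P(x, y)

print(p_x)  # P(x): close to {0: 0.4, 1: 0.6}, up to float rounding
```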

Marginalization (cont.)
(Figure: examples a)-c) of marginal distributions recovered from joint distributions, from Computer Vision: models, learning and inference by Simon Prince.)

Conditional Probability
The conditional probability of x given that y takes the value y* describes the distribution of the r.v. x when y is fixed to y*. It can be recovered from the joint distribution P(x, y):

    P(x | y = y*) = P(x, y = y*) / P(y = y*),  where P(y = y*) = ∫ P(x, y = y*) dx

i.e. extract the appropriate slice of the joint distribution and then normalize it.
(Figure: a joint pdf of x and y and two conditional distributions P(x | y = y1) and P(x | y = y2), from Computer Vision: models, learning and inference by Simon Prince.)

Bayes Rule

    P(y | x) = P(x | y) P(y) / P(x) = P(x | y) P(y) / Σ_y' P(x | y') P(y')

Each term in Bayes rule has a name:
P(y | x) — posterior (what we know about y given x)
P(y) — prior (what we know about y before we consider x)
P(x | y) — likelihood (propensity for observing a certain value of x given a certain value of y)
P(x) — evidence (a constant that ensures the left-hand side is a valid distribution)
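As a minimal sketch of the rule, the following Python function normalizes likelihood times prior by the evidence; the prior and likelihood numbers below are made up purely for illustration.

```python
# Minimal sketch of Bayes' rule for a discrete hypothesis y and an
# observation x. The numbers below are illustrative, not from any dataset.

def posterior(likelihood, prior):
    """Return P(y | x) for every y, given P(x | y) and P(y)."""
    # Evidence P(x) = sum over y of P(x | y) P(y); it normalizes the result.
    evidence = sum(likelihood[y] * prior[y] for y in prior)
    return {y: likelihood[y] * prior[y] / evidence for y in prior}

prior = {"A": 0.7, "B": 0.3}          # P(y)
likelihood = {"A": 0.2, "B": 0.6}     # P(x | y) for the observed x

post = posterior(likelihood, prior)
print(post)  # the posteriors sum to one
```

Note how a hypothesis with a small prior ("B") can still win the posterior if its likelihood is large enough.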

Bayes Rule
In many of our applications y is a discrete variable and x is a multi-dimensional data vector extracted from the world. Then

    P(y | x) = P(x | y) P(y) / P(x)

where the likelihood P(x | y) represents the probability of observing data x given the hypothesis y, the prior P(y) represents the background knowledge of hypothesis y being correct, and the posterior P(y | x) represents the probability that hypothesis y is true after data x has been observed.

Learning and Inference
Bayesian inference: the process of calculating the posterior probability distribution P(y | x) for certain data x.
Bayesian learning: the process of learning the likelihood distribution P(x | y) and the prior probability distribution P(y) from a set of training points {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}.

Example: Which Gender?
Task: determine the gender of a person given their measured hair length.
Notation: let g ∈ {f, m} be a r.v. denoting the gender of a person, and let x be the measured length of the hair.
Information given: the hair length observation was made at a boys' school, thus P(g = m) = .95 and P(g = f) = .05, and we have knowledge of the likelihood distributions P(x | g = f) and P(x | g = m).


Example: Which Gender?
Task: determine the gender of a person given their measured hair length, i.e. calculate P(g | x).
Solution: apply Bayes rule to get

    P(g = m | x) = P(x | g = m) P(g = m) / P(x)
                 = P(x | g = m) P(g = m) / [ P(x | g = f) P(g = f) + P(x | g = m) P(g = m) ]

and then P(g = f | x) = 1 − P(g = m | x).

Selecting the most probable hypothesis
Maximum a posteriori (MAP) estimate: the hypothesis with the highest probability given the observed data,

    y_MAP = argmax_{y ∈ Y} P(y | x)
          = argmax_{y ∈ Y} P(x | y) P(y) / P(x)
          = argmax_{y ∈ Y} P(x | y) P(y)

Maximum likelihood estimate (MLE): the hypothesis with the highest likelihood of generating the observed data,

    y_MLE = argmax_{y ∈ Y} P(x | y)

useful if we do not know the prior distribution or if it is uniform.


Example: Cancer or Not?
Scenario: a patient takes a lab test and the result comes back positive. The test returns a correct positive result in only 98% of the cases in which the disease is actually present, and a correct negative result in only 97% of the cases in which the disease is not present. Furthermore, 0.8% of the entire population have cancer.
Scenario in probabilities:
Priors: P(disease) = .008, P(not disease) = .992
Likelihoods: P(+ | disease) = .98, P(− | disease) = .02, P(+ | not disease) = .03, P(− | not disease) = .97

Example: Cancer or Not?
Find the MAP estimate when the test has returned a positive result:

    y_MAP = argmax_{y ∈ {disease, not disease}} P(y | +) = argmax_{y ∈ {disease, not disease}} P(+ | y) P(y)

Substituting in the correct values:

    P(+ | disease) P(disease) = .98 × .008 = .0078
    P(+ | not disease) P(not disease) = .03 × .992 = .0298

Therefore y_MAP = "not disease". The posterior probabilities:

    P(disease | +) = .0078 / (.0078 + .0298) = .21
    P(not disease | +) = .0298 / (.0078 + .0298) = .79

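The same computation can be checked numerically; the probabilities below are exactly the ones given in the scenario above.

```python
# The cancer example computed numerically: MAP estimate and posteriors
# after a positive test result.
prior = {"disease": 0.008, "not disease": 0.992}          # P(y)
likelihood_pos = {"disease": 0.98, "not disease": 0.03}   # P(+ | y)

# Unnormalized posteriors P(+ | y) P(y); the MAP winner needs no evidence.
unnorm = {y: likelihood_pos[y] * prior[y] for y in prior}
y_map = max(unnorm, key=unnorm.get)

# Normalize by the evidence P(+) to get the actual posteriors.
evidence = sum(unnorm.values())
posterior = {y: unnorm[y] / evidence for y in prior}

print(round(unnorm["disease"], 4))      # 0.0078
print(round(unnorm["not disease"], 4))  # 0.0298
print(y_map)                            # not disease
print(round(posterior["disease"], 2))   # 0.21
```

Despite the positive test, the MAP hypothesis is "not disease", because the disease prior is so small.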

Relation to Occam's Razor
Occam's razor: choose the simplest explanation for the observed data.
Information-theoretic perspective: Occam's razor corresponds to choosing the explanation requiring the fewest bits to represent. The optimal representation requires −log₂ P(y | x) bits to store (remember the Shannon information content).
Minimum description length (MDL) principle: choose the hypothesis

    y_MDL = argmin_{y ∈ Y} −log₂ P(y | x) = argmin_{y ∈ Y} [ −log₂ P(x | y) − log₂ P(y) ]

The MDL estimate is equal to the MAP estimate:

    y_MAP = argmax_{y ∈ Y} [ log₂ P(x | y) + log₂ P(y) ]


Naïve Bayes Classifier

Feature Space: Sources of Noise
Sensors give measurements which can be converted to features. Ideally a feature value is identical for all samples in one class.
(Figure: samples mapped into feature space.)

Feature Space: Sources of Noise
However, in the real world feature values vary within a class because of measurement noise, intra-class variation, and a poor choice of features.
(Figure: samples mapped into feature space.)

Feature Space
End result: a K-dimensional space, in which each dimension is a feature, containing n labelled samples (objects).

Problem: Large Feature Space
The size of the feature space is exponential in the number of features. More features mean a potentially better description of the objects, but also make it more difficult to model P(x | y).
Extreme solution: the Naïve Bayes classifier. All features (dimensions) are regarded as independent, so we model K one-dimensional distributions instead of one K-dimensional distribution.


Naïve Bayes Classifier
One of the most common learning methods. When to use: a moderate or large training set is available, and the features x_i of a data instance x are conditionally independent given the classification (or at least reasonably independent; it still works with a little dependence).
Successful applications: medical diagnosis (symptoms independent), classification of text documents (words independent).

Naïve Bayes Classifier
x is a vector (x_1, ..., x_K) of attribute or feature values. Let Y = {1, 2, ..., |Y|} be the set of possible classes. The MAP estimate of y is

    y_MAP = argmax_{y ∈ Y} P(y | x_1, ..., x_K)
          = argmax_{y ∈ Y} P(x_1, ..., x_K | y) P(y) / P(x_1, ..., x_K)
          = argmax_{y ∈ Y} P(x_1, ..., x_K | y) P(y)

Naïve Bayes assumption:

    P(x_1, ..., x_K | y) = ∏_{k=1}^{K} P(x_k | y)

This gives the Naïve Bayes classifier:

    y_MAP = argmax_{y ∈ Y} P(y) ∏_{k=1}^{K} P(x_k | y)


Example: Play Tennis?
Question: will I go and play tennis given the forecast?
My measurements: forecast ∈ {sunny, overcast, rainy}, temperature ∈ {hot, mild, cool}, humidity ∈ {high, normal}, windy ∈ {false, true}.
Possible decisions: y ∈ {yes, no}

Example: Play Tennis?
What I did in the past: (table of training data — past days with their weather attributes and whether tennis was played).

Example: Play Tennis?
Counts of when I played tennis (did not play):

    Outlook:     sunny 2 (3), overcast 4 (0), rain 3 (2)
    Temperature: hot 2 (2), mild 4 (2), cool 3 (1)
    Humidity:    high 3 (4), normal 6 (1)
    Windy:       false 6 (2), true 3 (3)

Prior of whether I played tennis or not: counts yes 9, no 5, giving prior probabilities P(yes) = 9/14 and P(no) = 5/14.

Likelihood of each attribute value when tennis was played (was not played), P(x_i | y = yes) (P(x_i | y = no)):

    Outlook:     sunny 2/9 (3/5), overcast 4/9 (0/5), rain 3/9 (2/5)
    Temperature: hot 2/9 (2/5), mild 4/9 (2/5), cool 3/9 (1/5)
    Humidity:    high 3/9 (4/5), normal 6/9 (1/5)
    Windy:       false 6/9 (2/5), true 3/9 (3/5)


Example: Play Tennis?
Inference: use the learnt model to classify a new instance x = (sunny, cool, high, true). Apply the Naïve Bayes classifier:

    y_MAP = argmax_{y ∈ {yes, no}} P(y) ∏_{i=1}^{4} P(x_i | y)

    P(yes) P(sunny | yes) P(cool | yes) P(high | yes) P(true | yes) = 9/14 × 2/9 × 3/9 × 3/9 × 3/9 ≈ .005
    P(no) P(sunny | no) P(cool | no) P(high | no) P(true | no) = 5/14 × 3/5 × 1/5 × 4/5 × 3/5 ≈ .021

    ⇒ y_MAP = no
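The tennis inference above can be reproduced in a few lines with exact fractions; the priors and likelihoods are the ones tabulated on the previous slide.

```python
from fractions import Fraction as F

# Naive Bayes inference for the play-tennis example, using the counts
# from the slides.
prior = {"yes": F(9, 14), "no": F(5, 14)}
likelihood = {  # P(x_i | y) for the attribute values of the new instance
    "yes": {"sunny": F(2, 9), "cool": F(3, 9), "high": F(3, 9), "true": F(3, 9)},
    "no":  {"sunny": F(3, 5), "cool": F(1, 5), "high": F(4, 5), "true": F(3, 5)},
}

x = ["sunny", "cool", "high", "true"]  # new instance
score = {}
for y in prior:
    s = prior[y]
    for v in x:
        s *= likelihood[y][v]          # product of P(x_i | y)
    score[y] = s

y_map = max(score, key=score.get)
print(float(score["yes"]))  # about 0.005
print(float(score["no"]))   # about 0.021
print(y_map)                # no
```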

Naïve Bayes: Independence Violation
The conditional independence assumption

    P(x_1, x_2, ..., x_K | y) = ∏_{k=1}^{K} P(x_k | y)

is often violated, but the classifier works surprisingly well anyway. Note that we do not need the posterior probabilities P(y | x) to be correct; we only need y_MAP to be correct. Since dependencies are ignored, naïve Bayes posteriors are often unrealistically close to 0 or 1: different attributes "say the same thing" to a higher degree than we expect, because they are correlated in reality.


Naïve Bayes: Estimating Probabilities
Problem: what if none of the training instances with target value y have attribute value x_i? Then

    P(x_i | y) = 0  ⇒  P(y) ∏_{i=1}^{K} P(x_i | y) = 0

Solution: add as prior knowledge that P(x_i | y) must be larger than 0,

    P(x_i | y) = (n_c + m p) / (n + m)

where n is the number of training samples with label y, n_c is the number of training samples with label y and value x_i, p is a prior estimate of P(x_i | y), and m is the weight given to the prior estimate (relative to the data).

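A tiny sketch of this m-estimate follows. The particular numbers are an illustrative assumption: the zero-count cell overcast/no from the tennis example (0 of 5 class-no samples), a uniform prior p = 1/3 over the three outlook values, and an equivalent sample size m = 3.

```python
# m-estimate of a conditional probability: (n_c + m*p) / (n + m).
# It smooths zero counts toward the prior estimate p.

def m_estimate(n_c, n, p, m):
    """Smoothed estimate of P(x_i | y); never zero when m*p > 0."""
    return (n_c + m * p) / (n + m)

# Outlook = overcast given play = no: 0 of 5 training samples.
# Assumed prior p = 1/3 (three outlook values), weight m = 3.
print(m_estimate(0, 5, 1 / 3, 3))  # 0.125 instead of 0
```

With m = 0 the estimate falls back to the raw frequency n_c / n; larger m pulls it harder toward p.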

Example: Spam detection
Aim: build a classifier to identify spam e-mails.
How (training): create a dictionary of words and tokens W = {w_1, ..., w_L}; these words should be those which are specific to spam or non-spam e-mails. An e-mail is a concatenation, in order, of its words and tokens: e = (e_1, e_2, ..., e_K) with e_i ∈ W. We must model and learn P(e_1, e_2, ..., e_K | spam) and P(e_1, e_2, ..., e_K | not spam).


Example: Spam detection
E-mail E: "Dear customer, A fully licensed Online Pharmacy is offering pharmaceuticals: - brought to you directly from abroad - produced by the same multinational corporations selling through the major US pharmacies - priced up to 5 times cheaper as compared to major US pharmacies. Enjoy the US dollar purchasing power on http://pharmacy-buyonline.com.ua/"
Vector e: ('dear', 'customer', ',', 'a', 'fully', 'licensed', ..., '/')
Concatenate the words from the e-mail into a vector.

Example: Spam detection
Inference: given an e-mail E, compute e = (e_1, ..., e_K) and use Bayes rule to compute

    P(spam | e_1, ..., e_K) ∝ P(e_1, ..., e_K | spam) P(spam)

Example: Spam detection
How is the joint probability distribution P(e_1, ..., e_K | spam) modelled? Remember that K will be very large and will vary from e-mail to e-mail. Make the conditional independence assumption

    P(e_1, ..., e_K | spam) = ∏_{k=1}^{K} P(e_k | spam)

and similarly

    P(e_1, ..., e_K | not spam) = ∏_{k=1}^{K} P(e_k | not spam)

Note that we have assumed the position of a word is not important.


Example: Spam detection
Learning: assume one has n training e-mails and their labels (spam / non-spam),

    S = {(e_1, y_1), ..., (e_n, y_n)}

Note: e_i = (e_i1, ..., e_iK_i).

Example: Spam detection
Create the dictionary: take the union of all the distinctive words and tokens in e_1, ..., e_n to create W = {w_1, ..., w_L}. (Note: words such as "and", "the", ... are omitted.)

Example: Spam detection
Learn the probabilities. For y ∈ {spam, not spam}:

    1. Set P(y) = Σ_{i=1}^{n} Ind(y_i = y) / n, the proportion of e-mails from class y.
    2. Compute n_y = Σ_{i=1}^{n} K_i Ind(y_i = y), the total number of words in the class-y e-mails.
    3. For each word w_l compute n_yl = Σ_{i=1}^{n} Ind(y_i = y) Σ_{k=1}^{K_i} Ind(e_ik = w_l), the number of occurrences of w_l in the class-y e-mails.
    4. Set P(w_l | y) = (n_yl + 1) / (n_y + |W|), which assumes a prior value of 1/|W| for P(w_l | y).

Example: Spam detection
Inference: classify a new e-mail e = (e_1, ..., e_K) with

    y* = argmax_{y ∈ {spam, not spam}} P(y) ∏_{k=1}^{K} P(e_k | y)
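The learning and inference steps above can be sketched end to end. The three-e-mail training set and its words below are invented for illustration; the estimation follows the slides' recipe, including the (n_yl + 1)/(n_y + |W|) smoothing.

```python
from collections import Counter

# Sketch of the slides' spam learner. The tiny training set is made up.
train = [
    (["cheap", "pharmacy", "offer"], "spam"),
    (["meeting", "agenda", "monday"], "not spam"),
    (["cheap", "offer", "now"], "spam"),
]

classes = ["spam", "not spam"]
vocab = {w for words, _ in train for w in words}   # dictionary W
prior = {y: sum(lbl == y for _, lbl in train) / len(train) for y in classes}
counts = {y: Counter() for y in classes}           # n_yl per word
totals = {y: 0 for y in classes}                   # n_y per class
for words, y in train:
    counts[y].update(words)
    totals[y] += len(words)

def p_word(w, y):
    # P(w | y) = (n_yl + 1) / (n_y + |W|), Laplace-smoothed
    return (counts[y][w] + 1) / (totals[y] + len(vocab))

def classify(words):
    score = {y: prior[y] for y in classes}
    for y in classes:
        for w in words:
            score[y] *= p_word(w, y)
    return max(score, key=score.get)

print(classify(["cheap", "pharmacy", "now"]))  # spam
```

Thanks to the smoothing, a word never seen in one class (e.g. "now" for not-spam) does not zero out that class's score.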

Summary so far
Bayesian theory: combines prior knowledge and observed data to find the most probable hypothesis.
Naïve Bayes classifier: all variables considered independent.

Expectation-Maximization (EM) Algorithm

Mixture of Gaussians
This distribution is a weighted sum of K Gaussian distributions,

    P(x) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

where π_1 + ... + π_K = 1 and π_k > 0 (k = 1, ..., K). This model can describe complex multi-modal probability distributions by combining simpler distributions. To ensure that the final distribution is a valid density, the weights must be positive and sum to one.
(Figure: a complex multimodal pdf created as a weighted sum of several constituent normal distributions with different means and variances, from Computer Vision: models, learning and inference by Simon Prince.)

Mixture of Gaussians

    P(x) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

Learning the parameters of this model from training data x_1, ..., x_n is not trivial using the usual straightforward maximum likelihood approach. Instead, we learn the parameters using the Expectation-Maximization (EM) algorithm.

Mixture of Gaussians as a marginalization
We can interpret the mixture of Gaussians model through the introduction of a discrete hidden/latent variable h and a joint distribution P(x, h):

    P(x) = Σ_{k=1}^{K} P(x, h = k) = Σ_{k=1}^{K} P(x | h = k) P(h = k) = Σ_{k=1}^{K} π_k N(x; μ_k, σ_k²)

The hidden variable has a straightforward interpretation: it is the index of the constituent normal distribution.
(Figure: the mixture of Gaussians as a joint distribution P(x, h) marginalized over h, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians
Assume: we know the pdf of x has the form

    P(x) = π_1 N(x; μ_1, σ_1²) + π_2 N(x; μ_2, σ_2²)

where π_1 + π_2 = 1 and π_k > 0 for components k = 1, 2.
Unknown: the values of the (many!) parameters Θ = (π_1, μ_1, σ_1, μ_2, σ_2).
Have: n observed samples x_1, ..., x_n drawn from P(x).
Want to: estimate Θ from x_1, ..., x_n. How would it be possible to get them all?

EM for two Gaussians
For each sample x_i introduce a hidden variable h_i:

    h_i = 1 if sample x_i was drawn from N(x; μ_1, σ_1²)
    h_i = 2 if sample x_i was drawn from N(x; μ_2, σ_2²)

and come up with initial values for each of the parameters,

    Θ^(0) = (π_1^(0), μ_1^(0), σ_1^(0), μ_2^(0), σ_2^(0))

EM is an iterative algorithm which updates Θ^(t) using the following two steps...

EM for two Gaussians: E-step
In the E-step we look at each sample x_i along the hidden variable h: for each data point we calculate the posterior distribution P(h_i | x_i) over the hidden variable. The posterior probability P(h_i = k | x_i) that h_i takes value k can be understood as the responsibility of the k-th Gaussian for sample x_i.
(Figure: E-step for fitting the mixture of Gaussians model; the responsibility of each component for a sample is indicated by the size of the projected data point, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians: E-step (cont.)
E-step: compute the posterior probability (responsibility) that x_i was generated by component k, given the current estimate of the parameters Θ^(t). For i = 1, ..., n and k = 1, 2:

    γ_ik^(t) = P(h_i = k | x_i, Θ^(t)) = π_k^(t) N(x_i; μ_k^(t), σ_k^(t)) / [ π_1^(t) N(x_i; μ_1^(t), σ_1^(t)) + π_2^(t) N(x_i; μ_2^(t), σ_2^(t)) ]

Note: γ_i1^(t) + γ_i2^(t) = 1 and π_1 + π_2 = 1.

EM for two Gaussians: M-step
In the M-step we look along the samples x for each value of h, fitting the Gaussian model for the k-th constituent. Sample x_i contributes according to its responsibility γ_ik: data points more associated with the k-th component have more effect on its parameters.
(Figure: M-step for fitting the mixture of Gaussians model; dashed and solid lines show the fit before and after the update, from Computer Vision: models, learning and inference by Simon Prince.)

EM for two Gaussians: M-step (cont.)
M-step: compute the maximum-likelihood estimates of the parameters of the mixture model given our data's membership distribution, the γ_ik^(t)'s. For k = 1, 2:

    μ_k^(t+1) = Σ_{i=1}^{n} γ_ik^(t) x_i / Σ_{i=1}^{n} γ_ik^(t)

    (σ_k^(t+1))² = Σ_{i=1}^{n} γ_ik^(t) (x_i − μ_k^(t+1))² / Σ_{i=1}^{n} γ_ik^(t)

    π_k^(t+1) = Σ_{i=1}^{n} γ_ik^(t) / n
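The E- and M-steps above can be combined into a complete fitting loop. This is a minimal sketch: the synthetic data (two well-separated Gaussian clusters), the initial parameter values, and the fixed iteration count are all illustrative assumptions.

```python
import math
import random

# EM for a two-component 1D Gaussian mixture, following the E- and
# M-steps on the slides. Data and initialization are illustrative.

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

random.seed(0)
data = [random.gauss(-2.0, 0.5) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]

pi = [0.5, 0.5]        # mixing weights pi_k
mu = [-1.0, 1.0]       # initial means
sigma = [1.0, 1.0]     # initial standard deviations

for _ in range(50):
    # E-step: responsibilities gamma_ik = P(h_i = k | x_i, theta)
    gamma = []
    for x in data:
        w = [pi[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
        s = w[0] + w[1]
        gamma.append([w[0] / s, w[1] / s])
    # M-step: responsibility-weighted re-estimation of the parameters
    for k in (0, 1):
        nk = sum(g[k] for g in gamma)
        mu[k] = sum(g[k] * x for g, x in zip(gamma, data)) / nk
        sigma[k] = math.sqrt(sum(g[k] * (x - mu[k]) ** 2
                                 for g, x in zip(gamma, data)) / nk)
        pi[k] = nk / len(data)

print(mu)  # close to the true means -2.0 and 3.0
```

Each iteration provably does not decrease the data likelihood, and with well-separated clusters the estimates settle near the generating parameters after a few dozen iterations.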

EM in practice
(Figure from Computer Vision: models, learning and inference by Simon Prince.)

Summary
Bayesian theory: combines prior knowledge and observed data to find the most probable hypothesis.
Naïve Bayes classifier: all variables considered independent.
EM algorithm: learns a probability distribution (model parameters) in the presence of hidden variables.
If you are interested in learning more, take a look at: C. M. Bishop, Pattern Recognition and Machine Learning, Springer Verlag, 2006.