Naïve Bayes classification (11/15/16)


Probability theory

Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. Examples: a person's height, the outcome of a coin toss. We distinguish between discrete and continuous variables.

The distribution of a discrete random variable is given by the probabilities of each value it can take, written P(X = x_i). These numbers satisfy

    \sum_i P(X = x_i) = 1

A joint probability distribution for two variables is a table of probabilities p_{ij} = P(X = x_i, Y = y_j). If the two variables are binary, the table has 2^2 - 1 = 3 free parameters (four entries minus one for normalization). Now consider the joint probability of d variables, P(X_1, ..., X_d): if each variable is binary, the joint distribution has 2^d - 1 parameters.

Marginal probability is obtained by summing the joint distribution over the other variable:

    P(X = x_i) = \sum_j P(X = x_i, Y = y_j)

Conditional probability P(Y | X) is the distribution of Y once the value of X is known.
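The marginalization and conditioning operations above are easy to check numerically. The following is a minimal sketch over a 2 x 2 joint table; the probabilities in the table are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical joint distribution P(X, Y) for two binary variables,
# stored as a 2 x 2 table with rows indexed by x and columns by y.
joint = np.array([[0.30, 0.20],
                  [0.10, 0.40]])

# Marginal probability: P(X = x_i) = sum_j P(X = x_i, Y = y_j)
p_x = joint.sum(axis=1)               # -> [0.5, 0.5]

# Conditional probability: P(Y = y_j | X = x_i) = P(X = x_i, Y = y_j) / P(X = x_i)
p_y_given_x = joint / p_x[:, None]    # each row sums to 1

print(p_x)
print(p_y_given_x)
```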

The rules of probability

Marginalization: P(X) = \sum_Y P(X, Y)
Product rule: P(X, Y) = P(Y | X) P(X)
Independence: X and Y are independent if P(Y | X) = P(Y); this implies P(X, Y) = P(X) P(Y).

Using probability in learning

Suppose we have access to P(y | x). We would then make a prediction according to argmax_y P(y | x). This is called the Bayes-optimal classifier. What is the error of such a classifier? For a given x it is the probability of the less likely label, i.e. min(P(y = -1 | x), P(y = +1 | x)) in the binary case, and no classifier can achieve a lower error.

Some classifiers model P(Y | X) directly: this is discriminative learning. However, it is often easier to model P(X | Y), from which we can obtain P(Y | X) using Bayes' rule:

    P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}

This is called generative learning.

Maximum likelihood

Fit a probabilistic model P(x | \theta) to data by estimating \theta. Given independent, identically distributed (i.i.d.) data X = (x_1, x_2, ..., x_n), the likelihood is

    P(X | \theta) = P(x_1 | \theta) P(x_2 | \theta) \cdots P(x_n | \theta)

and the log-likelihood is

    \ln P(X | \theta) = \sum_{i=1}^n \ln P(x_i | \theta)

The maximum likelihood solution is the value of \theta that maximizes \ln P(X | \theta).
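To make the generative route concrete, here is a minimal sketch of a single prediction via Bayes' rule. The prior and class-conditional values are assumptions chosen only for illustration.

```python
import numpy as np

# Made-up class prior P(Y) and class-conditional likelihoods P(x | Y)
# for one observed feature value x (illustrative numbers only).
labels = np.array([-1, +1])
prior = np.array([0.4, 0.6])          # P(Y = -1), P(Y = +1)
likelihood = np.array([0.05, 0.20])   # P(x | Y = -1), P(x | Y = +1)

# Bayes' rule: P(Y | x) = P(x | Y) P(Y) / P(x), with P(x) = sum_Y P(x | Y) P(Y)
evidence = np.sum(likelihood * prior)
posterior = likelihood * prior / evidence

# The Bayes-optimal prediction is argmax_y P(y | x).
print(posterior)                       # [~0.14, ~0.86]
print(labels[np.argmax(posterior)])    # +1
```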

Example: coin toss

Estimate the probability p that a coin lands heads using the result of n coin tosses, h of which resulted in heads.

The likelihood of the data:

    P(X | p) = p^h (1 - p)^{n - h}

The log-likelihood:

    \ln P(X | p) = h \ln p + (n - h) \ln(1 - p)

Taking the derivative and setting it to 0:

    \frac{\partial \ln P(X | p)}{\partial p} = \frac{h}{p} - \frac{n - h}{1 - p} = 0  \implies  p = \frac{h}{n}

Bayes' rule

From the product rule, P(Y, X) = P(Y | X) P(X), and also P(Y, X) = P(X | Y) P(Y). Therefore:

    P(Y | X) = \frac{P(X | Y) P(Y)}{P(X)}

This is known as Bayes' rule. The left-hand side is the posterior; the numerator is the likelihood times the prior. P(X) can be computed as

    P(X) = \sum_Y P(X | Y) P(Y)

Maximum a posteriori and maximum likelihood

The maximum a posteriori (MAP) rule:

    y_{MAP} = \arg\max_Y P(Y | X) = \arg\max_Y \frac{P(X | Y) P(Y)}{P(X)} = \arg\max_Y P(X | Y) P(Y)

The last step holds because P(X) does not depend on Y and so is not important for inferring a label. If we ignore the prior distribution, or assume it is uniform, we obtain the maximum likelihood rule:

    y_{ML} = \arg\max_Y P(X | Y)

A classifier that has access to P(Y | X) is a Bayes-optimal classifier.
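A quick sketch that checks the closed-form estimate p = h/n against a numerical maximization of the log-likelihood; the counts n and h are made-up values for illustration.

```python
import numpy as np

# Coin-toss maximum likelihood with illustrative counts.
n, h = 20, 14                         # 20 tosses, 14 heads (assumed values)

def log_likelihood(p):
    """ln P(X | p) = h ln p + (n - h) ln(1 - p)"""
    return h * np.log(p) + (n - h) * np.log(1 - p)

# Closed form from setting the derivative to zero: p = h / n.
p_closed_form = h / n

# Numerical check: maximize the log-likelihood over a grid of candidates.
grid = np.linspace(0.001, 0.999, 999)
p_numeric = grid[np.argmax(log_likelihood(grid))]

print(p_closed_form, p_numeric)       # both ~0.7
```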

Classification with Bayes' rule

We would like to model P(X | Y), where X is a feature vector and Y is its associated label. Example task: predict whether or not a picnic spot is enjoyable. The training data consist of n rows, each with features X = (X_1, X_2, X_3, ..., X_d) and a label Y.

How many parameters does this model have? The prior P(Y) requires k - 1 parameters for k classes. The likelihood P(X | Y) requires (2^d - 1)k parameters for binary features, far too many to estimate from a realistic amount of data.

Naïve Bayes classifier

Simplifying assumption, conditional independence: given the class label, the features are independent, i.e.

    P(X | Y) = P(x_1 | Y) P(x_2 | Y) \cdots P(x_d | Y)

How many parameters now? Only dk + k - 1 for binary features.

Naïve Bayes decision rule:

    y_{NB} = \arg\max_Y P(X | Y) P(Y) = \arg\max_Y P(Y) \prod_{i=1}^d P(x_i | Y)

If conditional independence holds, Naïve Bayes is an optimal classifier.
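Below is a minimal sketch of the decision rule for binary features, computed in log space to avoid numerical underflow. The class prior and per-feature probabilities are assumed values for illustration; estimating them from data is covered next.

```python
import numpy as np

# Assumed parameters for a 2-class, 3-feature Naive Bayes model (illustrative).
prior = np.array([0.5, 0.5])            # P(Y = 0), P(Y = 1)
theta = np.array([[0.2, 0.7, 0.4],      # P(x_i = 1 | Y = 0), i = 1..3
                  [0.8, 0.3, 0.5]])     # P(x_i = 1 | Y = 1)

def predict(x):
    """y_NB = argmax_Y log P(Y) + sum_i log P(x_i | Y), for a 0/1 feature vector x."""
    log_lik = x * np.log(theta) + (1 - x) * np.log(1 - theta)
    scores = np.log(prior) + log_lik.sum(axis=1)
    return int(np.argmax(scores))

print(predict(np.array([1, 0, 1])))     # -> 1
```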

Training a Naïve Bayes classifier

Training data: feature matrix X (n x d) and labels y_1, ..., y_n.

Maximum likelihood estimates. Class prior:

    \hat{P}(y) = \frac{|\{i : y_i = y\}|}{n}

Likelihood:

    \hat{P}(x_j | y) = \frac{\hat{P}(x_j, y)}{\hat{P}(y)} = \frac{|\{i : X_{ij} = x_j, y_i = y\}| / n}{|\{i : y_i = y\}| / n}

Example: e-mail classification

Suppose our vocabulary contains three words a, b, and c, and we use a multivariate Bernoulli model for our e-mails, with parameters

    \theta_+ = (0.5, 0.67, 0.33)      \theta_- = (0.67, 0.33, 0.33)

This means, for example, that the presence of b is twice as likely in spam (+) as in ham (-).

The e-mail to be classified contains words a and b but not c, and hence is described by the bit vector x = (1, 1, 0). We obtain the likelihoods

    P(x | +) = 0.5 \times 0.67 \times (1 - 0.33) = 0.222
    P(x | -) = 0.67 \times 0.33 \times (1 - 0.33) = 0.148

The ML classification of x is thus spam.
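A short sketch that reproduces the likelihood computation in the e-mail example, using the parameter vectors given on the slide.

```python
import numpy as np

# Multivariate Bernoulli e-mail example: vocabulary (a, b, c), x = (1, 1, 0).
theta_spam = np.array([0.5, 0.67, 0.33])    # P(word present | spam)
theta_ham  = np.array([0.67, 0.33, 0.33])   # P(word present | ham)
x = np.array([1, 1, 0])

def bernoulli_likelihood(theta, x):
    """P(x | class) = prod_i theta_i^{x_i} (1 - theta_i)^{1 - x_i}"""
    return np.prod(theta**x * (1 - theta)**(1 - x))

print(bernoulli_likelihood(theta_spam, x))  # ~0.222
print(bernoulli_likelihood(theta_ham, x))   # ~0.148 -> ML classification: spam
```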

Email classification: training data

    E-mail   a?   b?   c?   Class
    e1       0    1    0    +
    e2       0    1    1    +
    e3       1    0    0    +
    e4       1    1    0    +
    e5       1    1    0    -
    e6       1    0    1    -
    e7       1    0    0    -
    e8       0    0    0    -

What are the parameters of the model? Applying the maximum likelihood estimates above:

    P(+) = 0.5, P(-) = 0.5
    P(a | +) = 0.5,  P(a | -) = 0.75
    P(b | +) = 0.75, P(b | -) = 0.25
    P(c | +) = 0.25, P(c | -) = 0.25

Comments on Naïve Bayes

Usually features are not conditionally independent, i.e.

    P(X | Y) \neq P(x_1 | Y) P(x_2 | Y) \cdots P(x_d | Y)

Despite that, Naïve Bayes is still one of the most widely used classifiers. Training is easy and fast, and it often performs well even when the assumption is violated.

Domingos, P., & Pazzani, M. (1997). Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier. Machine Learning, 29, 103-130.

When there are few training examples

What if you never see a training example where x_1 = a when y = spam? Then the estimate of P(a | spam) is 0, and so

    P(x | spam) = P(a | spam) P(b | spam) P(c | spam) = 0

for any e-mail containing a, no matter what the other features say. What to do? Add virtual examples for which x_1 = a when y = spam.
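One standard way to add such virtual examples is add-one (Laplace) smoothing. The sketch below, assuming a smoothing constant alpha = 1, fits the Bernoulli model to the eight e-mails above; with alpha = 0 it reproduces the unsmoothed estimates on the slide, and with alpha = 1 the smoothed values come out to roughly (0.5, 0.67, 0.33) for spam and (0.67, 0.33, 0.33) for ham, matching the parameters used in the earlier classification example.

```python
import numpy as np

# Bernoulli Naive Bayes training on the 8-e-mail data set, with add-one
# (Laplace) smoothing as one way of adding "virtual examples".
X = np.array([[0,1,0], [0,1,1], [1,0,0], [1,1,0],    # e1-e4: spam (+)
              [1,1,0], [1,0,1], [1,0,0], [0,0,0]])   # e5-e8: ham (-)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])               # 1 = +, 0 = -

def fit(X, y, alpha=1.0):
    """Return class priors and smoothed estimates P(x_j = 1 | y)."""
    classes = np.unique(y)
    prior = np.array([(y == c).mean() for c in classes])
    # (# examples of class c with feature present + alpha) / (# examples of class c + 2*alpha)
    theta = np.array([(X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
                      for c in classes])
    return classes, prior, theta

classes, prior, theta = fit(X, y, alpha=1.0)
print(prior)    # [0.5 0.5]
print(theta)    # row 0: ham (-), row 1: spam (+)
```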

Naïve Bayes for continuous variables

To handle continuous features we need to talk about continuous distributions.

Continuous probability distributions

The probability of the random variable assuming a value within some given interval from x_1 to x_2 is defined to be the area under the graph of the probability density function f(x) between x_1 and x_2. (The slide illustrates this with the uniform and normal/Gaussian densities.)

The Gaussian (normal) distribution

    N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)

Expectations

For discrete variables, E[f] = \sum_x P(x) f(x); for continuous variables, E[f] = \int p(x) f(x) dx. The conditional expectation (discrete case) is E[f | y] = \sum_x P(x | y) f(x), and an expectation can be approximated from samples (discrete and continuous) by E[f] \approx \frac{1}{N} \sum_{n=1}^N f(x_n).
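A small sketch evaluating the Gaussian density and approximating an expectation from samples; the choice of mu = 0, sigma = 1, and f(x) = x^2 is arbitrary and only for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """N(x | mu, sigma^2) = exp(-(x - mu)^2 / (2 sigma^2)) / sqrt(2 pi sigma^2)"""
    return np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)

# Approximate expectation: E[f] ~ (1/N) sum_n f(x_n), here with f(x) = x^2.
print(gaussian_pdf(0.0))        # ~0.3989, the density at the mean
print(np.mean(samples**2))      # ~1.0, i.e. E[x^2] = sigma^2 when mu = 0
```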

Properties of the Gaussian distribution

About 68.26% of the probability mass lies within one standard deviation of the mean (between \mu - 1\sigma and \mu + 1\sigma), 95.44% within two standard deviations, and 99.72% within three.

Standard normal distribution

A random variable having a normal distribution with a mean of 0 and a standard deviation of 1 is said to have a standard normal probability distribution. Converting to the standard normal distribution:

    z = \frac{x - \mu}{\sigma}

We can think of z as a measure of the number of standard deviations x is from \mu.

Gaussian parameter estimation

Given i.i.d. observations x_1, ..., x_N, the likelihood function is

    P(X | \mu, \sigma^2) = \prod_{n=1}^N N(x_n | \mu, \sigma^2)
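An empirical check of the z-score transformation and the percentages above on simulated data; the sample size, mean, and standard deviation are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=10.0, scale=2.0, size=1_000_000)   # mu = 10, sigma = 2

z = (x - x.mean()) / x.std()          # z = (x - mu) / sigma

for k in (1, 2, 3):
    # Fraction of samples within k standard deviations of the mean.
    print(k, np.mean(np.abs(z) <= k))  # ~0.6827, ~0.9545, ~0.9973
```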

Maximum (log) likelihood

Maximizing the log-likelihood \ln P(X | \mu, \sigma^2) with respect to the parameters gives the familiar estimates

    \mu_{ML} = \frac{1}{N} \sum_{n=1}^N x_n        \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^N (x_n - \mu_{ML})^2

Gaussian models

Assume we have data that belongs to three classes, and assume a likelihood that follows a Gaussian distribution.

Gaussian Naïve Bayes

Likelihood function:

    P(X_i = x | Y = y_k) = \frac{1}{\sqrt{2\pi\sigma_{ik}^2}} \exp\left(-\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2}\right)

We need to estimate a mean \mu_{ik} and a variance \sigma_{ik}^2 for each feature i in each class k.
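Here is a minimal Gaussian Naïve Bayes sketch: it estimates a mean and variance per feature and class and then applies the decision rule in log space. The two-class, two-feature toy data are generated on the spot purely for illustration.

```python
import numpy as np

# Toy data: two Gaussian blobs, two features each (illustrative only).
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(50, 2)),
               rng.normal([3.0, 3.0], 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)

classes = np.unique(y)
prior = np.array([(y == k).mean() for k in classes])
mu    = np.array([X[y == k].mean(axis=0) for k in classes])   # mu_ik
var   = np.array([X[y == k].var(axis=0)  for k in classes])   # sigma_ik^2

def log_gaussian(x, mu, var):
    """log N(x | mu, sigma^2), evaluated feature-wise."""
    return -0.5 * np.log(2 * np.pi * var) - (x - mu)**2 / (2 * var)

def predict(x):
    """argmax_k log P(y_k) + sum_i log N(x_i | mu_ik, sigma_ik^2)"""
    scores = np.log(prior) + log_gaussian(x, mu, var).sum(axis=1)
    return classes[np.argmax(scores)]

print(predict(np.array([0.2, -0.1])), predict(np.array([2.8, 3.3])))  # 0 1
```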

Naïve Bayes demo

scikit-learn provides both Gaussian and multinomial/binomial flavors of Naïve Bayes.

Summary

Naïve Bayes classifier:
- What the assumption is (conditional independence of the features given the class)
- Why we make it (it drastically reduces the number of parameters to estimate)
- How we learn it (maximum likelihood estimates, plus virtual examples when data are scarce)
- Naïve Bayes for discrete data
- Gaussian Naïve Bayes
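The demo itself is not part of the transcription; a possible version using scikit-learn's BernoulliNB and GaussianNB is sketched below. The Bernoulli model is fit to the eight-e-mail data set from earlier; the continuous points for GaussianNB are made up.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, GaussianNB

# Bernoulli NB on the 8-e-mail data set (alpha=1.0 gives Laplace smoothing).
X = np.array([[0,1,0], [0,1,1], [1,0,0], [1,1,0],
              [1,1,0], [1,0,1], [1,0,0], [0,0,0]])
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])              # 1 = spam (+), 0 = ham (-)

bnb = BernoulliNB(alpha=1.0).fit(X, y)
print(bnb.predict([[1, 1, 0]]))                     # e-mail containing a and b

# Gaussian NB on a few made-up continuous feature vectors.
Xc = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 4.0], [3.0, 4.2]])
yc = np.array([0, 0, 1, 1])
gnb = GaussianNB().fit(Xc, yc)
print(gnb.predict([[1.1, 2.0]]), gnb.predict_proba([[1.1, 2.0]]))
```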