COMP 328: Machine Learning

Lecture 2: Naive Bayes Classifiers
Nevin L. Zhang
Department of Computer Science and Engineering
The Hong Kong University of Science and Technology
Spring 2010

Two different types of classifiers
Decision tree classifiers: from data, learn decision rules; classify unseen examples by applying the rules.
Naive Bayes classifiers: from data, learn a probabilistic model of the relationship between the class and the attributes; classify unseen examples by inference in the model.

Outline
1. Probabilistic Models and Classification
2. Probabilistic Independence
3. Naive Bayes Model Classifiers
4. Issues
5. Learning to Classify Text

Part 1: Probabilistic Models and Classification

Joint Distribution
The most general way to describe relationships among random variables.
Example: P(Temperature, Wind, PlayTennis)

  Temperature  Wind    PlayTennis  Probability
  Hot          Weak    No          0.1
  Hot          Weak    Yes         0
  Hot          Strong  No          0.2
  Hot          Strong  Yes         0.3
  Cool         Weak    No          0.1
  Cool         Weak    Yes         0.2
  Cool         Strong  No          0.1
  Cool         Strong  Yes         0

One probability value for each combination of the states of the variables.
All the probability values must sum to 1.

Joint Distribution and Classification
Suppose we have a joint distribution. How do we classify? Calculate the relevant probabilities and then classify.
Play tennis on a Hot day with Strong wind?

  P(Play = y | T = h, W = s) = P(Play = y, T = h, W = s) / P(T = h, W = s)
                             = 0.3 / (0.2 + 0.3) = 0.6
  P(Play = n | T = h, W = s) = 1 - P(Play = y | T = h, W = s) = 1 - 0.6 = 0.4

Answer: yes (if we must answer with yes or no).

Joint Distribution and Classification
We can perform classification even with a subset of the attributes.
Play tennis when the Wind is Weak?

  P(Play = y | W = weak) = P(Play = y, W = weak) / P(W = weak)
                         = (0 + 0.2) / (0.1 + 0 + 0.1 + 0.2) = 0.5
  P(Play = n | W = weak) = 1 - 0.5 = 0.5

Answer: with this joint distribution the two classes are equally likely given only Wind = Weak, so the case is borderline.
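To make the marginalization above concrete, here is a small Python sketch (my own, not part of the lecture) that stores the joint table as a dictionary and computes the class posterior for any subset of observed attributes; the names joint and posterior are illustrative.

```python
# Minimal sketch: classify from a full joint distribution by marginalization.
# The table reproduces the P(Temperature, Wind, PlayTennis) example above.

joint = {
    ("Hot",  "Weak",   "No"):  0.1,
    ("Hot",  "Weak",   "Yes"): 0.0,
    ("Hot",  "Strong", "No"):  0.2,
    ("Hot",  "Strong", "Yes"): 0.3,
    ("Cool", "Weak",   "No"):  0.1,
    ("Cool", "Weak",   "Yes"): 0.2,
    ("Cool", "Strong", "No"):  0.1,
    ("Cool", "Strong", "Yes"): 0.0,
}

def posterior(temperature=None, wind=None):
    """Return P(PlayTennis | observed attributes) by summing the joint table
    over all rows consistent with the evidence."""
    totals = {"Yes": 0.0, "No": 0.0}
    for (t, w, play), p in joint.items():
        if temperature is not None and t != temperature:
            continue
        if wind is not None and w != wind:
            continue
        totals[play] += p
    z = sum(totals.values())
    return {play: p / z for play, p in totals.items()}

print(posterior(temperature="Hot", wind="Strong"))  # {'Yes': 0.6, 'No': 0.4}
print(posterior(wind="Weak"))                       # {'Yes': 0.5, 'No': 0.5}
```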

The General Case
Suppose we have a joint distribution P(A_1, A_2, ..., A_n, C) over attributes A_1, A_2, ..., A_n and the class variable C.
For a new example with attribute values a_1, a_2, ..., a_n:
For each possible value v_j of the class variable C, compute the posterior probability
  P(C = v_j | A_1 = a_1, A_2 = a_2, ..., A_n = a_n)
Assign the example to the class with the highest posterior probability:
  c* = arg max over class labels v_j of P(C = v_j | A_1 = a_1, A_2 = a_2, ..., A_n = a_n)

A Difficulty
100 binary attributes and a binary class variable. How many entries in the joint probability table? 2^101.
Too many to handle. Too many to estimate from data; this leads to overfitting.
The Naive Bayes model reduces the number of model parameters by making an independence assumption.

Part 2: Probabilistic Independence

Marginal independence
Two random variables X and Y are marginally independent, written X ⊥ Y, if for any state x of X and any state y of Y,
  P(X = x | Y = y) = P(X = x)  whenever P(Y = y) ≠ 0.
Meaning: learning the value of Y does not give any information about X, and vice versa; Y contains no information about X, and vice versa.
Equivalent definition: P(X = x, Y = y) = P(X = x) P(Y = y).
Shorthand for the equations: P(X | Y) = P(X), P(X, Y) = P(X) P(Y).

Marginal independence
Examples:
X: result of tossing a fair coin for the first time; Y: result of the second toss of the same coin.
X: result of the US election; Y: your grade in this course.
Counterexample: X: midterm exam grade; Y: final exam grade.

Conditional independence
Two random variables X and Y are conditionally independent given a third variable Z, written X ⊥ Y | Z, if
  P(X = x | Y = y, Z = z) = P(X = x | Z = z)  whenever P(Y = y, Z = z) ≠ 0.
Meaning: if I already know the state of Z, then learning the state of Y gives me no additional information about X. Y might contain some information about X, but all the information about X contained in Y is also contained in Z.
Shorthand for the equation: P(X | Y, Z) = P(X | Z).
Equivalent definition: P(X, Y | Z) = P(X | Z) P(Y | Z).

Example of Conditional Independence
There is a bag of 100 coins. 10 coins were made by a malfunctioning machine and are biased toward heads: tossing such a coin results in heads 80% of the time. The other coins are fair.
Randomly draw a coin from the bag and toss it a few times.
X_i: result of the i-th toss; Y: whether the coin was produced by the malfunctioning machine.
The X_i are not marginally independent of each other: if I get 9 heads in the first 10 tosses, then the coin is probably a biased coin, so the next toss is more likely to result in heads than tails. Learning the value of X_i gives me some information about whether the coin is biased, which in turn gives me some information about X_j.

Example of Conditional Independence
However, the X_i are conditionally independent given Y:
If the coin is not biased, the probability of getting heads on one toss is 1/2 regardless of the results of the other tosses.
If the coin is biased, the probability of getting heads on one toss is 80% regardless of the results of the other tosses.
If I already know whether the coin is biased or not, learning the value of X_i gives me no additional information about X_j.
Pictorially, the variables are related as follows (we will return to this picture later): a Coin Type node with arrows to Toss 1 Result, Toss 2 Result, ..., Toss n Result.
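As a sanity check on the coin example, the following Python sketch (my own, not from the slides) estimates the relevant probabilities by simulation: the tosses are dependent marginally, but become independent once we condition on whether the coin is biased.

```python
# Quick numerical check of the coin example: estimate P(X2=H), P(X2=H | X1=H),
# and the same quantities conditioned on Y (coin from the bad machine), by
# simulating the bag of 100 coins (10 biased toward heads at 0.8).
import random

random.seed(0)
N = 200_000
x2_h = x1_h = x1x2_hh = 0
biased_draws = biased_x2_h = biased_x1_h = biased_hh = 0

for _ in range(N):
    biased = random.random() < 0.10          # Y: coin came from the bad machine
    p_head = 0.8 if biased else 0.5
    t1 = random.random() < p_head            # X1
    t2 = random.random() < p_head            # X2
    x2_h += t2
    if t1:
        x1_h += 1
        x1x2_hh += t2
    if biased:
        biased_draws += 1
        biased_x2_h += t2
        if t1:
            biased_x1_h += 1
            biased_hh += t2

print("P(X2=H)                 ", x2_h / N)                      # about 0.53
print("P(X2=H | X1=H)          ", x1x2_hh / x1_h)                # larger: X1 and X2 are dependent
print("P(X2=H | X1=H, Y=biased)", biased_hh / biased_x1_h)       # about 0.8
print("P(X2=H | Y=biased)      ", biased_x2_h / biased_draws)    # also about 0.8: independent given Y
```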

Part 3: Naive Bayes Model Classifiers

The Naive Bayes Model
It assumes that the attributes are mutually independent of each other given the class variable. Graphically, the class variable C is the parent of every attribute A_1, ..., A_n, just like Coin Type and the toss results in the earlier picture.
The joint distribution is given by
  P(C, A_1, A_2, ..., A_n) = P(C) Π_{i=1..n} P(A_i | C)

Learning the Naive Bayes Model
Learning amounts to estimating from data: P(C), P(A_1 | C), ..., P(A_n | C).
Straightforward to compute:
  P̂(C = v_j) = (# of examples with C = v_j) / (total # of examples)
  P̂(A_i = a_i | C = v_j) = (# of examples with C = v_j and A_i = a_i) / (# of examples with C = v_j)
Although simple, these are the maximum likelihood estimates (MLE) of the parameters, which have nice properties.

Classification with the Naive Bayes Model
For a new example with attribute values a_1, a_2, ..., a_n, assign it to the class
  v_NB = arg max_{v_j in V} P̂(C = v_j) Π_{i=1..n} P̂(A_i = a_i | C = v_j)

Example: PlayTennis

  Day  Outlook   Temperature  Humidity  Wind    PlayTennis
  D1   Sunny     Hot          High      Weak    No
  D2   Sunny     Hot          High      Strong  No
  D3   Overcast  Hot          High      Weak    Yes
  D4   Rain      Mild         High      Weak    Yes
  D5   Rain      Cool         Normal    Weak    Yes
  D6   Rain      Cool         Normal    Strong  No
  D7   Overcast  Cool         Normal    Strong  Yes
  D8   Sunny     Mild         High      Weak    No
  D9   Sunny     Cool         Normal    Weak    Yes
  D10  Rain      Mild         Normal    Weak    Yes
  D11  Sunny     Mild         Normal    Strong  Yes
  D12  Overcast  Mild         High      Strong  Yes
  D13  Overcast  Hot          Normal    Weak    Yes
  D14  Rain      Mild         High      Strong  No

Example: Estimating the Parameters

  P(PlayTennis = y) = 9/14            P(PlayTennis = n) = 5/14
  P(Outlook = sunny | y) = 2/9        P(Outlook = sunny | n) = 3/5
  P(Outlook = overcast | y) = 4/9     P(Outlook = overcast | n) = 0/5
  P(Outlook = rain | y) = 3/9         P(Outlook = rain | n) = 2/5
  P(Temp = hot | y) = 2/9             P(Temp = hot | n) = 2/5
  P(Temp = mild | y) = 4/9            P(Temp = mild | n) = 2/5
  P(Temp = cool | y) = 3/9            P(Temp = cool | n) = 1/5
  P(Humidity = high | y) = 3/9        P(Humidity = high | n) = 4/5
  P(Humidity = normal | y) = 6/9      P(Humidity = normal | n) = 1/5
  P(Wind = strong | y) = 3/9          P(Wind = strong | n) = 3/5
  P(Wind = weak | y) = 6/9            P(Wind = weak | n) = 2/5

Example: Classification
New case: (Sunny, Cool, High, Strong). PlayTennis?
Inference:
  P(y) P(sunny | y) P(cool | y) P(high | y) P(strong | y) = 9/14 * 2/9 * 3/9 * 3/9 * 3/9 ≈ 0.005
  P(n) P(sunny | n) P(cool | n) P(high | n) P(strong | n) = 5/14 * 3/5 * 1/5 * 4/5 * 3/5 ≈ 0.021
Conclusion: v_NB = n. No, don't play.
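The slide's numbers can be reproduced with a short Python sketch (my own code, not the lecture's) that learns the Naive Bayes parameters by counting over the PlayTennis table and scores the new case; the names DATA, score, etc. are illustrative.

```python
# Minimal Naive Bayes sketch for the PlayTennis example: MLE by counting,
# no smoothing, then score the new case (Sunny, Cool, High, Strong).
from collections import Counter, defaultdict

ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]
DATA = [  # (Outlook, Temperature, Humidity, Wind, PlayTennis), rows D1-D14
    ("Sunny", "Hot", "High", "Weak", "No"),          ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),      ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),       ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),      ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "High", "Strong", "No"),
]

class_counts = Counter(row[-1] for row in DATA)
cond_counts = defaultdict(int)   # (attribute, value, class) -> count
for row in DATA:
    cls = row[-1]
    for attr, val in zip(ATTRS, row[:-1]):
        cond_counts[(attr, val, cls)] += 1

def score(example, cls):
    """P_hat(C = cls) * prod_i P_hat(A_i = a_i | C = cls), MLE estimates."""
    s = class_counts[cls] / len(DATA)
    for attr, val in zip(ATTRS, example):
        s *= cond_counts[(attr, val, cls)] / class_counts[cls]
    return s

new_case = ("Sunny", "Cool", "High", "Strong")
scores = {cls: score(new_case, cls) for cls in class_counts}
print(scores)                       # roughly {'No': 0.021, 'Yes': 0.005}
print(max(scores, key=scores.get))  # 'No' -> don't play
```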

Part 4: Issues

Zero counts
What if none of the training instances with target value v_j have attribute value a_i? Then P̂(a_i | v_j) = 0, and hence P̂(v_j) Π_i P̂(a_i | v_j) = 0.
A future example containing a_i then has no chance of being classified as v_j, even if all other attribute values suggest v_j.
Smoothing: add a virtual count of 1 to each attribute-value count (Laplace smoothing/correction):
  P̂(A_i = a_i | C = v_j) = ((# of examples with C = v_j and A_i = a_i) + 1) / ((# of examples with C = v_j) + |A_i|)
where |A_i| is the number of possible values of attribute A_i.

Weka does just that
With Laplace smoothing, P(Outlook | yes) becomes {3/12, 5/12, 4/12} for {sunny, overcast, rain} instead of the unsmoothed {2/9, 4/9, 3/9} estimated earlier.
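A tiny sketch (my own, not Weka's code) of the add-one computation that yields those figures:

```python
# Laplace (add-one) smoothing for P(Outlook | PlayTennis = yes).
# Counts come from the PlayTennis table above; names are illustrative.
from fractions import Fraction

outlook_counts_given_yes = {"sunny": 2, "overcast": 4, "rain": 3}   # 9 'yes' examples in total
n_yes = sum(outlook_counts_given_yes.values())
n_values = len(outlook_counts_given_yes)                            # |A_i| = 3 outlook values

smoothed = {
    value: Fraction(count + 1, n_yes + n_values)
    for value, count in outlook_counts_given_yes.items()
}
print(smoothed)   # sunny: 1/4 (= 3/12), overcast: 5/12, rain: 1/3 (= 4/12)
```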

Continuous attributes
What about continuous attributes?
Discretize them: equal-width intervals, or Weka's MDL-based discretization.
Or use a parametric form (e.g., a Gaussian) for P(A_i | C).
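To illustrate the parametric option, here is a small sketch (my own, with made-up temperature values, not data from the lecture) that fits a per-class Gaussian to a continuous attribute and uses its density in place of the conditional probability table.

```python
# Sketch: Gaussian class-conditional density for a continuous attribute.
# The fitted density replaces the table-based P(A_i = a_i | C = v_j) factor.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

def fit_gaussian(values):
    """Fit mean and standard deviation of one attribute within one class."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / (len(values) - 1)
    return mean, math.sqrt(var)

# Hypothetical temperature readings for the two PlayTennis classes.
temps_yes = [69, 72, 75, 68, 70, 64, 75, 72, 81]
temps_no = [85, 80, 72, 65, 71]

mean_y, std_y = fit_gaussian(temps_yes)
mean_n, std_n = fit_gaussian(temps_no)

# Density values used as the P(Temperature = 66 | class) factors in Naive Bayes.
print(gaussian_pdf(66, mean_y, std_y), gaussian_pdf(66, mean_n, std_n))
```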

Conditional independence assumption
Assumption: the A_i are mutually independent of each other given C. This is often violated and might lead to double counting.
To see this, suppose we duplicate A_1, so that the data contain two copies of it: A_1 and A_1'. The information in the data remains the same. However, the classification might change:
  arg max_{v_j in V} P̂(v_j) P̂(A_1 | v_j) P̂(A_1' | v_j) P̂(A_2 | v_j) ...
The evidence from A_1 is counted twice. A numerical illustration follows below.
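A quick numerical illustration (my own, reusing the PlayTennis estimates above): duplicating the Outlook evidence adds no information, but it makes the Naive Bayes posterior noticeably more extreme.

```python
# Double counting demo: feed the same Outlook evidence twice and compare the
# normalized posteriors. Parameter estimates are those from the slides above.

prior = {"Yes": 9 / 14, "No": 5 / 14}
lik = {  # P(attribute value | class) for the new case (Sunny, Cool, High, Strong)
    "Yes": {"sunny": 2 / 9, "cool": 3 / 9, "high": 3 / 9, "strong": 3 / 9},
    "No":  {"sunny": 3 / 5, "cool": 1 / 5, "high": 4 / 5, "strong": 3 / 5},
}

def posterior(evidence):
    scores = {}
    for cls in prior:
        s = prior[cls]
        for value in evidence:
            s *= lik[cls][value]
        scores[cls] = s
    z = sum(scores.values())
    return {cls: s / z for cls, s in scores.items()}

print(posterior(["sunny", "cool", "high", "strong"]))           # P(No) ~ 0.80
print(posterior(["sunny", "sunny", "cool", "high", "strong"]))  # P(No) ~ 0.91: Outlook counted twice
```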

Conditional independence assumption
Nevertheless, the Naive Bayes classifier often works surprisingly well.
Reason: although P̂(v_j) Π_i P̂(a_i | v_j) might be a poor estimate of P(v_j, a_1, ..., a_n), for the classification to be correct we only need
  arg max_{v_j in V} P̂(v_j) Π_i P̂(a_i | v_j) = arg max_{v_j in V} P(v_j, a_1, ..., a_n)

Conditional independence assumption: notes
Bayesian (belief) network classifiers relax the assumption.

Overfitting
Overfitting is not an issue for the Naive Bayes classifier because its complexity is fixed. However, it is an issue for Bayesian network classifiers.

Part 5: Learning to Classify Text

Spam and Non-spam E-mails
Classify e-mails into spam and non-spam according to their content.
Classes: S and ¬S.
Attributes: w_1, w_2, ..., w_n, a list of words taken from the training set, after
  stop-word removal (e.g., "a", "the") and
  stemming (e.g., engineering, engineered, engineer -> engineer).
Parameters:
  P(w_i | S): probability that word w_i appears in a spam mail.
  P(w_i | ¬S): probability that word w_i appears in a non-spam mail.
  P(S) and P(¬S): probability of a mail being spam or non-spam.
All of these can be obtained from a training set by counting.

Spam and Non-spam E-mails
A new document D contains words X_1, X_2, ..., X_m, a subset of w_1, w_2, ..., w_n.
  P(D | S) = Π_{i=1..m} P(X_i | S)
  P(D | ¬S) = Π_{i=1..m} P(X_i | ¬S)
The document is classified as spam if
  P(S) Π_{i=1..m} P(X_i | S) > P(¬S) Π_{i=1..m} P(X_i | ¬S)
Can we do this with decision trees?
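Putting the pieces together, here is a toy spam-filter sketch (my own, not from the lecture) that uses Laplace-smoothed word probabilities and compares the two products in log space to avoid underflow; all documents and words are made up.

```python
# Toy Naive Bayes spam filter: Laplace-smoothed P(word | class) and a
# log-space comparison of P(S) * prod P(X_i | S) vs P(not S) * prod P(X_i | not S).
import math
from collections import Counter

spam_docs = [["cheap", "pills", "offer"], ["offer", "win", "money"]]
ham_docs = [["meeting", "notes", "attached"], ["lunch", "offer", "tomorrow"]]

vocab = {w for d in spam_docs + ham_docs for w in d}

def word_probs(docs):
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    # Laplace smoothing over the vocabulary so unseen words get nonzero probability.
    return {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}

p_word_spam = word_probs(spam_docs)
p_word_ham = word_probs(ham_docs)
p_spam = len(spam_docs) / (len(spam_docs) + len(ham_docs))

def is_spam(words):
    log_s = math.log(p_spam)
    log_h = math.log(1 - p_spam)
    for w in words:
        if w in vocab:                      # ignore words never seen in training
            log_s += math.log(p_word_spam[w])
            log_h += math.log(p_word_ham[w])
    return log_s > log_h

print(is_spam(["cheap", "offer", "tomorrow"]))   # True for this toy training data
print(is_spam(["meeting", "notes"]))             # False
```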

Final Remarks
When to use Naive Bayes classifiers?
A moderate or large training set is available.
The attributes that describe the instances are (roughly) conditionally independent given the classification.
Successful applications: diagnosis, classifying text documents.