Bayes and Naïve Bayes Classifiers CS434


In this lecture: 1. Review some basic probability concepts. 2. Introduce a useful probabilistic rule: Bayes rule. 3. Introduce the learning algorithm based on Bayes rule (thus the name Bayes classifier) and its extension, Naïve Bayes.

Commonly used discrete distributions. Binary (Bernoulli) random variable: X ~ Bernoulli(p), with P(X = 1) = p, P(X = 0) = 1 − p, 0 ≤ p ≤ 1. Categorical distribution: X can take multiple values 1, …, k, with P(X = i) = p_i and p_1 + … + p_k = 1.

Learning the parameters of discrete distributions. Let X denote the event of getting a head when tossing a coin (X ∈ {head, tail}). Given a sequence of coin tosses, we estimate P(X = head) = (# of heads) / (# of tosses). Let Y denote the outcome of rolling a die (Y ∈ {1, …, 6}). Given a sequence of rolls, we estimate P(Y = i) = (# of rolls showing i) / (# of rolls). Estimation using such simple counting performs what we call Maximum Likelihood Estimation (MLE). There are other methods as well.
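As a concrete illustration of MLE by counting, here is a minimal Python sketch; the coin-toss and die-roll data are made up for the example:

```python
from collections import Counter

def mle_estimate(observations):
    """MLE of a discrete distribution:
    P(X = v) = (# of observations equal to v) / (total # of observations)."""
    counts = Counter(observations)
    total = len(observations)
    return {value: count / total for value, count in counts.items()}

# Coin tosses: estimate P(X = "head")
tosses = ["head", "tail", "head", "head", "tail"]
print(mle_estimate(tosses))   # {'head': 0.6, 'tail': 0.4}

# Die rolls: estimate P(Y = i) for each observed face i
rolls = [1, 3, 3, 6, 2, 3, 5, 1]
print(mle_estimate(rolls))
```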

Continuous Random Variables. A continuous random variable can take any value in an interval on the real line. It usually corresponds to some real-valued measurement, e.g., today's lowest temperature. The probability of a continuous random variable taking an exact value is typically zero: P(X = 56.2) = 0. It is more meaningful to measure the probability of the random variable taking a value within an interval, e.g., [50, 60]. This is captured in the probability density function.

PDF: probability density function. The probability of X taking a value in a given range [a, b] is defined to be the area under the PDF curve between a and b: P(a ≤ X ≤ b) = ∫_a^b f(x) dx. We often use f(x) to denote the PDF of X. Note: f(x) ≥ 0, but f(x) can be larger than 1.

What is the intuitive meaning of f(x)? If f(a) = c · f(b), then when x is sampled from f, you are c times more likely to see a value near a than a value near b. In the example curve, f(15) ≈ 1.4 · f(20), so there is a higher probability (approximately 40% more likely) of seeing an x near 15 than an x near 20.

Commonly Used Continuous Distributions. Uniform: e.g., suppose we know that in the last 10 minutes, one customer arrived at the OSU Federal counter #1. The actual time of arrival X can be modeled by a uniform distribution over the interval (0, 10). Gaussian (Normal): e.g., the body temperature of a person, the average IQ of a class of 1st graders. Exponential: e.g., the time between two successive forest fires.
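To connect these three distributions to code, here is a small sketch using scipy.stats (assuming SciPy is available); the parameter values are purely illustrative:

```python
from scipy import stats

# Uniform over (0, 10): arrival time of the single customer
arrival = stats.uniform(loc=0, scale=10)
print(arrival.pdf(5.0))                  # flat density: 1/10
print(arrival.cdf(6) - arrival.cdf(5))   # P(5 <= X <= 6) = 0.1

# Gaussian: e.g., body temperature (illustrative mean and std)
temp = stats.norm(loc=36.8, scale=0.4)
print(temp.pdf(37.0))

# Exponential: e.g., time between two successive forest fires
gap = stats.expon(scale=30.0)            # illustrative mean of 30
print(gap.cdf(10.0))                     # P(gap <= 10)
```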

Conditional Probability. P(A | B) = probability of A being true given that B is true. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2. Headaches are rare and flu is rarer, but if you're coming down with a flu there's a 50-50 chance you'll have a headache. (Copyright Andrew W. Moore)

Conditional Probability. H = Have a headache, F = Coming down with Flu. P(H | F) = (area of the H-and-F region) / (area of the F region) = P(H ∧ F) / P(F). P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2. (Copyright Andrew W. Moore)

Definition of Conditional Probability: P(A | B) = P(A ∧ B) / P(B). Corollary, the Chain Rule: P(A ∧ B) = P(A | B) P(B). (Copyright Andrew W. Moore)

Probabilistic Inference. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2. One day you wake up with a headache. You think: "Drat! 50% of flus are associated with headaches, so I must have a 50-50 chance of coming down with flu." Is this reasoning good? (Copyright Andrew W. Moore)

Probabilistic Inference. H = Have a headache, F = Coming down with Flu. P(H) = 1/10, P(F) = 1/40, P(H | F) = 1/2. P(F | H) = P(F ∧ H) / P(H) = P(H | F) P(F) / P(H) = (1/2 · 1/40) / (1/10) = 1/8. (Copyright Andrew W. Moore)

What we just did: P(B | A) = P(A ∧ B) / P(A) = P(A | B) P(B) / P(A). This is Bayes Rule. Bayes, Thomas (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418. (Copyright Andrew W. Moore)
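Plugging the headache/flu numbers from the slides into Bayes rule reproduces the inference above; a quick sketch:

```python
# Bayes rule: P(F | H) = P(H | F) * P(F) / P(H)
p_H = 1 / 10          # P(headache)
p_F = 1 / 40          # P(flu)
p_H_given_F = 1 / 2   # P(headache | flu)

p_F_given_H = p_H_given_F * p_F / p_H
print(p_F_given_H)    # 0.125, i.e. 1/8 -- far from the naive 50-50 guess
```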

Using Bayes Rule to Gamble. The Win envelope has a dollar and four beads in it. The Lose envelope has three beads and no money. Trivial question: someone draws an envelope at random and offers to sell it to you. How much should you pay in order to not lose money on average? (Copyright Andrew W. Moore)

Using Bayes Rule to Gamble. The Win envelope has a dollar and four beads in it; the Lose envelope has three beads and no money. Interesting question: before deciding, you are allowed to see one bead drawn from the envelope. Suppose it's black: how much should you pay? Suppose it's red: how much should you pay? (Copyright Andrew W. Moore)

Where are we? We have recalled the fundamentals of probability. We have discussed conditional probability and Bayes rule. Now we will move on to talk about the Bayes classifier.

Basic Idea. Each example is described by m input features x1, …, xm; for now we assume they are discrete variables. Each example belongs to one of c possible classes, denoted y ∈ {1, …, c}. If we don't know anything about the features, a randomly drawn example has a fixed probability P(y) of belonging to class y. If an example belongs to class y, its features x1, …, xm will follow some particular distribution P(x1, …, xm | y). Given an example (x1, …, xm), a Bayes classifier reasons about the value of y using Bayes rule: P(y | x1, …, xm) = P(x1, …, xm | y) P(y) / P(x1, …, xm).

Learning a Bayes Classifier. Given a set of training data {(x^i, y^i) : i = 1, …, n}, we learn: Class priors: P(y) for y = 1, …, c, estimated as the fraction of training examples belonging to class y. Class conditional distributions: P(x1, …, xm | y) for y = 1, …, c, estimated as (# of class-y examples that have feature values (x1, …, xm)) / (# of total class-y examples). Marginal distribution of x: P(x1, …, xm); no need to learn it, since it can be computed as Σ_y P(x1, …, xm | y) P(y).

Learning a joint distribution. Build a joint distribution (JD) table for your attributes in which the probabilities are unspecified:

A B C | Prob
0 0 0 | ?
0 0 1 | ?
0 1 0 | ?
0 1 1 | ?
1 0 0 | ?
1 0 1 | ?
1 1 0 | ?
1 1 1 | ?

Then fill in each row by counting: P̂(row) = (# of records matching row) / (total number of records):

A B C | Prob
0 0 0 | 0.30
0 0 1 | 0.05
0 1 0 | 0.10
0 1 1 | 0.05
1 0 0 | 0.05
1 0 1 | 0.10
1 1 0 | 0.25
1 1 1 | 0.10

For example, 0.25 is the fraction of all records in which A and B are true but C is false.
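A minimal sketch of filling in such a JD table from records by counting; the records here are made up for illustration:

```python
from collections import Counter
from itertools import product

# Hypothetical binary records over attributes (A, B, C)
records = [(0, 0, 0), (1, 1, 0), (0, 1, 1), (1, 1, 0), (0, 0, 0)]

counts = Counter(records)
n = len(records)

# P_hat(row) = (# of records matching row) / (total # of records)
joint = {row: counts[row] / n for row in product([0, 1], repeat=3)}
for row, p in joint.items():
    print(row, p)
```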

Example of Learning a Joint. This joint was obtained by learning from three attributes in the UCI Adult Census Database [Kohavi 1995]: 48842 examples in total, 12363 examples with this value combination. UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/datasets/adult

Bayes Classifiers in a nutshell. Learning: 1. Estimate P(y) as the fraction of class-y examples, for y = 1, …, c. 2. Learn the conditional joint distribution P(x1, …, xm | y) for y = 1, …, c. 3. To make a prediction for a new example (x1, …, xm): y_predict = argmax_y P(y | x1, …, xm) = argmax_y P(x1, …, xm | y) P(y) / P(x1, …, xm) = argmax_y P(x1, …, xm | y) P(y).
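Putting these steps together, here is a hedged sketch of a full-joint Bayes classifier: class priors and per-class joint tables are estimated by counting, and unseen value combinations simply get probability 0 (the overfitting problem discussed next). The toy data is made up:

```python
from collections import Counter, defaultdict

def fit_joint_bayes(X, y):
    """Estimate P(y) and P(x1,...,xm | y) by counting."""
    n = len(y)
    prior = {c: cnt / n for c, cnt in Counter(y).items()}
    class_count = Counter(y)
    cond = defaultdict(Counter)
    for xi, yi in zip(X, y):
        cond[yi][tuple(xi)] += 1
    cond = {c: {row: cnt / class_count[c] for row, cnt in rows.items()}
            for c, rows in cond.items()}
    return prior, cond

def predict_joint_bayes(prior, cond, x):
    """argmax_y P(x1,...,xm | y) P(y); unseen combinations score 0."""
    scores = {c: prior[c] * cond[c].get(tuple(x), 0.0) for c in prior}
    return max(scores, key=scores.get), scores

# Tiny made-up example with two binary features
X = [(1, 1), (1, 0), (0, 0), (0, 1)]
y = [1, 1, 0, 0]
prior, cond = fit_joint_bayes(X, y)
print(predict_joint_bayes(prior, cond, (1, 1)))   # predicts class 1
```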

Example: Spam Filtering. Use the bag-of-words representation. Assume a dictionary of words and tokens, and create one binary attribute xi for each dictionary entry, i.e., xi = 1 means the i-th word is used in the email. Consider a reasonable dictionary size of 5000, so we have 5000 binary attributes. How many parameters do we need to learn? We need to learn the class priors P(y = 1), P(y = 0): 1 free parameter. For each of the two classes, we need to learn a joint distribution table over 5000 binary variables: 2^5000 − 1 free parameters. Total: 2(2^5000 − 1) + 1. Clearly we don't have nearly enough data to estimate that many parameters.

The Joint Distribution Overfits. It is common to encounter the following situation: no training examples have the exact value combination (x1, x2, …, xm), so P(x1, x2, …, xm | y) = 0 for all values of y. How do we make predictions for such examples? To avoid overfitting, make some bold assumptions to simplify the joint distribution.

The Naïve Bayes Assumption. Assume that each feature is independent of any other features given the class label: P(x1, …, xm | y) = P(x1 | y) · … · P(xm | y) = ∏_{i=1}^{m} P(xi | y).

A note about independence. Assume A and B are two random events. Then A and B are independent if and only if P(A | B) = P(A).

Independence Theorems. Assume P(A | B) = P(A). Then P(A ∧ B) = P(A) P(B). Assume P(A | B) = P(A). Then P(B | A) = P(B).

Examples of independent events: two separate coin tosses. Consider the following four events: T: Toothache (I have a toothache); C: Catch (dentist's steel probe catches in my tooth); A: Cavity (I have a cavity); W: Weather (weather is good). P(T, C, A, W) = P(T, C, A) P(W).

Conditional Independence. P(A | B, C) = P(A | C): A and B are conditionally independent given C. Equivalently, P(B | A, C) = P(B | C). If A and B are conditionally independent given C, then we have P(A, B | C) = P(A | C) P(B | C).

Example of conditional independence. T: Toothache (I have a toothache); C: Catch (dentist's steel probe catches in my tooth); A: Cavity. T and C are conditionally independent given A: P(T, C | A) = P(T | A) P(C | A). So events that are not independent of each other might be conditionally independent given some fact. It can also happen the other way around: events that are independent might become conditionally dependent given some fact. B = Burglar in your house; A = Alarm (burglar alarm) rang in your house; E = Earthquake happened. B is independent of E (ignoring some minor possible connections between them). However, if we know A is true, then B and E are no longer independent. Why? P(B | A) >> P(B | A, E): knowing E is true makes it much less likely for B to be true.

Naïve Bayes Classifier. By assuming that each attribute is independent of any other attributes given the class label, we now have a Naïve Bayes classifier. Instead of learning a joint distribution of all features, we learn P(xi | y) separately for each feature, and we compute the joint by taking the product: P(x1, …, xm | y) = ∏_i P(xi | y).

Naïve Bayes Classifiers in a nutshell. Learning: 1. Learn P(xi | y) for each feature xi and each value of y. 2. Estimate P(y) as the fraction of records with class y. Prediction: 3. For a new example (x1, …, xm): y_predict = argmax_y P(y | x1, …, xm) = argmax_y P(x1, …, xm | y) P(y) = argmax_y P(y) ∏_{i=1}^{m} P(xi | y).

Example

X1 X2 X3 | Y
1  1  1  | 0
1  1  0  | 0
0  0  0  | 0
0  1  0  | 1
0  0  1  | 1
0  1  1  | 1

Apply Naïve Bayes, and make a prediction for (1, 0, 1)?
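A minimal sketch of the nutshell recipe applied to this table, predicting the label for (1, 0, 1); with no smoothing, the zero count for P(x1 = 1 | y = 1) zeroes out the class-1 score:

```python
data = [  # rows of (x1, x2, x3, y) from the table above
    (1, 1, 1, 0), (1, 1, 0, 0), (0, 0, 0, 0),
    (0, 1, 0, 1), (0, 0, 1, 1), (0, 1, 1, 1),
]

def nb_score(x, c):
    """P(y=c) * prod_i P(xi | y=c), all estimated by counting."""
    rows_c = [r for r in data if r[3] == c]
    score = len(rows_c) / len(data)                             # prior P(y=c)
    for i, v in enumerate(x):
        score *= sum(r[i] == v for r in rows_c) / len(rows_c)   # P(xi=v | y=c)
    return score

x_new = (1, 0, 1)
scores = {c: nb_score(x_new, c) for c in (0, 1)}
print(scores)                       # {0: ~0.037, 1: 0.0}
print(max(scores, key=scores.get))  # predicts y = 0
```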

Laplace Smoothing. With the Naïve Bayes assumption, we can still end up with zero probabilities. E.g., if we receive an email that contains a word that has never appeared in the training emails, then P(xi = 1 | y) = 0 for y = 0 and y = 1. As such, we ignore all the other words in the email because of this single rare word. Laplace smoothing can help: P(xi = 1 | y = 0) = (# of examples with xi = 1 and y = 0, plus 1) / (# of examples with y = 0, plus 2).
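A hedged sketch of the smoothed estimate for a binary feature (add 1 to the matching count and 2 to the class count, one per possible feature value):

```python
def smoothed_conditional(count_match, count_class, num_values=2):
    """Laplace-smoothed estimate of P(xi = v | y):
    (# class-y examples with xi = v, + 1) / (# class-y examples, + num_values)."""
    return (count_match + 1) / (count_class + num_values)

# A word never seen in any of 100 training spam emails:
print(smoothed_conditional(0, 100))   # 1/102: small, but no longer zero
```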

Final Notes about (Naïve) Bayes Classifiers. Any density estimator can be plugged in to estimate P(x1, x2, …, xm | y) for Bayes, or P(xi | y) for Naïve Bayes. Real-valued attributes can be discretized or directly modeled using simple continuous distributions such as the Gaussian (Normal) distribution. Naïve Bayes is wonderfully cheap and survives tens of thousands of attributes easily.
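For real-valued attributes, one option mentioned above is to model each P(xi | y) with a Gaussian. Here is a minimal sketch with a single made-up feature; the per-class means and standard deviations are estimated from the data:

```python
import math
from statistics import mean, pstdev

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Made-up one-feature data: heights grouped by class label
heights = {0: [150.0, 155.0, 160.0], 1: [175.0, 180.0, 185.0]}
prior = {c: len(v) / 6 for c, v in heights.items()}

def predict(x):
    # argmax_y P(y) * N(x; mu_y, sigma_y)
    scores = {c: prior[c] * gaussian_pdf(x, mean(v), pstdev(v))
              for c, v in heights.items()}
    return max(scores, key=scores.get)

print(predict(172.0))   # predicts class 1
```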