Lecture 8: Maximum Likelihood Estimation (MLE) (cont'd.), Maximum a posteriori (MAP) estimation, Naïve Bayes Classifier

Aykut Erdem, March 2016, Hacettepe University

Last time: Flipping a Coin. I have a coin; if I flip it, what's the probability that it will land heads up? Let us flip it a few times to estimate the probability. With 5 flips giving 3 heads, the estimated probability is 3/5, the frequency of heads. (slide by Barnabás Póczos & Alex Smola)

Last time: Flipping a Coin. The estimated probability is 3/5, the frequency of heads. Questions: (1) Why the frequency of heads? (2) How good is this estimate? (3) Why is this a machine learning problem? We are going to answer these questions. (slide by Barnabás Póczos & Alex Smola)

Question (1): Why the frequency of heads? The frequency of heads is exactly the maximum likelihood estimator for this problem. MLE has nice properties: it is interpretable, it comes with statistical guarantees, and it is simple. (slide by Barnabás Póczos & Alex Smola)

MLE for Bernoulli distribution. Data: D = {x_1, ..., x_n}, a sequence of coin flips with P(Heads) = θ and P(Tails) = 1 − θ. The flips are i.i.d.: independent events, identically distributed according to a Bernoulli distribution. MLE: choose the θ that maximizes the probability of the observed data, θ_MLE = argmax_θ P(D | θ). (slide by Barnabás Póczos & Alex Smola)

Maximum Likelihood Estimation. MLE: choose the θ that maximizes the probability of the observed data. With independent, identically distributed draws, P(D | θ) = ∏_{i=1}^n P(x_i | θ) = θ^{α_H} (1 − θ)^{α_T}, where α_H and α_T are the numbers of heads and tails in D. (slide by Barnabás Póczos & Alex Smola)

Maximum Likelihood Estimation. MLE: choose the θ that maximizes the probability of the observed data. Setting the derivative of the log-likelihood to zero, d/dθ [α_H log θ + α_T log(1 − θ)] = 0, gives θ_MLE = α_H / (α_H + α_T). That's exactly the frequency of heads! (slide by Barnabás Póczos & Alex Smola)
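To make the estimator concrete, here is a minimal sketch (not from the slides; the flip sequence and variable names are illustrative) that computes the Bernoulli MLE from observed coin flips:

```python
# Minimal sketch: Bernoulli MLE for a coin from observed flips.
flips = ["H", "T", "H", "H", "T"]            # the 5 flips from the running example

alpha_h = sum(1 for f in flips if f == "H")  # number of heads
alpha_t = len(flips) - alpha_h               # number of tails

theta_mle = alpha_h / (alpha_h + alpha_t)    # frequency of heads
print(theta_mle)                             # 0.6, i.e. 3/5
```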

Question (2): How good is this MLE estimate? (slide by Barnabás Póczos & Alex Smola)

How many flips do I need? I flipped the coin 5 times: 3 heads, 2 tails. What if I had flipped 30 heads and 20 tails? Which estimate should we trust more? The more the merrier? (slide by Barnabás Póczos & Alex Smola)

Simple bound. Let θ* be the true parameter. For n = α_H + α_T flips and the estimate θ_MLE = α_H / n, Hoeffding's inequality gives, for any ε > 0: P(|θ_MLE − θ*| ≥ ε) ≤ 2 exp(−2nε²). (slide by Barnabás Póczos & Alex Smola)

Probably Approximately Correct (PAC) Learning. I want to know the coin parameter θ within ε = 0.1 error, with probability at least 1 − δ = 0.95. How many flips do I need? Sample complexity: from the Hoeffding bound, 2 exp(−2nε²) ≤ δ, i.e., n ≥ ln(2/δ) / (2ε²). (slide by Barnabás Póczos & Alex Smola)
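As a sketch of how the sample-complexity formula is used (assuming the Hoeffding bound 2·exp(−2nε²) ≤ δ stated above), the snippet below solves for the smallest n; the numbers plugged in are the ε = 0.1 and δ = 0.05 from the slide:

```python
import math

def flips_needed(eps, delta):
    """Smallest n with 2*exp(-2*n*eps**2) <= delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

print(flips_needed(eps=0.1, delta=0.05))  # 185 flips suffice under this bound
```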

Question (3): Why is this a machine learning problem? Because we improve our performance (the accuracy of the predicted probability) at some task (predicting the probability of heads) with experience (the more coin flips we see, the better we are). (slide by Barnabás Póczos & Alex Smola)

What about continuous features? Let us try Gaussians. [Figure: Gaussian densities with mean µ = 0 and different variances σ².] (slide by Barnabás Póczos & Alex Smola)

MLE for Gaussian mean and variance. Choose θ = (µ, σ²) that maximizes the probability of the observed data. With independent, identically distributed draws, P(D | µ, σ) = ∏_{i=1}^n (1 / (σ√(2π))) exp(−(x_i − µ)² / (2σ²)). (slide by Barnabás Póczos & Alex Smola)

MLE for Gaussian mean and variance: µ_MLE = (1/n) Σ_{i=1}^n x_i and σ²_MLE = (1/n) Σ_{i=1}^n (x_i − µ_MLE)². Note: the MLE for the variance of a Gaussian is biased (the expected result of the estimation is not the true parameter!). Unbiased variance estimator: σ²_unbiased = (1/(n − 1)) Σ_{i=1}^n (x_i − µ_MLE)². (slide by Barnabás Póczos & Alex Smola)
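The following short sketch (illustrative data, not from the slides) computes the Gaussian MLE and contrasts the biased MLE variance with the unbiased estimator:

```python
import numpy as np

x = np.array([3.2, 4.7, 5.1, 6.0, 7.8, 8.3, 9.1])        # hypothetical observations

mu_mle = x.mean()                                         # MLE of the mean
var_mle = ((x - mu_mle) ** 2).mean()                      # MLE variance (divides by n, biased)
var_unbiased = ((x - mu_mle) ** 2).sum() / (len(x) - 1)   # divides by n-1, unbiased

print(mu_mle, var_mle, var_unbiased)
```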

What about prior knowledge? (MAP Estimation)

What about prior knowledge? We know the coin is close to 50-50. What can we do now? The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ. [Figure: the distribution over θ before seeing data (centered at 50-50) and after seeing data.]

Prior distribution. What prior? What distribution do we want for a prior? It should represent expert knowledge (the philosophical approach) or give a simple posterior form (the engineer's approach). Uninformative priors: the uniform distribution. Conjugate priors: a closed-form representation of the posterior, where P(θ) and P(θ | D) have the same form.

In order to proceed we will need: Bayes Rule.

Chain Rule & Bayes Rule. Chain rule: P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y). Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X). Bayes rule is important for reverse conditioning.

Bayesian Learning. Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D), or equivalently P(θ | D) ∝ P(D | θ) P(θ): posterior ∝ likelihood × prior.

MAP estimation for the Binomial distribution. Coin flip problem: the likelihood is Binomial, P(D | θ) ∝ θ^{α_H} (1 − θ)^{α_T}. If the prior is a Beta distribution, P(θ) ∝ θ^{β_H − 1} (1 − θ)^{β_T − 1}, then the posterior is also a Beta distribution: P(θ) and P(θ | D) have the same form! [Conjugate prior]

Beta distribution: Beta(θ; α, β) ∝ θ^{α − 1} (1 − θ)^{β − 1}. It becomes more concentrated as the values of α and β increase. [Figure: Beta densities for several (α, β) settings.]

Beta conjugate prior: with a Beta(β_H, β_T) prior and α_H observed heads, α_T observed tails, the posterior is Beta(β_H + α_H, β_T + α_T). As n = α_H + α_T increases, i.e., as we get more samples, the effect of the prior is washed out.
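The sketch below (hypothetical prior strength, standard Beta-Binomial conjugacy) forms the posterior Beta(β_H + α_H, β_T + α_T), takes the MAP estimate as the posterior mode, and shows the prior being washed out as the sample grows:

```python
# Beta(beta_h, beta_t) prior on theta; observe alpha_h heads and alpha_t tails.
beta_h, beta_t = 50.0, 50.0          # strong prior belief that the coin is near 50-50

for alpha_h, alpha_t in [(3, 2), (30, 20), (3000, 2000)]:
    post_h = beta_h + alpha_h        # posterior is Beta(post_h, post_t)  [conjugacy]
    post_t = beta_t + alpha_t
    theta_map = (post_h - 1) / (post_h + post_t - 2)     # mode of the Beta posterior
    theta_mle = alpha_h / (alpha_h + alpha_t)
    print(alpha_h + alpha_t, round(theta_map, 3), round(theta_mle, 3))
# As n grows, the MAP estimate approaches the MLE: the prior is washed out.
```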

Han Solo and Bayesian Priors. C-3PO: "Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1!" Han: "Never tell me the odds!" https://www.countbayesie.com/blog/2015/2/18/hans-solo-and-bayesian-priors

MLE vs. MAP. Maximum likelihood estimation (MLE): choose the value that maximizes the probability of the observed data, θ_MLE = argmax_θ P(D | θ). Maximum a posteriori (MAP) estimation: choose the value that is most probable given the observed data and prior belief, θ_MAP = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ). When is MAP the same as MLE?

From Binomial to Multinomial. Example: the dice roll problem (6 outcomes instead of 2). The likelihood is Multinomial(θ = {θ_1, θ_2, ..., θ_k}). If the prior is a Dirichlet distribution, then the posterior is also a Dirichlet distribution: for the Multinomial, the conjugate prior is the Dirichlet distribution. http://en.wikipedia.org/wiki/dirichlet_distribution

Bayesians vs. Frequentists. "You are no good when the sample is small." "You give a different answer for different priors."

Recap: What about prior knowledge? (MAP Estimation)

Recap: What about prior knowledge? We know the coin is close to 50-50. What can we do now? The Bayesian way: rather than estimating a single θ, we obtain a distribution over possible values of θ. [Figure: the distribution over θ before seeing data (centered at 50-50) and after seeing data.]

Recap: Chain Rule & Bayes Rule. Chain rule: P(X, Y) = P(X) P(Y | X) = P(Y) P(X | Y). Bayes rule: P(Y | X) = P(X | Y) P(Y) / P(X).

Recap: Bayesian Learning. D is the measured data and our goal is to estimate the parameter θ. Use Bayes rule: P(θ | D) = P(D | θ) P(θ) / P(D), or equivalently P(θ | D) ∝ P(D | θ) P(θ): posterior ∝ likelihood × prior.

Recap: MAP estimation for the Binomial distribution. In the coin flip problem the likelihood is Binomial, P(D | θ) ∝ θ^{α_H} (1 − θ)^{α_T}. If the prior is Beta, then the posterior is also a Beta distribution.

Recap: Beta conjugate prior. As n = α_H + α_T increases, i.e., as we get more samples, the effect of the prior is washed out.

Application of Bayes Rule

AIDS test (Bayes rule). Data: approximately 0.1% of people are infected; the test detects all infections; the test reports positive for 1% of healthy people. Probability of having AIDS if the test is positive: P(infected | positive) = (1 × 0.001) / (1 × 0.001 + 0.01 × 0.999) ≈ 0.09. Only 9%!

Improving the diagnosis: use a weaker follow-up test! Approximately 0.1% are infected; test 2 reports positive for 90% of infections and for 5% of healthy people. The probability of infection given that both tests come back positive rises to about 64%.

AIDS test (Bayes rule). Why can't we just run test 1 twice? Its outcomes are not independent: repeating the same test adds no new information. Tests 1 and 2, however, are conditionally independent given the infection status (by assumption): P(T_1, T_2 | Y) = P(T_1 | Y) P(T_2 | Y).
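The two-test calculation can be reproduced with Bayes rule directly; the sketch below uses the numbers from the slides (0.1% prevalence; test 1: detects all infections, 1% false positives; test 2: 90% sensitivity, 5% false positives) and assumes the tests are conditionally independent given the infection status:

```python
p_inf = 0.001                      # prior: 0.1% of the population is infected

# Test 1: detects all infections, reports positive for 1% of healthy people.
p_pos1_inf, p_pos1_healthy = 1.0, 0.01
post1 = (p_pos1_inf * p_inf) / (p_pos1_inf * p_inf + p_pos1_healthy * (1 - p_inf))
print(round(post1, 3))             # ~0.091: only 9% despite a positive test

# Test 2 (weaker follow-up): 90% sensitivity, 5% false positives.
# Conditional independence lets us use post1 as the new prior.
p_pos2_inf, p_pos2_healthy = 0.9, 0.05
post2 = (p_pos2_inf * post1) / (p_pos2_inf * post1 + p_pos2_healthy * (1 - post1))
print(round(post2, 2))             # ~0.64 once both tests come back positive
```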

The Naïve Bayes Classifier

Data for spam filtering: date, time, recipient, path, IP number, sender, encoding, and many more features. [Example: the raw headers of a real email (Delivered-To, Received, Return-Path, Received-SPF, Authentication-Results, DKIM-Signature, MIME-Version, Sender, Date, Subject, ...), showing how many features a single message provides.]

Naïve Bayes Assumption. Features X_1 and X_2 are conditionally independent given the class label Y: P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y). More generally: P(X_1, ..., X_d | Y) = ∏_{i=1}^d P(X_i | Y).

Naïve Bayes Assumption, Example. Task: predict whether or not a picnic spot is enjoyable. Training data: n rows of features X = (X_1, X_2, X_3, ..., X_d) together with a label Y. How many parameters must we estimate if X is composed of d binary features and Y has K possible class labels? Without the assumption, (2^d − 1)K; with the Naïve Bayes assumption, (2 − 1)dK = dK.

Naïve Bayes Classifier. Given: the class prior P(Y); d features X_1, ..., X_d that are conditionally independent given the class label Y; and, for each feature X_i, the conditional likelihood P(X_i | Y). Naïve Bayes decision rule: y* = argmax_y P(Y = y) ∏_{i=1}^d P(X_i = x_i | Y = y).

Naïve Bayes Algorithm for discrete features. Training data: n examples with d-dimensional discrete features and K class labels. We need to estimate the probabilities P(Y = y) and P(X_i = x | Y = y). Estimate them with MLE (relative frequencies)!

Naïve Bayes Algorithm for discrete features. Estimators: for the class prior, P(Y = y) is estimated as #{examples with Y = y} / n; for the likelihood, P(X_i = x | Y = y) is estimated as #{examples with X_i = x and Y = y} / #{examples with Y = y}. NB prediction for test data: y* = argmax_y P(Y = y) ∏_i P(X_i = x_i | Y = y).

Subtlety: insufficient training data. For example, if some feature value x never occurs together with class y in the training set, then the estimate of P(X_i = x | Y = y) is 0, and the whole product P(Y = y) ∏_i P(X_i = x_i | Y = y) becomes zero regardless of the other features. What now?

Naïve Bayes Algorithm, discrete features. Use your expert knowledge and apply prior distributions: add m "virtual" examples, which is the same as assuming conjugate priors. MAP estimate: P(X_i = x | Y = y) = (#{examples with X_i = x and Y = y} + m) / (#{examples with Y = y} + m · #{values of X_i}), where m is the number of virtual examples; with m = 1 this is called Laplace smoothing.
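Below is a compact sketch of the discrete Naïve Bayes training and prediction described above, with add-m smoothing (m = 1 gives Laplace smoothing); the toy picnic-style dataset and function names are illustrative:

```python
from collections import Counter, defaultdict

def train_nb(X, y, m=1):
    """Discrete NB with add-m smoothing. X: list of feature tuples, y: list of labels."""
    n, d = len(X), len(X[0])
    prior = {c: cnt / n for c, cnt in Counter(y).items()}          # P(Y = c)
    values = [set(row[i] for row in X) for i in range(d)]          # observed values per feature
    counts = defaultdict(Counter)                                  # (class, feature) -> value counts
    for row, c in zip(X, y):
        for i, v in enumerate(row):
            counts[(c, i)][v] += 1
    def likelihood(c, i, v):                                       # smoothed P(X_i = v | Y = c)
        return (counts[(c, i)][v] + m) / (sum(counts[(c, i)].values()) + m * len(values[i]))
    return prior, likelihood

def predict_nb(prior, likelihood, x):
    scores = dict(prior)
    for c in scores:
        for i, v in enumerate(x):
            scores[c] *= likelihood(c, i, v)
    return max(scores, key=scores.get)

# Toy data: (sunny?, warm?) -> enjoyable picnic?
X = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1)]
y = ["yes", "yes", "no", "no", "yes"]
prior, lik = train_nb(X, y)
print(predict_nb(prior, lik, (1, 0)))   # "yes"
```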

Case Study: Text Classification

Is this spam?

Positive or negative movie review? "unbelievably disappointing" / "Full of zany characters and richly applied satire, and some great plot twists" / "this is the greatest screwball comedy ever filmed" / "It was pathetic. The worst part about it was the boxing scenes." (slide by Dan Jurafsky)

What is the subject of this article? A MEDLINE article is assigned categories from the MeSH subject category hierarchy: Antagonists and Inhibitors, Blood Supply, Chemistry, Drug Therapy, Embryology, Epidemiology, ... (slide by Dan Jurafsky)

Text Classification: assigning subject categories, topics, or genres; spam detection; authorship identification; age/gender identification; language identification; sentiment analysis. (slide by Dan Jurafsky)

Text Classification: definition. Input: a document d and a fixed set of classes C = {c_1, c_2, ..., c_J}. Output: a predicted class c ∈ C. (slide by Dan Jurafsky)

Hand-coded rules: rules based on combinations of words or other features, e.g., spam: black-list-address OR ("dollars" AND "have been selected"). Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive. (slide by Dan Jurafsky)

Text Classification and Naive Bayes. Classify emails: Y = {Spam, NotSpam}. Classify news articles: Y = {topics of the article}. What are the features X? The text! Let X_i represent the i-th word in the document.

X_i represents the i-th word in the document.

NB for Text Classification. A problem: the support of P(X | Y) is huge! An article has at least 1000 words, X = {X_1, ..., X_1000}, and X_i represents the i-th word in the document, i.e., the domain of X_i is the entire vocabulary, e.g., Webster's Dictionary (or more): X_i ∈ {1, ..., 50,000}. That means K(50,000^1000 − 1) parameters to estimate without the NB assumption.

NB for Text Classification. X_i ∈ {1, ..., 50,000} ⇒ K(50,000^1000 − 1) parameters to estimate without the NB assumption. The NB assumption helps a lot! If P(X_i = x_i | Y = y) is the probability of observing word x_i at the i-th position in a document on topic y, then with the NB assumption there are 1000 · K(50,000 − 1) parameters to estimate. The NB assumption helps, but that is still a lot of parameters.

Bag of words model. Typical additional assumption: the position in the document doesn't matter, P(X_i = x_i | Y = y) = P(X_k = x_i | Y = y). In the bag of words model the order of words on the page is ignored; the document is just a bag of i.i.d. words. It sounds really silly, but it often works very well! This leaves K(50,000 − 1) parameters to estimate. The probability of a document with words x_1, x_2, ... is then P(x_1, x_2, ... | Y = y) = ∏_i P(x_i | Y = y).

The bag of words representation. The classifier γ maps an entire document to a class c, e.g., γ("I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.") = c. (slide by Dan Jurafsky)

The bag of words representation: using a subset of words. γ("x love xxxxxxxxxx sweet xxxxxxx satirical xxxxxxxxxx xxxxxxxxxx great xxxxxxx fun xxxx whimsical xxxx romantic xxxx laughing xxxxxxxxxx recommend xxxxx several xxxxxxxxxx happy xxxxxxxxx again xxxxxxxxxx") = c. (slide by Dan Jurafsky)

The bag of words representation as word counts: γ({great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, ...}) = c. (slide by Dan Jurafsky)
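A minimal illustration (the example document and the crude tokenization are simplifications) of turning a document into the bag-of-words counts shown above:

```python
from collections import Counter

doc = "I love this movie! It's sweet, with satirical humor. The dialogue is great and I love it; I would recommend it."
bag = Counter(w.strip(".,!;?").lower() for w in doc.split())
print(bag["love"], bag["great"], bag["recommend"])   # word counts; word order is ignored
```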

Worked example (slide by Dan Jurafsky). Smoothed estimates: P(c) = N_c / N and P(w | c) = (count(w, c) + 1) / (count(c) + |V|).

Training set:
  Doc 1: "Chinese Beijing Chinese" (class c)
  Doc 2: "Chinese Chinese Shanghai" (class c)
  Doc 3: "Chinese Macao" (class c)
  Doc 4: "Tokyo Japan Chinese" (class j)
Test:
  Doc 5: "Chinese Chinese Chinese Tokyo Japan" (class ?)

Priors: P(c) = 3/4, P(j) = 1/4.
Conditional probabilities: P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7; P(Tokyo | c) = (0+1)/(8+6) = 1/14; P(Japan | c) = (0+1)/(8+6) = 1/14; P(Chinese | j) = (1+1)/(3+6) = 2/9; P(Tokyo | j) = (1+1)/(3+6) = 2/9; P(Japan | j) = (1+1)/(3+6) = 2/9.
Choosing a class: P(c | d5) ∝ 3/4 × (3/7)^3 × 1/14 × 1/14 ≈ 0.0003, while P(j | d5) ∝ 1/4 × (2/9)^3 × 2/9 × 2/9 ≈ 0.0001.
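The numbers in the worked example can be reproduced with a short multinomial-NB script (a sketch; the +1 smoothing and class names c/j follow the slide):

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"), ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"), ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

docs_per_class = Counter(c for _, c in train)
word_counts = {c: Counter() for c in docs_per_class}
for text, c in train:
    word_counts[c].update(text.split())
vocab = {w for text, _ in train for w in text.split()}

def score(c):
    s = docs_per_class[c] / len(train)                  # prior P(c) = N_c / N
    total = sum(word_counts[c].values())                # count(c)
    for w in test.split():                              # P(w|c) = (count(w,c)+1) / (count(c)+|V|)
        s *= (word_counts[c][w] + 1) / (total + len(vocab))
    return s

print(score("c"), score("j"))   # ~0.0003 vs ~0.0001, so class c is chosen
```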

What if features are continuous? E.g., character recognition: X_i is the intensity at the i-th pixel. Gaussian Naïve Bayes (GNB): P(X_i = x | Y = y_k) = N(x; µ_ik, σ²_ik), with a different mean and variance for each class k and each pixel i. Sometimes we assume the variance is independent of Y (i.e., σ_i), or independent of X_i (i.e., σ_k), or both (i.e., σ).

Estimating parameters: Y discrete, X_i continuous

Estimating parameters: Y discrete, X_i continuous. Maximum likelihood estimates: µ_ik = (1 / #{j : y_j = k}) Σ_{j : y_j = k} x_ij and σ²_ik = (1 / #{j : y_j = k}) Σ_{j : y_j = k} (x_ij − µ_ik)², where x_ij is the i-th pixel in the j-th training image and the sums run over the training images of class k.
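A sketch of Gaussian Naïve Bayes along the lines described above, with a per-class, per-feature mean and variance (the toy data and function names are illustrative):

```python
import numpy as np

def fit_gnb(X, y):
    """Per-class, per-feature Gaussian MLE: mean mu[k, i] and variance var[k, i]."""
    classes = np.unique(y)
    prior = np.array([(y == k).mean() for k in classes])
    mu = np.array([X[y == k].mean(axis=0) for k in classes])
    var = np.array([X[y == k].var(axis=0) for k in classes])     # MLE (biased) variance
    return classes, prior, mu, var

def predict_gnb(x, classes, prior, mu, var):
    # log P(y) + sum_i log N(x_i; mu_ki, var_ki), then argmax over classes
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return classes[np.argmax(np.log(prior) + log_lik)]

X = np.array([[1.0, 2.1], [0.9, 1.8], [3.2, 0.5], [3.0, 0.7]])
y = np.array([0, 0, 1, 1])
classes, prior, mu, var = fit_gnb(X, y)
print(predict_gnb(np.array([1.1, 2.0]), classes, prior, mu, var))   # predicts class 0
```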

Twenty Newsgroups results: Naïve Bayes achieves 89% accuracy.

Case Study: Classifying Mental States

Example: GNB for classifying mental states from brain images: ~1 mm resolution, ~2 images per second, 15,000 voxels per image; non-invasive and safe; measures the Blood Oxygen Level Dependent (BOLD) response. [Mitchell et al.]

Brain scans can track activation with precision and sensitivity

Learned Naïve Bayes models: means of P(BrainActivity | WordCategory), shown for "tool" words and "building" words. Pairwise classification accuracy: 78-99% across 12 participants. [Mitchell et al.]

What you should know. The Naïve Bayes classifier: what the assumption is, why we use it, and how we learn it. Why Bayesian (MAP) estimation is important. Text classification and the bag of words model. Gaussian NB: features are still conditionally independent, and each feature has a Gaussian distribution given the class.