Lecture 2: Probability, Naive Bayes



Lecture 2: Probability, Naive Bayes. CS 585, Fall 2015: Introduction to Natural Language Processing. http://people.cs.umass.edu/~brenocon/inlp2015/ Brendan O'Connor

Today: Probability review, Naive Bayes classification, Python demo

Probability Theory Review

Probabilities sum to one: 1 = Σ_a P(A = a)
Conditional probability: P(A | B) = P(A, B) / P(B)
Chain rule: P(A, B) = P(A | B) P(B)
Law of total probability: P(A) = Σ_b P(A, B = b) = Σ_b P(A | B = b) P(B = b)
Disjunction (union): P(A ∨ B) = P(A) + P(B) − P(A, B)
Negation (complement): P(¬A) = 1 − P(A)
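
As a quick sanity check (not part of the original slides), here is a minimal Python sketch that verifies these identities numerically on a made-up joint distribution over two binary variables:

```python
from itertools import product

# A toy joint distribution P(A, B) over A in {0, 1} and B in {0, 1}.
# These numbers are made up for illustration; they just need to sum to 1.
joint = {(0, 0): 0.10, (0, 1): 0.30,
         (1, 0): 0.20, (1, 1): 0.40}

def p_a(a):                      # marginal: P(A=a) = sum_b P(A=a, B=b)
    return sum(joint[a, b] for b in (0, 1))

def p_b(b):                      # marginal: P(B=b)
    return sum(joint[a, b] for a in (0, 1))

def p_a_given_b(a, b):           # conditional: P(A=a | B=b) = P(A=a, B=b) / P(B=b)
    return joint[a, b] / p_b(b)

# Probabilities sum to one.
assert abs(sum(joint.values()) - 1.0) < 1e-12

# Chain rule: P(A, B) = P(A | B) P(B).
for a, b in product((0, 1), (0, 1)):
    assert abs(joint[a, b] - p_a_given_b(a, b) * p_b(b)) < 1e-12

# Law of total probability: P(A=a) = sum_b P(A=a | B=b) P(B=b).
for a in (0, 1):
    assert abs(p_a(a) - sum(p_a_given_b(a, b) * p_b(b) for b in (0, 1))) < 1e-12

# Negation (complement): P(A=1) = 1 - P(A=0).
assert abs((1 - p_a(0)) - p_a(1)) < 1e-12
```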

Bayes Rule

H: unknown hypothesis. D: observed data / evidence.
The model P(D | H) describes how the hypothesis causes the data; Bayesian inference gives P(H | D).
Bayes Rule tells you how to flip the conditional. It is useful if you assume a generative process for your data.
P(H | D) = P(D | H) P(H) / P(D)   (posterior = likelihood × prior / normalizer)
Rev. Thomas Bayes, c. 1701-1761

Bayes Rule and its pesky denominator

P(h | d) = P(d | h) P(h) / P(d) = P(d | h) P(h) / Σ_h' P(d | h') P(h')
The denominator is constant w.r.t. h, so: P(h | d) ∝ P(d | h) P(h)   (likelihood × prior).
"Proportional to" is implicitly for varying h. This notation is very common, though slightly ambiguous. The right-hand side is the unnormalized posterior: by itself it does not sum to 1!

Bayes Rule for classification

Assume a generative process P(w | y) from doc label y to words w. Inference problem: given w, what is y?
Authorship problem: classify a new text. Is it y=Anna or y=Barry? Observe w: look at a random word in the new text. It is "abracadabra". What is P(y=Anna | w=abracadabra)?
P(y | w) = P(w | y) P(y) / P(w)
P(y): assume 50% prior prob. P(w | y): calculate from previous data:

                abracadabra          gesundheit
Anna            5 per 1000 words     6 per 1000 words
Barry           10 per 1000 words    1 per 1000 words

Table 1: Word frequencies for the wizards Anna and Barry.
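
A minimal Python sketch of this inference (not from the slides; it assumes the per-1000-word rates in the table above as the likelihoods and the 50% prior):

```python
# Likelihoods P(w = "abracadabra" | y), taken as rates per word from the table above.
likelihood = {"Anna": 5 / 1000, "Barry": 10 / 1000}
prior = {"Anna": 0.5, "Barry": 0.5}          # 50% prior probability on each author

# Unnormalized posterior: P(w | y) * P(y) for each hypothesis y.
unnorm = {y: likelihood[y] * prior[y] for y in prior}

# Normalizer P(w) = sum over hypotheses (law of total probability).
Z = sum(unnorm.values())

posterior = {y: unnorm[y] / Z for y in unnorm}
print(posterior)   # {'Anna': 0.333..., 'Barry': 0.666...}
```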

Bayes Rule as hypothesis vector scoring

Hypotheses a, b, c:                                                    Sums to 1?
Prior                   P(H = h)                  [0.2, 0.2, 0.6]         Yes
Likelihood (multiply)   P(E | H = h)              [0.2, 0.05, 0.05]       No
Unnorm. posterior       P(E | H = h) P(H = h)     [0.04, 0.01, 0.03]      No
Posterior (normalize)   (1/Z) P(E | H = h) P(H = h)  [0.500, 0.125, 0.375]  Yes
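
A short Python sketch of the same vector scoring (not from the slides; it just redoes the arithmetic above with plain lists):

```python
# Hypotheses a, b, c.
prior      = [0.2, 0.2, 0.6]      # P(H = h), sums to 1
likelihood = [0.2, 0.05, 0.05]    # P(E | H = h), need not sum to 1

# Elementwise product: unnormalized posterior P(E | H = h) P(H = h).
unnorm = [p * l for p, l in zip(prior, likelihood)]   # [0.04, 0.01, 0.03]

# Normalize by Z = P(E) so the posterior sums to 1.
Z = sum(unnorm)
posterior = [u / Z for u in unnorm]                   # [0.5, 0.125, 0.375]
```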

Text Classification with Naive Bayes

Classification problems

Given text d, we want to predict label y. Is this restaurant review positive or negative? Is this email spam or not? Which author wrote this text? (Is this word a noun or a verb?)
d: documents, sentences, etc. y: discrete/categorical variable.
Goal: from a training set of (d, y) pairs, learn a probabilistic classifier f(d) = P(y | d) ("supervised learning").

Features for model: Bag-of-words

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

[Figure: the review above reduced to an unordered bag of words with counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1.]

Figure 6.1: Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag-of-words assumption) and we make use of the frequency of each word.
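
A bag-of-words feature vector is essentially a word-count table; here is a minimal sketch (not from the lecture) of building one with Python's standard collections.Counter:

```python
from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun... "
          "I would recommend it to just about anyone.")

# Lowercase and split into word tokens; position information is discarded.
tokens = re.findall(r"\w+", review.lower())
bag_of_words = Counter(tokens)

print(bag_of_words.most_common(5))   # e.g. [('i', 2), ('the', 2), ('it', 2), ...]
```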

Levels of linguistic structure

[Figure, from top to bottom:]
Discourse
Semantics: CommunicationEvent(e), Agent(e, Alice), Recipient(e, Bob), SpeakerContext(s), TemporalBefore(e, s)
Syntax: parse tree S over NP, VP, PP, and punctuation, with preterminals NounPrp VerbPast Prep NounPrp Punct
Words: Alice talked to Bob.
Morphology: talk -ed
Characters: Alice talked to Bob.

Levels of linguistic structure

Words are fundamental units of meaning and easily identifiable* (*in some languages).
Words: Alice talked to Bob.
Characters: Alice talked to Bob.

How to classify with words?

Approach #1: use a predefined dictionary (or make one up) [human knowledge]. E.g. for sentiment: score += 1 for each "happy", "awesome", "cool"; score -= 1 for each "sad", "awful", "bad". (See the sketch below.)
Approach #2: use labeled documents [supervised learning]. Learn which words correlate with positive vs. negative documents, and use these correlations to classify new documents.
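
A minimal sketch of approach #1 (not from the lecture; the positive/negative word lists are just the examples from the slide, not a real sentiment lexicon):

```python
import re

# Tiny made-up sentiment dictionary, using only the example words from the slide.
POSITIVE = {"happy", "awesome", "cool"}
NEGATIVE = {"sad", "awful", "bad"}

def dictionary_sentiment(text):
    """Score += 1 per positive word, -= 1 per negative word."""
    score = 0
    for token in re.findall(r"\w+", text.lower()):
        if token in POSITIVE:
            score += 1
        elif token in NEGATIVE:
            score -= 1
    return score

print(dictionary_sentiment("An awesome, cool movie, never sad"))   # +1
```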

Supervised learning

[Diagram: many labeled examples (d=..., y=b), (d=..., y=b), (d=..., y=a) → Training → model parameters → classify new texts (d=..., y=?) as y=a or y=b]

Supervised learning: Generative model

[Diagram: many labeled examples → Training → model parameters P(y) and P(d | y) → classify new texts (d=..., y=?) via P(y | d)]

Multinomial Naive Bayes

P(y | w_1..w_T) ∝ P(y) P(w_1..w_T | y) = P(y) Π_t P(w_t | y)
(w_1..w_T are the tokens in the doc; the factorization is the Naive Bayes assumption: conditional independence of the tokens given the class.)

Parameters: P(w | y) for each document category y and wordtype w, and P(y), the prior distribution over document categories y.

Learning: estimate parameters as (smoothed) frequency ratios, e.g. with add-1 smoothing:
P(w | y) = (#(w occurrences in docs with label y) + 1) / (#(tokens total across docs with label y) + V)
where V is the vocabulary size.

Predictions: predict the class arg max_y P(Y = y | w_1..w_T), or predict the probability of each class.
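
To make the learning and prediction rules concrete, here is a minimal multinomial Naive Bayes sketch (not the course's Python demo; the function names and the tiny training set are made up for illustration, and it uses the add-1 smoothing estimate above):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns priors, per-class word counts, totals, vocab."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)   # word_counts[y][w] = #(w in docs with label y)
    vocab = set()
    for tokens, y in docs:
        label_counts[y] += 1
        word_counts[y].update(tokens)
        vocab.update(tokens)
    priors = {y: c / len(docs) for y, c in label_counts.items()}       # P(y)
    totals = {y: sum(word_counts[y].values()) for y in label_counts}   # tokens per class
    return priors, word_counts, totals, vocab

def predict_nb(tokens, priors, word_counts, totals, vocab):
    """Return arg max_y of log P(y) + sum_t log P(w_t | y), with add-1 smoothing."""
    V = len(vocab)
    scores = {}
    for y, prior in priors.items():
        score = math.log(prior)
        for w in tokens:
            # P(w | y) = (count(w, y) + 1) / (total tokens in class y + V)
            score += math.log((word_counts[y][w] + 1) / (totals[y] + V))
        scores[y] = score
    return max(scores, key=scores.get)

# Tiny made-up example.
train = [("great fun great".split(), "pos"),
         ("awful boring".split(), "neg")]
model = train_nb(train)
print(predict_nb("great movie".split(), *model))   # 'pos'
```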