CS4705. Probability Review and Naïve Bayes. Slides from Dragomir Radev


Classification using a Generative Approach Previously on NLP: discriminative models P(C | D): here is a line with all the social media posts on one side and the scientific articles on the other side; which side is this example on? Now: generative models P(C, D): here are some characteristics of social media posts, and here are some characteristics of scientific articles; which is this example more like?

Classification using a Generative Approach We'll look in detail at the Naïve Bayes classifier and Maximum Likelihood Estimation. But we need some background in probability first.

Probabilities in NLP Speech recognition: "recognize speech" vs. "wreck a nice beach" Machine translation: "l'avocat général": "the attorney general" vs. "the general avocado" Information retrieval: If a document includes three occurrences of "stir" and one of "rice", what is the probability that it is a recipe? Probabilities make it possible to combine evidence from multiple sources systematically

Probability Theory Random experiment (trial): an experiment with uncertain outcome, e.g., flipping a coin, picking a word from text Sample space Ω: the set of all possible outcomes for an experiment, e.g., flipping 2 fair coins, Ω = {HH, HT, TH, TT} Event: a subset of the sample space, E ⊆ Ω E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face)

Events Probability of an event E: 0 ≤ P(E) ≤ 1, s.t. P(A ∪ B) = P(A) + P(B) if A ∩ B = ∅, e.g., A = same face, B = different face ∅ is the impossible event (empty set): P(∅) = 0 Ω is the certain event (entire sample space): P(Ω) = 1

Example: Roll a Die Sample space: Ω = {1, 2, 3, 4, 5, 6} Fair die: P(1) = P(2) = ... = P(6) = 1/6 Unfair die: P(1) = 0.3, P(2) = 0.2, ... N-dimensional die: Ω = {1, 2, 3, 4, ..., N} Example in modeling text: Roll a die to decide which word to write in the next position, Ω = {cat, dog, tiger, ...}

Example: Flip a Coin Sample space: Ω = {Heads, Tails} Fair coin: P(H) = 0.5, P(T) = 0.5 Unfair coin: P(H) = 0.3, P(T) = 0.7 Flipping three coins: Ω = {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} Example in modeling text: Flip a coin to decide whether or not to include a word in a document Sample space = {appear, absence}

Probabilities Probability distribution: a function that distributes a probability mass of 1 throughout the sample space 0 ≤ P(ω) ≤ 1 for each outcome ω, and Σ_{ω∈Ω} P(ω) = 1 Probability of an event E: P(E) = Σ_{ω∈E} P(ω) Example: a fair coin is flipped three times What is the probability of 3 heads? What is the probability of 2 heads?
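
The coin-flip question can be checked by enumerating the eight equally likely outcomes and summing the mass of the event of interest. A minimal Python sketch of that calculation (my own illustration, not from the slides):

    from itertools import product

    # Sample space for three fair coin flips: 8 equally likely outcomes.
    omega = list(product("HT", repeat=3))
    p = {outcome: 1 / len(omega) for outcome in omega}  # uniform distribution

    def prob(event):
        """P(E) = sum of P(omega) over the outcomes in the event E."""
        return sum(p[o] for o in omega if event(o))

    print(prob(lambda o: o.count("H") == 3))  # 1/8 = 0.125
    print(prob(lambda o: o.count("H") == 2))  # 3/8 = 0.375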

Probabilities Joint probability: P(A ∩ B), also written as P(A, B) Conditional probability: P(A|B) = P(A ∩ B) / P(B) [Venn diagram: events A and B with their overlap A ∩ B]

Conditional Probability P(B|A) = P(A ∩ B) / P(A) P(A ∩ B) = P(A)P(B|A) = P(B)P(A|B) So, P(A|B) = P(B|A)P(A) / P(B) (Bayes' Rule) For independent events, P(A ∩ B) = P(A)P(B), so P(A|B) = P(A)

Conditional Probability Six-sided fair die P(D even) = ? 1/2 P(D ≥ 4) = ? 1/2 P(D even | D ≥ 4) = ? (2/6) / (1/2) = 2/3 P(D odd | D ≥ 4) = ? (1/6) / (1/2) = 1/3 Multiple conditions: P(D odd | D ≥ 4, D ≤ 5) = ? (1/6) / (2/6) = 1/2
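
These die results can also be verified by brute-force enumeration, computing P(A|B) directly as P(A ∩ B) / P(B) over the six equally likely faces. A small sketch (assumed code, not part of the slides):

    faces = [1, 2, 3, 4, 5, 6]  # six-sided fair die, each face has probability 1/6

    def prob(event):
        return sum(1 for d in faces if event(d)) / len(faces)

    def cond_prob(a, b):
        """P(A | B) = P(A and B) / P(B)."""
        return prob(lambda d: a(d) and b(d)) / prob(b)

    print(cond_prob(lambda d: d % 2 == 0, lambda d: d >= 4))       # 2/3
    print(cond_prob(lambda d: d % 2 == 1, lambda d: d >= 4))       # 1/3
    print(cond_prob(lambda d: d % 2 == 1, lambda d: 4 <= d <= 5))  # 1/2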

Independence Two events are independent when P(A ∩ B) = P(A)P(B) Unless P(B) = 0, this is equivalent to saying that P(A) = P(A|B) If two events are not independent, they are considered dependent

Independence What are some examples of independent events? What about dependent events?

Response

Naïve Bayes Classifier We use Bayes' rule: P(C|D) = P(D|C)P(C) / P(D) Here C = class, D = document We can simplify and ignore P(D) since it is independent of the class choice: P(C|D) ∝ P(D|C)P(C) ≈ P(C) ∏_{i=1..n} P(w_i|C) This estimates the probability of D being in class C, assuming that D has n tokens and w_i is a token in D.

But Wait What is D? D = w_1 w_2 w_3 ... w_n So what is P(D|C), really? P(D|C) = P(w_1 w_2 w_3 ... w_n | C) But w_1 w_2 w_3 ... w_n is not in our training set, so we don't know its probability How can we simplify this?

Conditional Probability Revisited Recall the definition of conditional probability 1. P(A|B) = P(AB) / P(B), or equivalently 2. P(AB) = P(A|B)P(B) What if we have more than two events? P(ABC...N) = P(A|BC...N) P(B|C...N) ... P(M|N) P(N) This is the chain rule for probability; e.g., for three events, P(ABC) = P(A|BC) P(B|C) P(C) We can prove this rule by induction on N

Independence Assumption So what is P(D|C)? = P(w_1 w_2 w_3 ... w_n | C) = P(w_1 | w_2 w_3 ... w_n, C) P(w_2 | w_3 ... w_n, C) ... P(w_n | C) This is still not very helpful We make the naïve assumption that all words occur independently of each other Recall that for independent events w_1 and w_2 we have P(w_1 | w_2) = P(w_1) That's this step! P(D|C) ≈ ∏_{i=1..n} P(w_i|C)
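
Once the independence assumption is in place, scoring a document for a class is just a product of per-word probabilities, usually computed as a sum of logs to avoid floating-point underflow. A minimal sketch under that assumption (the probability values below are made up purely for illustration):

    import math

    # Hypothetical per-word probabilities P(w | C) for some class C.
    p_word_given_c = {"stir": 0.01, "rice": 0.02, "the": 0.05}

    def log_p_doc_given_c(tokens, p_w):
        """log P(D | C) = sum_i log P(w_i | C), by the naive independence assumption."""
        return sum(math.log(p_w[w]) for w in tokens)

    print(log_p_doc_given_c(["stir", "the", "rice"], p_word_given_c))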

Independence Assumptions Is the Naïve Bayes assumption a safe assumption? What are some examples of words that might be dependent on other words?

Response

Using Labeled Training Data P(C|D) ∝ P(C) ∏_{i=1..n} P(w_i|C) P(C) = |D_C| / |D|: the number of training documents with label C divided by the total number of training documents P(w_i|C) = Count(w_i, C) / Σ_{v∈V} Count(v, C): the number of times word w_i occurs with label C divided by the number of times all words in the vocabulary V occur with label C This is the maximum-likelihood estimate (MLE)
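
Putting the pieces together, here is a minimal, unsmoothed sketch of those MLE counts and the resulting classifier; this is my own illustration rather than code from the lecture, and the tiny training set is hypothetical:

    import math
    from collections import Counter, defaultdict

    # Hypothetical labeled training documents: (label, tokens).
    train = [
        ("recipe", ["stir", "the", "rice"]),
        ("recipe", ["boil", "rice"]),
        ("news",   ["the", "attorney", "general"]),
    ]

    doc_counts = Counter(label for label, _ in train)   # for P(C)
    word_counts = defaultdict(Counter)                   # for P(w | C)
    for label, tokens in train:
        word_counts[label].update(tokens)

    def log_prior(c):
        """MLE: P(C) = (# docs with label C) / (total # docs)."""
        return math.log(doc_counts[c] / sum(doc_counts.values()))

    def log_likelihood(w, c):
        """MLE: P(w | C) = Count(w, C) / sum_v Count(v, C)."""
        count = word_counts[c][w]
        if count == 0:
            return float("-inf")  # unseen word gets probability 0 under the plain MLE
        return math.log(count / sum(word_counts[c].values()))

    def classify(tokens):
        return max(doc_counts,
                   key=lambda c: log_prior(c) + sum(log_likelihood(w, c) for w in tokens))

    print(classify(["stir", "rice"]))  # "recipe" on this toy data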

Using Labeled Training Data Can you think of ways to improve this model? Some issues to consider: What if there are words that do not appear in the training set? What if the plural of a word never appears in the training set? How are extremely common words (e.g., "the", "a") handled?

Response
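
One standard answer to the unseen-word issue (my addition, not from the slides) is add-one (Laplace) smoothing of the MLE, which reserves a little probability mass for every vocabulary word. A sketch, reusing toy counts like those in the earlier classifier sketch:

    import math
    from collections import Counter

    # Toy values mirroring the earlier sketch; both are hypothetical.
    word_counts = {"recipe": Counter({"stir": 1, "the": 1, "rice": 2, "boil": 1})}
    vocab = {"stir", "the", "rice", "boil", "attorney", "general"}  # whole training vocabulary V

    def smoothed_log_likelihood(w, c):
        """Add-one smoothing: P(w | C) = (Count(w, C) + 1) / (sum_v Count(v, C) + |V|)."""
        total = sum(word_counts[c].values())
        return math.log((word_counts[c][w] + 1) / (total + len(vocab)))

    print(smoothed_log_likelihood("attorney", "recipe"))  # unseen word now gets a small, nonzero probability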

A Quick Note on the MLE The counts seem intuitively right, but how do we know for sure? We are trying to find values of P(C) and P(w_i|C) that maximize the likelihood of the training set, i.e., we want the largest possible value of P(T) = ∏_{t∈T} P(c_t) ∏_{w_i∈t} P(w_i|c_t) Here T is the training set and t is a training example We can find these values by taking the log, then taking the derivative, then setting it equal to 0 and solving
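
As a sketch of that derivation (my own working, following the standard Lagrange-multiplier argument rather than anything shown on the slides), here is the class-prior case; the word-probability case is analogous:

    \max_{P(C)} \sum_{C} N_C \log P(C) \quad \text{subject to} \quad \sum_{C} P(C) = 1
    \mathcal{L} = \sum_{C} N_C \log P(C) + \lambda \Big( 1 - \sum_{C} P(C) \Big)
    \frac{\partial \mathcal{L}}{\partial P(C)} = \frac{N_C}{P(C)} - \lambda = 0
        \;\Rightarrow\; P(C) = \frac{N_C}{\lambda}
    \sum_{C} P(C) = 1 \;\Rightarrow\; \lambda = \sum_{C} N_C = |D|
        \;\Rightarrow\; P(C) = \frac{N_C}{|D|}

where N_C is the number of training documents with label C, i.e., exactly the count-based estimate from the earlier slide.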

Questions?

Reading for next time: Chapters 5.1-5.5, Speech and Language Processing