Lecture 2: Probability and Language Models


Lecture 2: Probability and Language Models. Intro to NLP, CS585, Fall 2014. Brendan O'Connor (http://brenocon.com)

Admin
- Waitlist
- Moodle access: email me if you don't have it
- Did you get an announcement email? Piazza vs Moodle?
- Office hours today

Things today
- Homework: ambiguities
- Python demo
- Probability review
- Language models


Python demo [TODO link ipython-notebook demo]

For next week, make sure you can run:
- Python 2.7 (built in on Mac & Linux)
- IPython Notebook (http://ipython.org/notebook.html); please familiarize yourself with it (Python 2.7, IPython 2.2.0)
- Nice to have: Matplotlib
- Python interactive interpreter
- Python scripts


Levels of linguistic structure (diagram for the sentence "Alice talked to Bob."):
  Discourse
  Semantics: CommunicationEvent(e), Agent(e, Alice), Recipient(e, Bob), SpeakerContext(s), TemporalBefore(e, s)
  Syntax: parse tree over S, NP, VP, PP with tags NounPrp, VerbPast, Prep, Punct
  Words: Alice talked to Bob .
  Morphology: talk + -ed
  Characters: Alice talked to Bob.

Levels of linguistic structure

Words are fundamental units of meaning and easily identifiable* (*in some languages).
  Words: Alice talked to Bob .
  Characters: Alice talked to Bob.

Probability theory

Review: definitions/laws

Normalization: $1 = \sum_a P(A = a)$

Conditional probability: $P(A \mid B) = \frac{P(AB)}{P(B)}$

Chain rule: $P(AB) = P(A \mid B)\, P(B)$

Law of total probability: $P(A) = \sum_b P(A, B = b) = \sum_b P(A \mid B = b)\, P(B = b)$

Disjunction (union): $P(A \lor B) = P(A) + P(B) - P(AB)$

Negation (complement): $P(\neg A) = 1 - P(A)$
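A quick numerical check of these identities on a small made-up joint distribution (an illustrative sketch, not part of the slides):

    # Check the laws above on a made-up joint distribution P(A, B).
    P = {("a0", "b0"): 0.1, ("a0", "b1"): 0.3,
         ("a1", "b0"): 0.2, ("a1", "b1"): 0.4}

    # Normalization: the joint sums to 1
    assert abs(sum(P.values()) - 1.0) < 1e-12

    # Marginals via the law of total probability
    P_A = {a: sum(p for (a2, b), p in P.items() if a2 == a) for a in ("a0", "a1")}
    P_B = {b: sum(p for (a, b2), p in P.items() if b2 == b) for b in ("b0", "b1")}

    # Conditional probability and chain rule: P(AB) = P(A|B) P(B)
    P_A_given_B = {(a, b): P[(a, b)] / P_B[b] for (a, b) in P}
    assert all(abs(P_A_given_B[(a, b)] * P_B[b] - P[(a, b)]) < 1e-12 for (a, b) in P)

    # Union: P(A=a0 or B=b0) = P(A=a0) + P(B=b0) - P(A=a0, B=b0)
    union = sum(p for (a, b), p in P.items() if a == "a0" or b == "b0")
    assert abs(union - (P_A["a0"] + P_B["b0"] - P[("a0", "b0")])) < 1e-12

    # Complement: P(not A=a0) = 1 - P(A=a0)
    assert abs((1 - P_A["a0"]) - P_A["a1"]) < 1e-12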

Bayes Rule

Want P(H | D) but we only have P(D | H), e.g. H causes D, or P(D | H) is easy to measure. Example: H: who wrote this document? D: the words. Model: each author's word probabilities. Inverting this is Bayesian inference.

$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$  (posterior = likelihood x prior / normalizer)

[Portrait: Rev. Thomas Bayes, c. 1701-1761]

Bayes Rule and its pesky denominator

$P(h \mid d) = \frac{P(d \mid h)\, P(h)}{P(d)} = \frac{P(d \mid h)\, P(h)}{\sum_{h'} P(d \mid h')\, P(h')}$  (likelihood x prior over the normalizer)

$P(h \mid d) = \frac{1}{Z}\, P(d \mid h)\, P(h)$, where Z is whatever makes the posterior sum to 1 across h (the Zustandssumme, "sum over states").

$P(h \mid d) \propto P(d \mid h)\, P(h)$ is the unnormalized posterior: by itself it does not sum to 1! "Proportional to" is implicitly for varying h; this notation is very common, though slightly ambiguous.

Bayes Rule: Discrete

(Bar charts over outcomes a, b, c.)

Prior $P(H = h)$: [0.2, 0.2, 0.6]. Sums to 1? Yes.
Multiply by the likelihood $P(E \mid H = h)$: [0.2, 0.05, 0.05]. Sums to 1? No.
Unnormalized posterior $P(E \mid H = h)\, P(H = h)$: [0.04, 0.01, 0.03]. Sums to 1? No.
Normalize: posterior $\frac{1}{Z}\, P(E \mid H = h)\, P(H = h)$: [0.500, 0.125, 0.375]. Sums to 1? Yes.
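The same computation in a few lines of Python (a minimal sketch using the numbers on the slide):

    # Reproduce the discrete example above: prior, likelihood, posterior over a, b, c.
    prior      = [0.2, 0.2, 0.6]    # P(H = h), sums to 1
    likelihood = [0.2, 0.05, 0.05]  # P(E | H = h), need not sum to 1
    unnorm = [p * l for p, l in zip(prior, likelihood)]   # [0.04, 0.01, 0.03]
    Z = sum(unnorm)
    posterior = [u / Z for u in unnorm]                   # [0.500, 0.125, 0.375]
    print(posterior)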

Bayes Rule: Discrete, uniform prior

(Bar charts over outcomes a, b, c.)

Prior $P(H = h)$: [0.33, 0.33, 0.33] (uniform distribution: uninformative prior). Sums to 1? Yes.
Multiply by the likelihood $P(E \mid H = h)$: [0.2, 0.05, 0.05]. Sums to 1? No.
Unnormalized posterior $P(E \mid H = h)\, P(H = h)$: [0.066, 0.016, 0.016]. Sums to 1? No.
Normalize: posterior $\frac{1}{Z}\, P(E \mid H = h)\, P(H = h)$: [0.66, 0.16, 0.16]. Sums to 1? Yes.

A uniform prior implies that the posterior is just the renormalized likelihood.
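As a quick check (not on the slide): with the uniform prior the unnormalized posterior is $\frac{1}{3}[0.2, 0.05, 0.05] \approx [0.066, 0.016, 0.016]$, so $Z \approx 0.1$ and the posterior is $[0.2, 0.05, 0.05] / 0.3 \approx [0.667, 0.167, 0.167]$, i.e. exactly the likelihood rescaled to sum to 1.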

Bayes Rule for doc classification

Doc label y, words w. Assume a generative process P(w | y); the inference problem is: given w, what is y? If we knew $P(w \mid y)$, we could estimate $P(y \mid w) \propto P(y)\, P(w \mid y)$.

Table 1: Word frequencies for wizards Anna and Barry

           abracadabra         gesundheit
  Anna     5 per 1000 words    6 per 1000 words
  Barry    10 per 1000 words   1 per 1000 words

Look at a random word: it is "abracadabra". Assume a 50% prior probability. What is the probability that the author is Anna?

Now look at two random words: w1 = "abracadabra", w2 = "gesundheit" (same table and 50% prior). What is the probability that the author is Anna?

To handle two words, apply the chain rule: $P(w_1, w_2 \mid y) = P(w_1 \mid w_2, y)\, P(w_2 \mid y)$, then ASSUME conditional independence: $P(w_1, w_2 \mid y) = P(w_1 \mid y)\, P(w_2 \mid y)$.
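A small sketch (not from the slides) that works through both questions with the Table 1 frequencies, assuming the 50% prior and the conditional independence assumption above:

    # Posterior P(author | observed words) for the Anna/Barry example,
    # assuming a 50/50 prior and conditionally independent words.
    priors = {"Anna": 0.5, "Barry": 0.5}
    # per-word probabilities from Table 1 (counts per 1000 words)
    word_probs = {
        "Anna":  {"abracadabra": 5 / 1000.0,  "gesundheit": 6 / 1000.0},
        "Barry": {"abracadabra": 10 / 1000.0, "gesundheit": 1 / 1000.0},
    }

    def posterior(words):
        # unnormalized posterior: prior times product of word likelihoods
        unnorm = {}
        for author in priors:
            p = priors[author]
            for w in words:
                p *= word_probs[author][w]
            unnorm[author] = p
        Z = sum(unnorm.values())
        return {author: p / Z for author, p in unnorm.items()}

    print(posterior(["abracadabra"]))                 # Anna: 1/3, Barry: 2/3
    print(posterior(["abracadabra", "gesundheit"]))   # Anna: 0.75, Barry: 0.25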

Cond. indep. assumption: Naive Bayes

Doc label y, words w. Assume a generative process P(w | y); the inference problem is: given w, what is y?

$P(w_1 \ldots w_T \mid y) = \prod_{t=1}^{T} P(w_t \mid y)$, where each $w_t \in 1..V$ and V is the vocabulary size.

Generative story (multinomial NB [McCallum & Nigam 1998]):
- For each token t in the document,
- the author chooses a word by rolling the same weighted V-sided die.

This model is wrong! How can it possibly be useful for doc classification?
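A minimal sketch of that generative story; the three-word vocabulary and its probabilities are made up for illustration, not taken from the lecture:

    import random

    # illustrative vocabulary and per-class word distribution (a weighted V-sided die)
    vocab = ["abracadabra", "gesundheit", "the"]
    word_probs_given_y = [0.005, 0.006, 0.989]   # P(w_t | y) for one class y

    def generate_doc(T):
        # roll the same weighted die once per token, independently
        doc = []
        for _ in range(T):
            r = random.random()
            cum = 0.0
            for w, p in zip(vocab, word_probs_given_y):
                cum += p
                if r < cum:
                    doc.append(w)
                    break
        return doc

    print(generate_doc(10))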

Bayes Rule for text inference: the noisy channel model

Original text → hypothesized transmission process → observed data. The inference problem runs in reverse: recover the original text from the observed data.

Codebreaking: P(plaintext | encrypted text) ∝ P(encrypted text | plaintext) P(plaintext)
[Images: Enigma machine; Bletchley Park (WWII)]

Speech recognition: P(text | acoustic signal) ∝ P(acoustic signal | text) P(text)

Optical character recognition: P(text | image) ∝ P(image | text) P(text)

"One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
-- Warren Weaver (1955)

Machine translation? P(target text | source text) ∝ P(source text | target text) P(target text)
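A tiny sketch of the generic noisy-channel decision rule behind all of these: pick the source hypothesis maximizing P(observation | hypothesis) times P(hypothesis). The candidate set and probabilities are made up for illustration:

    # Noisy-channel decoding sketch: choose the hypothesis x maximizing
    # P(obs | x) * P(x). The candidates and numbers below are invented.
    candidates = {
        # hypothesis: (prior P(x), likelihood P(obs | x))
        "the cat sat": (0.6, 0.02),
        "the cat sad": (0.4, 0.01),
    }

    best = max(candidates, key=lambda x: candidates[x][0] * candidates[x][1])
    print(best)   # "the cat sat": 0.6 * 0.02 = 0.012 beats 0.4 * 0.01 = 0.004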