Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University


Last class: grammars and parsing in Prolog. Example grammar rules such as S -> NP VP, VP -> Verb NP, Noun -> roller, Verb -> thrills were used to parse the sentence "A roller coaster thrills every teenager".

Today: ambiguity in NLP. Example: books can be a NOUN or a VERB: "You do not need books for this class." vs. "She books her trips early." Another example: "Thank you for not smoking or playing iPods without earphones" can be read as "Thank you for not smoking (...) without earphones". These cases can be detected as special uses of the same word. Caveat: if we write too many rules, we may write unnatural grammars; special rules become general rules, and it puts too large a burden on the person writing the grammar.

Ambiguity in Parsing. Grammar: S -> NP VP, NP -> Det N, NP -> NP PP, NP -> Papa, VP -> V NP, VP -> VP PP, PP -> P NP, N -> caviar, N -> spoon, V -> ate, V -> spoon, P -> with, Det -> the, Det -> a. One parse of "Papa ate the caviar with a spoon" attaches the PP "with a spoon" to the VP (the spoon is the instrument of eating): [S [NP Papa] [VP [VP [V ate] [NP [Det the] [N caviar]]] [PP [P with] [NP [Det a] [N spoon]]]]]

Ambiguity in Parsing (continued). With the same grammar, a second parse attaches the PP "with a spoon" to the object NP "the caviar" (the caviar comes with a spoon): [S [NP Papa] [VP [V ate] [NP [NP [Det the] [N caviar]] [PP [P with] [NP [Det a] [N spoon]]]]]]
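To see both parses concretely, here is a minimal sketch (not from the slides) that encodes the grammar above with NLTK, assuming NLTK is installed; nltk.ChartParser enumerates every parse tree licensed by the grammar.

import nltk

# The toy grammar from the slides: "spoon" is listed as both a noun and a verb,
# and PPs may attach to either a VP or an NP.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | NP PP | 'Papa'
VP -> V NP | VP PP
PP -> P NP
N -> 'caviar' | 'spoon'
V -> 'ate' | 'spoon'
P -> 'with'
Det -> 'the' | 'a'
""")

parser = nltk.ChartParser(grammar)
sentence = "Papa ate the caviar with a spoon".split()

# Prints one tree per analysis: PP attached to the VP, and PP attached to the NP.
for tree in parser.parse(sentence):
    print(tree)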

Scores. (Diagram: test sentences go into a PARSER, which uses the grammar; a SCORER compares the parser's output trees against the correct test trees and reports accuracy.) Recent parsers are quite accurate, good enough to help NLP tasks!
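As an illustration of how such a scorer can compare trees (a simplified sketch in Python, not the slides' actual evaluation setup; the (label, children) tuple format for trees is a hypothetical choice), predicted and gold parses can be reduced to sets of labeled spans and scored by precision and recall.

def labeled_spans(tree, start=0):
    """Collect (label, start, end) spans from a tree written as (label, [children]),
    where each child is either a nested tree or a word string."""
    label, children = tree
    spans = set()
    pos = start
    for child in children:
        if isinstance(child, str):
            pos += 1
        else:
            child_spans, pos = labeled_spans(child, pos)
            spans |= child_spans
    spans.add((label, start, pos))
    return spans, pos

def bracket_score(predicted, gold):
    """Labeled-bracket precision and recall of a predicted tree against a gold tree."""
    pred_spans, _ = labeled_spans(predicted)
    gold_spans, _ = labeled_spans(gold)
    correct = len(pred_spans & gold_spans)
    return correct / len(pred_spans), correct / len(gold_spans)

# Example: two analyses of "ate the caviar" that differ in one label (made up)
gold = ('VP', [('V', ['ate']), ('NP', [('Det', ['the']), ('N', ['caviar'])])])
pred = ('VP', [('V', ['ate']), ('NP', [('Det', ['the']), ('V', ['caviar'])])])
print(bracket_score(pred, gold))  # (0.8, 0.8): 4 of 5 labeled spans match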

Speech processing ambiguity. Speech processing is a very hard problem (gender, accent, background noise). Solution: n-grams, i.e., letter or word frequencies. 1-grams: THE, COURSE are frequent; useful in solving cryptograms. If you know the previous letter/word (2-grams): h is rare in English overall (4%; worth 4 points in Scrabble), but h is common after t (20%)! If you know the previous 2 letters/words (3-grams): h is really common after (space) t.
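A small sketch of how such letter statistics can be gathered (not from the slides; the sample text is a placeholder): count how often each letter follows another and compare the overall frequency of h with its frequency right after t.

from collections import Counter

text = "the three brothers thought that the weather there was hot".lower()
letters = [c for c in text if c.isalpha() or c == ' ']

unigrams = Counter(letters)
bigrams = Counter(zip(letters, letters[1:]))

# Overall frequency of 'h' vs. frequency of 'h' given that the previous letter is 't'
p_h = unigrams['h'] / sum(unigrams.values())
p_h_after_t = bigrams[('t', 'h')] / sum(c for (a, _), c in bigrams.items() if a == 't')
print(p_h, p_h_after_t)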

N-grams. An n-gram is a contiguous sequence of n items from a given sequence; the items can be letters, words, phonemes, syllables, etc. Examples: consider the paragraph: "Suppose you should be walking down Broadway after dinner, with ten minutes allotted to the consummation of your cigar while you are choosing between a diverting tragedy and something serious in the way of vaudeville. Suddenly a hand is laid upon your arm." (from "The Green Door", in O. Henry's collection The Four Million). Unigrams (n-grams of size 1): Suppose, you, should, be, walking, ... Bigrams (n-grams of size 2): Suppose you, you should, should be, be walking, ... Trigrams (n-grams of size 3): Suppose you should, you should be, should be walking, ... Four-grams: Suppose you should be, you should be walking, ...
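The slide's examples can be reproduced with a short Python sketch (not part of the original slides):

text = ("Suppose you should be walking down Broadway after dinner, with ten "
        "minutes allotted to the consummation of your cigar while you are "
        "choosing between a diverting tragedy and something serious in the way "
        "of vaudeville. Suddenly a hand is laid upon your arm.")

words = text.split()

def ngrams(tokens, n):
    """Return all contiguous sequences of n items."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(words, 1)[:4])   # unigrams: Suppose, you, should, be, ...
print(ngrams(words, 2)[:4])   # bigrams: Suppose you, you should, ...
print(ngrams(words, 3)[:4])   # trigrams: Suppose you should, ...
print(ngrams(words, 4)[:4])   # four-grams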

N-grams. An n-gram model is a type of probabilistic language model for predicting the next item in such a sequence. Example: from a novel, we can extract all bigrams (sequences of 2 words) with their probabilities of showing up: probability("you can") = number of "you can"s in the text / total number of bigrams = 0.002; probability("you should") = number of "you should"s in the text / total number of bigrams = 0.001; etc. Consider that the current word is "you" and we want to predict the next word: "can" (bigram probability 0.002) is predicted to be twice as likely as "should" (bigram probability 0.001); the conditional probability of each candidate is obtained by dividing the bigram count by the count of "you", e.g. P(can | you) = count("you can") / count("you"). Advantages: easy training, simple/intuitive probabilistic model.
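A minimal Python sketch of this idea (not from the slides; the tiny corpus below is a placeholder): estimate bigram counts and turn them into conditional next-word probabilities.

from collections import Counter

corpus = "you can go if you should wish and you can stay if you can not".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def next_word_distribution(word):
    """P(next | word) estimated as count(word, next) / count(word)."""
    return {b: c / unigrams[word] for (a, b), c in bigrams.items() if a == word}

print(next_word_distribution("you"))   # e.g. {'can': 0.75, 'should': 0.25}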

N-grams in NLP. N-gram models are widely used in statistical natural language processing (NLP). In speech recognition, phonemes and sequences of phonemes are modeled using an n-gram distribution: due to the multitude of accents, various voice pitches, gender, etc., speech recognition is very error prone, so information about the predicted next phonemes is very useful for detecting the exact transcription. In parsing text, words are modeled such that each n-gram is composed of n words: there is an infinite number of ways to express ideas in natural language, but there are common and standardized ways humans use to convey information, and a probabilistic model helps the parser construct the data structure that best fits the grammatical rules. In language identification, sequences of characters are modeled for different languages (e.g., in English, t and w are followed by an h 90% of the time).

N-grams in NLP. Other applications: plagiarism detection, finding candidates for the correct spelling of a misspelled word, optical character recognition (OCR), machine translation, and error correction (correcting words that were garbled during transmission). Modern statistical models are typically made up of two parts: a prior distribution describing the inherent likelihood of a possible result, and a likelihood function used to assess the compatibility of a possible result with observed data.

Probabilities and statistics. Descriptive: mean scores. Confirmatory: statistically significant? Predictive: what will follow? Probability notation p(X | Y): p(Paul Revere wins | weather's clear) = 0.9 means Revere has won 90% of the races run in clear weather. p(win | clear) = p(win, clear) / p(clear). The notation is syntactic sugar: "win, clear" is a logical conjunction of predicates, and "clear" is a predicate selecting the races where the weather is clear.
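A small sketch of this definition in code (the race records below are invented for illustration): p(win | clear) computed as p(win, clear) / p(clear) from counts.

# Each record is (Paul Revere wins?, weather is clear?) -- hypothetical data
races = [(True, True), (True, True), (False, True), (True, False), (False, False)]

p_clear = sum(1 for _, clear in races if clear) / len(races)
p_win_and_clear = sum(1 for win, clear in races if win and clear) / len(races)

p_win_given_clear = p_win_and_clear / p_clear
print(p_win_given_clear)   # 2 wins out of 3 clear-weather races = 0.666...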

Properties of p (axioms): p(∅) = 0; p(all outcomes) = 1; p(X) ≤ p(Y) for any X ⊆ Y; p(X ∪ Y) = p(X) + p(Y) provided X ∩ Y = ∅. Example: p(win & clear) + p(win & ¬clear) = p(win).

Properties and conjunction. What happens as we add conjuncts to the left of the bar? p(Paul Revere wins, Valentine places, Epitaph shows | weather's clear): the probability can only decrease. What happens as we add conjuncts to the right of the bar? p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain): the probability can increase or decrease. Simplifying the right side (backing off): dropping some of the conditions gives a reasonable (if not exact) estimate, e.g. p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain) ≈ p(Paul Revere wins | weather's clear).
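A rough sketch of the backing-off idea (not from the slides; the data and the rule for dropping conditions are simplified assumptions): when the full conditioning context has never been observed, estimate from races that match fewer conditions.

# Hypothetical race records: (wins?, set of conditions that held)
races = [(True,  frozenset({"clear", "dry"})),
         (False, frozenset({"clear", "dry"})),
         (True,  frozenset({"clear"})),
         (False, frozenset({"rainy"}))]

def estimate(conditions):
    """P(win | conditions), backing off to fewer conditions if the context is unseen."""
    conditions = set(conditions)
    while True:
        matching = [win for win, held in races if conditions <= held]
        if matching:
            return sum(matching) / len(matching)
        if not conditions:
            return None  # no data at all
        conditions.pop()  # back off: drop one (arbitrary) condition

# Backs off until some races match; the exact value depends on which condition is dropped.
print(estimate({"clear", "dry", "sprain"}))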

Factoring the left side: the chain rule. p(Revere, Valentine, Epitaph | weather's clear) = p(Revere | Valentine, Epitaph, weather's clear) * p(Valentine | Epitaph, weather's clear) * p(Epitaph | weather's clear). True because numerators cancel against denominators; makes perfect sense when read from bottom to top. (Diagram: a decision tree branching on Epitaph?, Valentine?, Revere?; the probabilities along a path multiply, e.g. 1/3 * 1/5 * 1/4.)
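A quick numeric check of the chain rule (a sketch with a made-up joint distribution over the three events, all conditioned on clear weather):

# Hypothetical joint distribution p(Revere, Valentine, Epitaph | clear); entries sum to 1
joint = {
    (True, True, True): 0.05,  (True, True, False): 0.10,
    (True, False, True): 0.05, (True, False, False): 0.20,
    (False, True, True): 0.10, (False, True, False): 0.15,
    (False, False, True): 0.05, (False, False, False): 0.30}

def marginal(**fixed):
    """Sum the joint over all outcomes consistent with the fixed values."""
    keys = ("revere", "valentine", "epitaph")
    return sum(p for outcome, p in joint.items()
               if all(outcome[keys.index(k)] == v for k, v in fixed.items()))

# Chain rule: p(R, V, E) = p(R | V, E) * p(V | E) * p(E)
lhs = joint[(True, True, True)]
rhs = (marginal(revere=True, valentine=True, epitaph=True) / marginal(valentine=True, epitaph=True)
       * (marginal(valentine=True, epitaph=True) / marginal(epitaph=True))
       * marginal(epitaph=True))
print(round(lhs, 10) == round(rhs, 10))   # True: the factors telescope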

Factoring the left side: the chain rule. p(Revere | Valentine, Epitaph, weather's clear): conditional independence lets us use backed-off data from all four of these cases (Valentine and Epitaph each yes or no) to estimate their shared probability. If this probability is unchanged by backoff, we say Revere was CONDITIONALLY INDEPENDENT of Valentine and Epitaph (conditioned on the weather's being clear). (Diagram: in every branch of the decision tree, Revere wins with probability 1/4 and loses with probability 3/4, regardless of the Valentine and Epitaph outcomes; once "clear?" is known, they are irrelevant.)

Bayes' Theorem: p(A | B) = p(B | A) * p(A) / p(B). Easy to check by removing the syntactic sugar. Use 1: converts p(B | A) to p(A | B). Use 2: updates p(A) to p(A | B).
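A short worked example (the numbers are invented, not from the slides): given p(clear), p(win), and p(clear | win), Bayes' theorem recovers p(win | clear).

p_clear = 0.5            # hypothetical prior probability of clear weather
p_win = 0.25             # hypothetical prior probability that Revere wins
p_clear_given_win = 0.8  # hypothetical likelihood

# Bayes' theorem: p(win | clear) = p(clear | win) * p(win) / p(clear)
p_win_given_clear = p_clear_given_win * p_win / p_clear
print(p_win_given_clear)  # 0.4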

Probabilistic Algorithms. Example: the Viterbi algorithm computes the probability of a sequence of observed events and the most likely sequence of hidden states (the Viterbi path) that results in that sequence of observed events. http://www.cs.stonybrook.edu/~pfodor/old_page/viterbi/viterbi.p
forward_viterbi(+Observations, +States, +Start_probabilities, +Transition_probabilities, +Emission_probabilities, -Prob, -Viterbi_path, -Viterbi_prob)
The query forward_viterbi(['walk', 'shop', 'clean'], ['Rainy', 'Sunny'], [0.6, 0.4], [[0.7, 0.3], [0.4, 0.6]], [[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]], Prob, Viterbi_path, Viterbi_prob) will return: Prob = 0.03361, Viterbi_path = [Sunny, Rainy, Rainy, Rainy], Viterbi_prob = 0.0094.

Viterbi Algorithm. Alice and Bob live far apart from each other. Bob does three activities: walks in the park, shops, and cleans his apartment. Alice has no definite information about the weather where Bob lives, so she tries to guess what the weather is based on what Bob does. The weather operates as a discrete Markov chain; there are two states, "Rainy" and "Sunny", hidden to Alice. start_probability = {'Rainy': 0.6, 'Sunny': 0.4} (Wikipedia)

Viterbi Algorithm. A dynamic programming algorithm. Input: a first-order hidden Markov model (HMM) with states Y, initial probabilities π_i of being in state i, transition probabilities a_{i,j} of transitioning from state i to state j, and observations x_0, ..., x_T. Output: the state sequence y_0, ..., y_T most likely to have produced the observations. V_{t,k} is the probability of the most probable state sequence for the first t + 1 observations that ends in state k: V_{0,k} = P(x_0 | k) · π_k, and V_{t,k} = P(x_t | k) · max over y in Y of (a_{y,k} · V_{t-1,y}). (Wikipedia)

Viterbi Algorithm. The transition_probability represents how the weather changes from one day to the next:
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
The emission_probability represents how likely Bob is to perform a certain activity on each day:
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}
(Wikipedia)

Viterbi Algorithm. Alice talks to Bob and discovers the history of his activities: on the first day he went for a walk, on the second day he went shopping, and on the third day he cleaned his apartment, i.e., ['walk', 'shop', 'clean']. What is the most likely sequence of rainy/sunny days that would explain these observations? (Wikipedia)
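Below is a minimal Viterbi sketch in Python (not the course's Prolog implementation) that uses the start, transition, and emission probabilities from the previous slides and recovers the most likely weather sequence for ['walk', 'shop', 'clean'].

states = ['Rainy', 'Sunny']
start_probability = {'Rainy': 0.6, 'Sunny': 0.4}
transition_probability = {
    'Rainy': {'Rainy': 0.7, 'Sunny': 0.3},
    'Sunny': {'Rainy': 0.4, 'Sunny': 0.6}}
emission_probability = {
    'Rainy': {'walk': 0.1, 'shop': 0.4, 'clean': 0.5},
    'Sunny': {'walk': 0.6, 'shop': 0.3, 'clean': 0.1}}

def viterbi(observations):
    # V[k] = probability of the best state sequence for the observations so far
    # that ends in state k; path[k] = that sequence of states.
    V = {k: start_probability[k] * emission_probability[k][observations[0]]
         for k in states}
    path = {k: [k] for k in states}
    for obs in observations[1:]:
        new_V, new_path = {}, {}
        for k in states:
            best_prev = max(states, key=lambda y: V[y] * transition_probability[y][k])
            new_V[k] = (V[best_prev] * transition_probability[best_prev][k]
                        * emission_probability[k][obs])
            new_path[k] = path[best_prev] + [k]
        V, path = new_V, new_path
    best_last = max(states, key=lambda k: V[k])
    return path[best_last], V[best_last]

print(viterbi(['walk', 'shop', 'clean']))  # (['Sunny', 'Rainy', 'Rainy'], 0.01344)

The forward_viterbi predicate quoted earlier follows an older formulation that also reports the total probability of the observation sequence and appends one extra predicted state to the path, which appears to be why its reported numbers (0.03361 and 0.0094) differ from the 0.01344 over three states computed here.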
