Language Technology. Unit 1: Sequence Models. CUNY Graduate Center. Lecture 4a: Probabilities and Estimations


Language Technology, CUNY Graduate Center. Unit 1: Sequence Models. Lecture 4a: Probabilities and Estimations. Lecture 4b: Weighted Finite-State Machines. Liang Huang

Probabilities. An experiment (e.g., toss a coin 3 times) has a set of basic outcomes Ω (e.g., Ω = {HHH, HHT, HTH, ..., TTT}). An event is some subset A of Ω (e.g., A = "heads exactly twice"). A probability distribution is a function p from Ω to [0, 1] with Σ_{e ∈ Ω} p(e) = 1. The probability of an event (a marginal) is p(A) = Σ_{e ∈ A} p(e). 2
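A minimal sketch in Python of the coin example above (assuming a fair coin, which the slide does not state), showing how a distribution over basic outcomes induces probabilities of events:

```python
from itertools import product

# Sample space: all sequences of 3 coin tosses.
omega = ["".join(seq) for seq in product("HT", repeat=3)]

# A fair-coin distribution: each basic outcome gets probability 1/8.
p = {outcome: 1.0 / len(omega) for outcome in omega}
assert abs(sum(p.values()) - 1.0) < 1e-12  # distribution sums to 1

# Event A = "exactly two heads": a subset of omega.
A = {outcome for outcome in omega if outcome.count("H") == 2}

# p(A) = sum of p(e) over basic outcomes e in A.
p_A = sum(p[e] for e in A)
print(p_A)  # 0.375
```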

Joint and Conditional Probs 3

Multiplication Rule 4

Independence: P(A, B) = P(A) P(B), or equivalently P(A) = P(A | B). Disjoint events are always dependent! (P(A, B) = 0, unless one of them is impossible: P(A) = 0.) Conditional independence: P(A, B | C) = P(A | C) P(B | C), or equivalently P(A | C) = P(A | B, C). 5
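A small sketch (with toy numbers of my own, not from the slides) that checks independence directly on a joint table, and marginalizes along the way:

```python
# Toy joint distribution over two binary variables A and B (assumed numbers).
# A and B are independent here by construction: p(a, b) = p(a) * p(b).
p_joint = {(a, b): (0.3 if a else 0.7) * (0.6 if b else 0.4)
           for a in (0, 1) for b in (0, 1)}

def marginal(p, axis):
    """Marginalize a joint table p[(a, b)] onto one variable (axis 0 for A, 1 for B)."""
    out = {}
    for cell, pr in p.items():
        key = cell[axis]
        out[key] = out.get(key, 0.0) + pr
    return out

p_A, p_B = marginal(p_joint, 0), marginal(p_joint, 1)

# Independence check: p(a, b) == p(a) * p(b) for every cell.
independent = all(abs(p_joint[(a, b)] - p_A[a] * p_B[b]) < 1e-12
                  for a in (0, 1) for b in (0, 1))
print(independent)  # True

# Note: disjoint events with nonzero probability can never pass this test,
# since p(A, B) = 0 while p(A) * p(B) > 0.
```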

Marginalization: compute marginal probs from joint/conditional probs, P(A) = Σ_b P(A, B=b) = Σ_b P(A | B=b) P(B=b). 6

Bayes' Rule: P(A | B) = P(B | A) P(A) / P(B). Alternative Bayes' rule by partition: expand the denominator as P(B) = Σ_i P(B | A_i) P(A_i) over a partition {A_i} of Ω. 7
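A hedged sketch (toy numbers of my own) of Bayes' rule with the denominator expanded over a partition:

```python
# Two-class example: partition {A1, A2} with priors p(A_i) and likelihoods p(B | A_i).
# The numbers are assumptions for illustration only.
prior = {"A1": 0.2, "A2": 0.8}
likelihood_B = {"A1": 0.9, "A2": 0.1}   # p(B | A_i)

# Bayes' rule by partition: p(A_i | B) = p(B | A_i) p(A_i) / sum_j p(B | A_j) p(A_j)
evidence = sum(likelihood_B[a] * prior[a] for a in prior)        # p(B) by marginalization
posterior = {a: likelihood_B[a] * prior[a] / evidence for a in prior}
print(posterior)  # {'A1': 0.692..., 'A2': 0.307...}
```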

Most Likely Event 8

Most Likely Given... 9

Estimating Probabilities. How do we get probabilities for basic outcomes? Do experiments and count. E.g., how often do people start a sentence with "the"? P(A) = (# of sentences like "the ..." in the sample) / (# of all sentences in the sample). More generally, P(A | B) = count(A, B) / count(B). We will show that this is Maximum Likelihood Estimation. 10
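A minimal sketch of relative-frequency (maximum likelihood) estimation on a tiny made-up corpus (the sentences below are assumptions for illustration):

```python
from collections import Counter

# Tiny made-up "corpus" of tokenized sentences.
corpus = [
    ["the", "dog", "barked"],
    ["a", "cat", "slept"],
    ["the", "cat", "ran"],
    ["dogs", "bark"],
]

# P(sentence starts with "the") = count / total: the relative-frequency (MLE) estimate.
p_the_first = sum(1 for s in corpus if s[0] == "the") / len(corpus)
print(p_the_first)  # 0.5

# Conditional estimate P(next word = w | previous word = "the") = count("the", w) / count("the").
bigrams = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
unigrams = Counter(w for s in corpus for w in s)
print(bigrams[("the", "cat")] / unigrams["the"])  # 0.5
```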

Probabilistic Finite-State Machines: adding probabilities to finite-state acceptors (FSAs). FSA: a set of strings; WFSA: a distribution over strings. 11

WFSA normalization: the probabilities of transitions leaving each state sum to 1. Does this define a distribution over strings, or a distribution over paths? It defines a distribution over paths, which also induces a distribution over strings (sum over all paths that yield the same string). 12
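A small sketch (with a made-up two-state WFSA) of how a path distribution induces a string probability, by summing over all accepting paths labeled with the string:

```python
# A tiny hand-made WFSA: states 0 and 1, start state 0.
# arcs[state] = list of (symbol, next_state, probability); the arcs leaving each state,
# together with that state's stop probability, sum to 1.
arcs = {
    0: [("a", 0, 0.3), ("a", 1, 0.3), ("b", 1, 0.2)],
    1: [("b", 1, 0.5)],
}
stop = {0: 0.2, 1: 0.5}   # probability of stopping at each state

def string_prob(s, state=0, prob=1.0):
    """p(s) = sum over all accepting paths labeled s of the product of arc probabilities."""
    if not s:
        return prob * stop[state]
    total = 0.0
    for symbol, nxt, p in arcs.get(state, []):
        if symbol == s[0]:
            total += string_prob(s[1:], nxt, prob * p)
    return total

# "ab" can be read along two different paths (0->0->1 and 0->1->1):
print(string_prob("ab"))   # 0.3*0.2*0.5 + 0.3*0.5*0.5 = 0.105
```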

WFSTs. FST: a relation over strings (a set of string pairs). WFST: a probabilistic relation over strings (a set of triples ⟨s, t, p⟩: string pair ⟨s, t⟩ with probability p). What does p represent? 13

Edit Distance as WFST. This is simplified edit distance; real edit distance is an example of a WFST, but not a PFST. Costs: replacement 1, insertion 2, deletion 2. [Diagram: a single-state WFST with arcs such as a:a/0, b:b/0, a:b/1, b:a/1, a:ε/2, ε:a/2.] 14

Normalization. If the probabilities of the transitions leaving each state, for each input symbol, sum to 1, then the WFST defines a conditional probability p(y | x) for x ⇒ y. What if we want to define a joint probability p(x, y) for x ⇒ y? What if we want p(x | y)? 15

Questions of WFSTs: given x and y, what is p(y | x)? For a given x, what's the y that maximizes p(y | x)? For a given y, what's the x that maximizes p(y | x)? For a given x, supply all outputs y with their respective p(y | x). For a given y, supply all inputs x with their respective p(y | x). 16

Answer: Composition. Is p(z | x) = p(y | x) p(z | y)? Not quite: p(z | x) = Σ_y p(y | x) p(z | y); we have to sum over y. (Given y, z and x are independent in this cascade. Why?) How do we build a composed WFST C out of WFSTs A and B? Again, like intersection: sum up the products, i.e., the (+, ×) semiring. 17
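A hedged sketch of the "sum over the intermediate y" idea on plain conditional-probability tables (made-up values); this is exactly what composition computes in the (+, ×) semiring:

```python
# p_yx[x][y] = p(y | x) for machine A; p_zy[y][z] = p(z | y) for machine B.
# The values are assumptions for illustration only.
p_yx = {"x1": {"y1": 0.7, "y2": 0.3}}
p_zy = {"y1": {"z1": 0.4, "z2": 0.6},
        "y2": {"z1": 0.9, "z2": 0.1}}

def compose(p_yx, p_zy):
    """p(z | x) = sum_y p(y | x) * p(z | y): sum of products over the hidden y."""
    p_zx = {}
    for x, ys in p_yx.items():
        p_zx[x] = {}
        for y, pyx in ys.items():
            for z, pzy in p_zy[y].items():
                p_zx[x][z] = p_zx[x].get(z, 0.0) + pyx * pzy
    return p_zx

print(compose(p_yx, p_zy))
# {'x1': {'z1': 0.55, 'z2': 0.45}}  i.e. 0.7*0.4 + 0.3*0.9 and 0.7*0.6 + 0.3*0.1
```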

Example 18

Example (from M. Mohri and J. Eisner); they use the (min, +) semiring, we use (+, ×). 19

Example they use (min, +), we use (+, *) 20

Example they use (min, +), we use (+, *) 21

Given x, supply all outputs y. The resulting weights are no longer normalized! 22

Given x and y, what's p(y | x)? 23

Given x, what's max_y p(y | x)? 24

Part-of-Speech Tagging Again 25

Part-of-Speech Tagging Again 26

Adding a Tag Bigram Model (again). FST C: a POS bigram LM. p(w1...wn), p(t1...tn | w1...wn), p(t1...tn), p(???): wait, is that right (mathematically)? 27

Noisy-Channel Model 28

Noisy-Channel Model: a source model p(t1...tn) and a channel model p(w1...wn | t1...tn); decoding picks the t1...tn maximizing p(t1...tn) p(w1...wn | t1...tn). 29
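A hedged sketch of noisy-channel scoring for tagging, with made-up tag-bigram and emission probabilities (the parameter values and the tiny tag set are assumptions, not from the slides):

```python
import math

# Toy parameters: a tag-bigram source model p(t_i | t_{i-1}) and a channel model p(w_i | t_i).
trans = {("<s>", "DT"): 0.6, ("<s>", "NN"): 0.4, ("DT", "NN"): 0.9, ("NN", "VB"): 0.5,
         ("DT", "VB"): 0.1, ("NN", "NN"): 0.3}
emit = {("DT", "the"): 0.7, ("NN", "dog"): 0.1, ("VB", "barks"): 0.2, ("NN", "barks"): 0.01}

def noisy_channel_logprob(words, tags):
    """log [ p(t1..tn) * p(w1..wn | t1..tn) ] under the bigram source and per-tag channel."""
    lp, prev = 0.0, "<s>"
    for w, t in zip(words, tags):
        lp += math.log(trans.get((prev, t), 1e-12))   # source: tag bigram
        lp += math.log(emit.get((t, w), 1e-12))       # channel: word given tag
        prev = t
    return lp

words = ["the", "dog", "barks"]
print(noisy_channel_logprob(words, ["DT", "NN", "VB"]))   # higher (better) score
print(noisy_channel_logprob(words, ["DT", "NN", "NN"]))   # lower score
```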

Applications of Noisy-Channel 30

Example: Edit Distance (from J. Eisner). [Diagram: a single-state edit-distance transducer with O(k) deletion arcs (a:ε, b:ε), O(k) insertion arcs (ε:a, ε:b), O(k) identity arcs (a:a, b:b), and substitution arcs (a:b, b:a).] 31

Example: Edit Distance. clara .o. EditDistance .o. caca: compose the input string clara with the edit-distance transducer and the output string caca; the best path (by Dijkstra's algorithm) gives the minimum-cost alignment. [Diagram: the composed lattice of identity, substitution, insertion (ε:x), and deletion (x:ε) arcs between clara and caca.] 32
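Rather than building the composed transducer, here is a small dynamic-programming sketch that computes the same minimum cost the best path would have, using the costs from the earlier slide (replacement 1, insertion 2, deletion 2):

```python
def edit_cost(src, tgt, sub=1, ins=2, dele=2):
    """Cheapest cost of transducing src into tgt with the given arc costs."""
    m, n = len(src), len(tgt)
    # d[i][j] = cheapest way to transduce src[:i] into tgt[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * dele
    for j in range(1, n + 1):
        d[0][j] = j * ins
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (0 if src[i - 1] == tgt[j - 1] else sub),  # identity / substitution
                d[i - 1][j] + dele,   # delete src[i-1]  (x:eps arc)
                d[i][j - 1] + ins,    # insert tgt[j-1]  (eps:x arc)
            )
    return d[m][n]

print(edit_cost("clara", "caca"))  # 3: keep c, delete l (2), keep a, substitute r->c (1), keep a
```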

Max / Sum Probs. In a WFSA, which string x has the greatest p(x)? This is a graph search (shortest path) problem: Dijkstra's algorithm, or Viterbi if the FSA is acyclic. Does it work for an NFA? The best path is much easier than the best string; you can determinize the NFA (with exponential cost!). A popular work-around: n-best list crunching. [Photos: Edsger Dijkstra (1930-2002), "GOTO considered harmful"; Andrew Viterbi (b. 1935), Viterbi algorithm (1967), CDMA, Qualcomm.] 33
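A minimal sketch of the best-path computation on a small acyclic WFSA (a made-up example): sweep the states in topological order, keeping the highest path probability into each state. Note that in an NFA several paths can share a label string, so the best string may differ from the best path.

```python
# A tiny acyclic WFSA: arcs[state] = [(symbol, next_state, prob)].
# States 0..3 are already in topological order; state 3 is final.
arcs = {0: [("a", 1, 0.6), ("b", 2, 0.4)],
        1: [("c", 3, 0.5), ("d", 3, 0.5)],
        2: [("c", 3, 1.0)]}
final = {3}

def viterbi_best_path(arcs, final, start=0, n_states=4):
    """Highest-probability path (and its label string) from start to a final state."""
    best = {s: (0.0, []) for s in range(n_states)}
    best[start] = (1.0, [])
    for s in range(n_states):                      # topological order
        p, labels = best[s]
        for symbol, nxt, ap in arcs.get(s, []):
            cand = (p * ap, labels + [symbol])
            if cand[0] > best[nxt][0]:
                best[nxt] = cand
    return max((best[s] for s in final), key=lambda t: t[0])

print(viterbi_best_path(arcs, final))   # (0.4, ['b', 'c'])
```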

Dijkstra 1959 vs. Viterbi 1967. Edsger Dijkstra (1930-2002), "GOTO considered harmful". That one is the minimum spanning tree algorithm: Jarník (1930) - Prim (1957) - Dijkstra (1959). 34

Dijkstra 1959 vs. Viterbi 1967. That one is shortest path: Moore (1957) - Dijkstra (1959). 35

Dijkstra 1959 vs. Viterbi 1967: a special case of dynamic programming (Bellman, 1957). 36

Sum Probs what is P(x) for some particular x? for DFA, just follow x for NFA, get a subgraph (by composition), then sum?? acyclic => Viterbi cyclic => compute strongly connected components SCC-DAG cluster graph (cyclic locally, acyclic globally) do infinite sum (matrix inversion) locally, Viterbi globally refer to extra readings on course website 37