Language as a Stochastic Process


CS769 Spring 2010 Advanced Natural Language Processing

Lecturer: Xiaojin Zhu, jerryzhu@cs.wisc.edu

1 Basic Statistics for NLP

Pick an arbitrary letter x at random from any English text ever written. What is x?

Random variable X: the random outcome of an experiment.

Probability distribution P(X = x) = \theta_x, for x \in \{1, \ldots, 27\}, if our alphabet consists of \{a, b, c, \ldots, z, \_\} (the underscore standing for the space symbol). We often use the probability vector \theta = (\theta_1, \ldots, \theta_{27}). Basic properties: 0 \le \theta_x \le 1 and \sum_x \theta_x = 1.

A useful analogy for \theta: a k-sided die. Here k = 27. If k = 2, it is a coin. A die/coin is fair when \theta_1 = \ldots = \theta_k = 1/k; otherwise it is biased.

Sampling a letter x from the distribution \theta = (\theta_1, \ldots, \theta_k):

1. Create consecutive intervals of lengths \theta_1, \ldots, \theta_k.
2. Generate a uniform random number r \in [0, 1] (most programming languages have a random() function for this).
3. Find the interval r falls into, and output the interval index.
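The three steps above map directly to a few lines of code. Here is a minimal Python sketch of the interval method; the 27-symbol alphabet and the uniform \theta are placeholders for illustration, not estimates from any real corpus.

```python
import random

# Toy alphabet {a..z, _}; theta is a fair 27-sided "die" purely for illustration.
alphabet = list("abcdefghijklmnopqrstuvwxyz") + ["_"]
theta = [1.0 / len(alphabet)] * len(alphabet)

def sample_letter(alphabet, theta):
    """Sample one symbol with the three-step interval method."""
    r = random.random()                 # step 2: uniform r in [0, 1)
    cumulative = 0.0
    for symbol, p in zip(alphabet, theta):
        cumulative += p                 # step 1: consecutive intervals of lengths theta_i
        if r < cumulative:              # step 3: the interval r falls into
            return symbol
    return alphabet[-1]                 # guard against floating-point round-off

print("".join(sample_letter(alphabet, theta) for _ in range(40)))
```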

The multinomial distribution for a k-sided die with probability vector \theta, N throws, and outcome counts n_1, \ldots, n_k is

P(n_1, \ldots, n_k \mid \theta) = \binom{N}{n_1 \cdots n_k} \prod_{i=1}^{k} \theta_i^{n_i}.   (1)

Going back to language: if we ignore letter order, we can model the letters as draws from a multinomial distribution. The MLE of the parameter \theta is the frequency of letters in the English language (not computable exactly, since we cannot collect all English text ever written). A corpus (pl. corpora) is a large, representative collection of text. See the Linguistic Data Consortium (LDC) for some examples.

The likelihood function is P(Data \mid \theta). Here Data = (n_1, \ldots, n_k), and the likelihood takes the form of (1). The likelihood is a function of \theta, and in general it is not normalized: \int P(Data \mid \theta) \, d\theta \neq 1.

Conditional probability P(X = x \mid H = h): given that the latest letter is h, what is the next letter x? How do we estimate conditional probabilities? Given a corpus with multiple sentences:

i am a student
i like this class
...

let us add a special start-of-sentence symbol <s> to each sentence:

<s> i am a student
<s> i like this class
...

and break each sentence into pairs of letters. Note we do not cross sentence boundaries, e.g., there is no pair (t <s>):

(<s> i) (i _) (_ a) (a m) ... (n t)
(<s> i) (i _) (_ l) (l i) ... (s s)
...

Now our random variables x, h take values in \{<s>, a, \ldots, z, \_\}. Let c_{hx} be the count of the pair (h x) in the corpus. We want to estimate the parameters \theta = \{\theta_{hx} = P(x \mid h)\}. Note this parameter set is much bigger than before! We can arrange it in a transition matrix with rows for h and columns for x.

Intuition: P(x \mid h) = c_{hx} / \sum_{x'} c_{hx'}. It turns out to be the MLE of \theta. Since a sentence always starts with <s>, the likelihood function P(Data \mid \theta) is

P(i \mid <s>) P(\_ \mid i) \cdots P(t \mid n) \, P(i \mid <s>) \cdots   (2)
= \theta_{<s> i} \, \theta_{i \_} \cdots   (3)
= \prod_{h,x} \theta_{hx}^{c_{hx}},   (4)

with the constraints 0 \le \theta_{hx} \le 1 and \sum_x \theta_{hx} = 1 for all h. One can solve the MLE with a Lagrange multiplier as before.
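As a concrete illustration of the counting recipe, here is a small Python sketch that builds the pair counts c_{hx} and the frequency-based conditionals on the two example sentences above; the <s> marker and the underscore-for-space convention follow the text, and the corpus is of course far too small to be representative.

```python
from collections import defaultdict

# The two example sentences from the notes; "_" stands for the space symbol,
# "<s>" for start-of-sentence.
sentences = ["i am a student", "i like this class"]

counts = defaultdict(lambda: defaultdict(int))      # counts[h][x] = c_hx
for sent in sentences:
    symbols = ["<s>"] + list(sent.replace(" ", "_"))
    for h, x in zip(symbols, symbols[1:]):          # pairs never cross sentence boundaries
        counts[h][x] += 1

# Frequency estimate of the conditionals: P(x | h) = c_hx / sum_x' c_hx'
P = {h: {x: c / sum(row.values()) for x, c in row.items()}
     for h, row in counts.items()}

print(P["<s>"])   # {'i': 1.0}: both sentences start with 'i'
print(P["a"])     # what follows the letter 'a' in this toy corpus
```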

The joint probability P(x, h) = P(h, x) is the probability of the (much rarer) event that both outcomes happen. The MLE is P(x, h) = c_{hx} / \sum_{h,x} c_{hx}. The marginal probability is obtained by summing over some of the random variables; for example, P(h) = \sum_x P(h, x). The relation between them is

P(x, h) = P(x \mid h) P(h).   (5)

Now, h and x do not have to be the same type of random variable. Consider this: randomly pick a word, and randomly pick a letter from the word. Let h be the length (number of letters) of the word, and x the identity of the letter. It is perfectly fine to ask for P(x = a \mid h = 2). What if you are told the letter is a, and you have to guess the length of the word? Bayes rule allows you to flip the conditional probability around:

P(x \mid h) = \frac{P(x, h)}{P(h)} = \frac{P(h \mid x) P(x)}{\sum_{x'} P(h \mid x') P(x')} = \frac{P(h \mid x) P(x)}{P(h)}.   (6)

2 Parameter Estimation (Statistics) = Learning a Model (Machine Learning)

Now, given N draws from an unknown \theta, and observing the count histogram n_1, \ldots, n_k, can we estimate \theta? In addition, how do we predict the next draw?

Example 1  If in N = 10 coin flips we observe n_1 = 4 heads and n_2 = 6 tails, what is \theta? Intuition says \theta = (0.4, 0.6). But in fact pretty much any \theta could have generated the observed counts. How do we pick one? Should we pick one?

Parameter estimation and future prediction are two central problems of statistics and of machine learning.

2.1 The MLE Estimate

The Maximum Likelihood Estimate (MLE) is

\theta^{MLE} = \arg\max_\theta P(Data \mid \theta).   (7)

Deriving the MLE of a multinomial (here V plays the role of k, and the counts c_w play the role of the n_i):

\theta^{ML} = \arg\max_{\theta \in V\text{-simplex}} P(c_{1:V} \mid \theta)   (8)
  = \arg\max_\theta \prod_{w=1}^{V} \theta_w^{c_w}   (multinomial definition)   (9)
  = \arg\max_\theta \sum_{w=1}^{V} c_w \log \theta_w   (log() is monotonic)   (10)

We are faced with the constrained optimization problem of finding \theta_{1:V}:

\max_{\theta_{1:V}} \sum_{w=1}^{V} c_w \log \theta_w   (11)
subject to \sum_{w=1}^{V} \theta_w = 1.   (12)

The general procedure to solve equality-constrained optimization problems is the following: we introduce a scalar \beta called a Lagrange multiplier (one for each constraint), rewrite the equality constraint as E(x) = 0, and define a new Lagrangian function of the form G(x, \beta) = F(x) - \beta E(x), where F(x) is the original objective. We then solve the unconstrained optimization problem on G. In our case, the Lagrangian is

\sum_{w=1}^{V} c_w \log \theta_w - \beta \left( \sum_{w=1}^{V} \theta_w - 1 \right).   (13)

After verifying that this is a concave function, we set the gradient (w.r.t. \theta_{1:V} and \beta) to zero:

\frac{c_w}{\theta_w} - \beta = 0, \quad \forall w   (14)
\sum_{w=1}^{V} \theta_w - 1 = 0,   (15)

which gives

\theta_w^{ML} = \frac{c_w}{\sum_{v=1}^{V} c_v} = \frac{c_w}{C},   (16)

where C = \sum_{w=1}^{V} c_w is the length of the corpus. It can be seen that the purpose of \beta is normalization. Therefore, in this case the MLE is simply the frequency estimate!

2.2 The MAP Estimate

The Maximum A Posteriori (MAP) estimate is

\theta^{MAP} = \arg\max_\theta P(\theta \mid Data) = \arg\max_\theta P(\theta) P(Data \mid \theta).   (17)

We need a prior distribution P(\theta), which is usually taken to be the Dirichlet distribution since it is conjugate to the multinomial:

P(\theta) = Dir(\theta \mid \alpha_1, \ldots, \alpha_k) = \frac{\Gamma(\sum_{i=1}^k \alpha_i)}{\prod_{i=1}^k \Gamma(\alpha_i)} \prod_{i=1}^k \theta_i^{\alpha_i - 1}.   (18)

The posterior is again a Dirichlet:

P(\theta \mid Data) = Dir(\theta \mid \alpha_1 + n_1, \ldots, \alpha_k + n_k) = \frac{\Gamma(\sum_{i=1}^k (\alpha_i + n_i))}{\prod_{i=1}^k \Gamma(\alpha_i + n_i)} \prod_{i=1}^k \theta_i^{\alpha_i + n_i - 1}.   (19)

The mode of the posterior (i.e., the MAP estimate) is

\theta_i = \frac{n_i + \alpha_i - 1}{N + \sum_{j=1}^k \alpha_j - k}.   (20)
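To make (16) and (20) concrete, here is a tiny Python sketch on the coin example above. The symmetric Dirichlet hyperparameters \alpha = (2, 2) are an illustrative assumption, not a value given in the notes.

```python
# Numeric sketch of eq. (16) (MLE) and eq. (20) (MAP mode) on the coin example.
counts = [4, 6]          # n_1 heads, n_2 tails out of N = 10 flips
alpha  = [2.0, 2.0]      # Dirichlet hyperparameters (assumed for illustration)

N, k = sum(counts), len(counts)

theta_mle = [n / N for n in counts]                            # eq. (16)
theta_map = [(n + a - 1) / (N + sum(alpha) - k)                # eq. (20)
             for n, a in zip(counts, alpha)]

print(theta_mle)   # [0.4, 0.6]
print(theta_map)   # [5/12, 7/12] -- pulled toward the uniform prior
```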

2.3 The Bayesian Approach

Keep the posterior distribution P(\theta \mid Data) itself.

3 Predictive Distribution (Statistics) = Inference (Machine Learning)

For MLE and MAP, use the point estimate of \theta to compute the multinomial. For the Bayesian approach, integrate \theta out w.r.t. the posterior. This gives a multivariate Pólya distribution, also known as the Dirichlet compound multinomial distribution. Let \beta_i = \alpha_i + n_i. For future counts m_1, \ldots, m_k with m = \sum_i m_i,

P(m_1 \ldots m_k \mid \beta) = \int P(m_1 \ldots m_k \mid \theta) \, p(\theta \mid \beta) \, d\theta   (21)
  = \frac{m!}{\prod_i (m_i!)} \, \frac{\Gamma(\sum_i \beta_i)}{\Gamma(m + \sum_i \beta_i)} \prod_i \frac{\Gamma(m_i + \beta_i)}{\Gamma(\beta_i)}.   (22)
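Below is a minimal Python sketch that evaluates (22) in log space for numerical stability. The particular numbers (a flat prior \alpha = (1, 1), the coin counts n = (4, 6), and the future counts (2, 1)) are made up purely for illustration.

```python
from math import lgamma, exp

def log_polya(m, beta):
    """Log of eq. (22): Dirichlet compound multinomial probability of future
    counts m_1..m_k, given posterior parameters beta_i = alpha_i + n_i."""
    M = sum(m)
    out = lgamma(M + 1) - sum(lgamma(mi + 1) for mi in m)     # m! / prod(m_i!)
    out += lgamma(sum(beta)) - lgamma(M + sum(beta))          # Gamma(sum beta) / Gamma(m + sum beta)
    out += sum(lgamma(mi + bi) - lgamma(bi) for mi, bi in zip(m, beta))
    return out

# Hypothetical numbers: beta from alpha = (1, 1) and counts n = (4, 6);
# probability of seeing 2 heads and 1 tail in the next 3 flips.
beta = [1 + 4, 1 + 6]
print(exp(log_polya([2, 1], beta)))
```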

4 Bag of Words, tf.idf, and Cosine Similarity

A document d can be represented by a word count vector. The length of the vector is the vocabulary size; the ith element is the number of tokens of the ith word type. This is known as the bag-of-words (BOW) representation, as it ignores word order. A variant is a vector of binary indicators for the presence of each word type.

A document d can also be represented by a tf.idf vector. tf is the normalized term frequency, i.e., the word count vector scaled so that the most frequent word type in the document has value 1:

tf_w = \frac{c(w, d)}{\max_v c(v, d)},

where c(w, d) is the count of word w in d. idf is the inverse document frequency. Let N be the number of documents in the collection, and let n_w be the number of documents in which word type w appears. Then

idf_w = \log \frac{N}{n_w}.

idf is an attempt to handle stopwords or near-stopwords: if a word (like "the") appears in most documents, it is probably not very interesting. Since n_w \approx N, its idf will be close to 0. Finally,

tf.idf_w = tf_w \cdot idf_w.

Given two documents d, q in feature vector representation, one way to define how similar they are is the cosine similarity: q and d form an angle \theta in the feature space, and

sim(d, q) = \cos(\theta)   (23)
  = \frac{d \cdot q}{\|d\| \, \|q\|}   (24)
  = \frac{d \cdot q}{\sqrt{d \cdot d} \, \sqrt{q \cdot q}},   (25)

where the dot product u \cdot v = \sum_i u_i v_i.
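Here is a short Python sketch of the tf.idf weighting and cosine similarity defined above, run on an invented three-document collection; the documents and whitespace tokenization are placeholders for illustration.

```python
import math

# Invented toy collection; in practice the documents come from a real corpus.
docs = ["the cat sat on the mat",
        "the dog sat on the log",
        "cats and dogs"]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

N = len(tokenized)
n_w = {w: sum(1 for doc in tokenized if w in doc) for w in vocab}   # document frequency

def tfidf(doc):
    """tf.idf vector: tf_w = c(w,d)/max_v c(v,d), idf_w = log(N / n_w)."""
    counts = {w: doc.count(w) for w in vocab}
    max_c = max(counts.values())
    return [(counts[w] / max_c) * math.log(N / n_w[w]) for w in vocab]

def cosine(u, v):
    """Eq. (25): dot product over the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

vecs = [tfidf(doc) for doc in tokenized]
print(cosine(vecs[0], vecs[1]))   # doc 1 vs doc 2: share several (low-idf) words
print(cosine(vecs[0], vecs[2]))   # doc 1 vs doc 3: no overlapping word types, so 0.0
```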