Statistical methods for NLP: Estimation


Statistical methods for NLP: Estimation. Richard Johansson, January 29, 2015

why does the teacher care so much about the coin-tossing experiment? because it can model many situations: I pick a word from a corpus: is it armchair or not? is the email spam or legit? do the two annotators agree or not? was the document correctly classified or not?

example: error rates. a document classifier has an error probability of 0.08 and we apply it to 100 documents. what is the probability of making exactly 10 errors? what is the probability of making 5-12 errors?

error rates: solution with Scipy

    import scipy.stats

    # the number of errors in 100 documents is binomially distributed
    true_error_rate = 0.08
    n_docs = 100
    experiment = scipy.stats.binom(n_docs, true_error_rate)

    print('probability of 10 errors:')
    print(experiment.pmf(10))

    print('probability of 5-12 errors:')
    print(experiment.cdf(12) - experiment.cdf(4))

statistical inference: overview. estimate some parameter: what is the estimate of the error rate of my tagger? determine some interval that is very likely to contain the true value of the parameter: 95% confidence interval for the error rate. test some hypothesis about the parameter (not today): is the error rate significantly greater than 0.03? is tagger A significantly better than tagger B?

overview: estimating a parameter; interval estimates

random samples. a random sample x_1, ..., x_n is a list of values generated by some random variable X. for instance, X represents a die, and the sample is [6, 2, 3, 5, 4, 3, 1, 3, 6, 1]. a sample is typically generated by carrying out some repeated experiment. examples: running a PoS tagger on some texts and counting errors; word and sentence lengths in a corpus
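a sample of die rolls like the one above can be simulated with Scipy; a minimal sketch (not from the slides), assuming a fair six-sided die:

    import scipy.stats

    # a fair die: the integers 1..6, each with probability 1/6
    die = scipy.stats.randint(1, 7)

    # draw a random sample of 10 rolls
    sample = die.rvs(size=10)
    print(sample)    # e.g. [6 2 3 5 4 3 1 3 6 1]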

estimating a parameter from a sample. given a sample, how do we estimate some parameter of the random variable that generated the data? often the 'heads' probability p in the binomial, but the parameter could be anything: mean, standard deviation, ... an estimator is a function that looks at a sample and tries to guess the value of the parameter. we call this guess a point estimate: it's a single value; we'll look at intervals later

making estimators. there are many ways to make estimators; here are some of the most important recipes. the maximum likelihood principle: select the parameter value that maximizes the probability of the data. the maximum a posteriori principle: select the most probable parameter value, given the data and my preconceptions about the parameter. Bayesian estimation: model the distribution of the parameter if we know the data, then find the mean of that distribution

maximum likelihood estimates. select the parameter value that maximizes the probability of the sample x_1, ..., x_n. mathematically, define a likelihood function L(p) like this:

    L(p) = P(x_1, ..., x_n | p) = P(x_1 | p) · ... · P(x_n | p)

then find the p_MLE that maximizes L(p). this is a general recipe; now we'll look at a special case

ML estimate of the probability of an event. we carry out an experiment n times, and we get a positive outcome x times. for instance: we classify 50 documents with 7 errors. how do we estimate the probability p of a positive outcome?

estimating the probability. this experiment is a binomial r.v. with parameters n and p. ML estimation: find the p_MLE that makes x most likely, that is, we find the p that maximizes

    L(p) = P(x | p) = C(n, x) · p^x · (1 - p)^(n - x)

where C(n, x) is the binomial coefficient

maximizing the likelihood. we classified 20 documents with 7 errors; what's the MLE of the error rate? [plots: the binomial probabilities of 0-20 errors under p = 0.25 and under p = 0.35]

it can be shown that the value of p that gives us the maximum of L(p) is p_MLE = x / n
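as a quick numerical check (a sketch, not part of the original slides), we can maximize the binomial likelihood over a grid of p values for the example above and compare with x/n:

    import numpy as np
    import scipy.stats

    n, x = 20, 7                        # 20 documents, 7 errors
    grid = np.linspace(0.001, 0.999, 999)

    # likelihood L(p) = P(x | p) for each candidate value of p
    likelihood = scipy.stats.binom.pmf(x, n, grid)

    p_mle = grid[np.argmax(likelihood)]
    print(p_mle, x / n)                 # both close to 0.35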

example: ML estimate of the probability of a word. we observe the words in a corpus of 1,173,766 words and I see the word dog 10 times. assuming a unigram model of language: what is the probability of a randomly selected word being dog?

ML estimate: p_MLE(dog) = 10 / 1,173,766
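in code, such unigram estimates are just relative frequencies; a minimal sketch (the corpus file and tokenization are placeholders, not from the lecture):

    from collections import Counter

    # hypothetical corpus: whitespace-tokenized text read from a file
    tokens = open('corpus.txt').read().split()

    counts = Counter(tokens)
    n_tokens = len(tokens)

    # ML (relative-frequency) estimate of the probability of 'dog'
    p_dog = counts['dog'] / n_tokens
    print(p_dog)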

evaluating performance of NLP systems. when evaluating NLP systems, many performance measures can be interpreted as probabilities. error rate and accuracy: accuracy = P(correct), error rate = P(error). precision and recall: precision = P(true | guess), recall = P(guess | true). we estimate all of them using MLE
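as a tiny illustration (the counts below are hypothetical, not from the lecture), the ML estimates of these measures are just ratios of counts:

    # hypothetical evaluation counts
    n_guessed = 200       # items the system labelled positive
    n_relevant = 250      # items that are actually positive
    n_correct = 160       # items that are both

    precision = n_correct / n_guessed     # estimates P(true | guess)
    recall = n_correct / n_relevant       # estimates P(guess | true)
    print(precision, recall)              # 0.8 0.64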

information retrieval example. there are 1,752 documents about cheese in my collection. I type cheese into the Lucene search engine and it returns a number of documents, out of which 1,432 are about cheese. what's the estimate of the recall of this information retrieval system?

R_MLE = 1432 / 1752 ≈ 0.817

maximum a posteriori estimates. maximum a posteriori (MAP) is a recipe that's an alternative to MLE. the difference: it takes our prior beliefs into account. coins tend to be quite even, so you need to get 'heads' many times if you're going to persuade me!

instead of just the likelihood, we maximize the posterior probability of the parameter:

    P(p | data) ∝ P(p) · P(data | p)
    (posterior ∝ prior · likelihood)

we use the prior to encode what we believe about p. if I make no assumption (uniform prior), then MLE = MAP

MAP with a Dirichlet prior. the Dirichlet distribution is often used as a prior in MAP estimates. assume that we pick n words randomly from a corpus; the words come from a vocabulary of size V; we saw the word armchair x times out of n. with a Dirichlet prior with a concentration parameter α, the MAP estimate is

    p_MAP = (x + (α - 1)) / (n + V·(α - 1))

for instance, with α = 2, we get

    p_MAP = (x + 1) / (n + V)

[plot: the density of the Dirichlet prior over p]
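a minimal sketch of this formula in code (the counts and the vocabulary size are hypothetical); note that with α = 2 the estimate coincides with add-one (Laplace) smoothing:

    # hypothetical counts
    x = 10             # occurrences of 'armchair'
    n = 1_173_766      # corpus size in tokens
    V = 50_000         # vocabulary size
    alpha = 2          # Dirichlet concentration parameter

    p_mle = x / n
    p_map = (x + (alpha - 1)) / (n + V * (alpha - 1))
    print(p_mle, p_map)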

overview: estimating a parameter; interval estimates

interval estimates. if we get some estimate by ML, can we say something about how reliable that estimate is? a confidence interval for the parameter p with confidence level α is an interval [p_low, p_high] so that P(p_low ≤ p ≤ p_high) ≥ α. for instance: with 95% probability, the error rate of the spam filter is in the interval [0.05, 0.08], that is, it is between 0.05 and 0.08

interval estimates: overview. in this course, we will mainly use confidence intervals when we evaluate the performance of some system. we will now see a cookbook method for computing confidence intervals for probability estimates, such as classifier error rate and precision/recall for a classifier or retrieval system. for more complex evaluation metrics, such as part-of-speech tagging accuracy, bracket precision/recall for a phrase structure parser, word error rate in ASR, and BLEU score in machine translation, we'll show a more advanced technique in later lectures

the distribution of our estimator. our ML or MAP estimator applied to randomly selected samples is a random variable with a distribution. this distribution depends on the sample size: a larger sample gives a more concentrated distribution. we will reason about this distribution to show how a confidence interval can be found. [plot: the distribution of the estimate for a sample of size n = 25]

estimator distribution and sample size (p = 0.35): [plots of the estimator's distribution for sample sizes n = 10, 25, 50, and 100]

the distribution of ML estimates of heads probabilities. if the true p value is 0.85 and we toss the coin 25 times, what results do we get? [plot: the distribution of the number of heads out of 25 tosses] 95% of the experiments give a result between 17 and 24... so 95% of the estimates will be between 0.68 and 0.96
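these percentiles can be checked with Scipy (a small sketch, not from the slides), using the percent-point function ppf, the inverse of the CDF:

    import scipy.stats

    n, p_true = 25, 0.85
    rv = scipy.stats.binom(n, p_true)

    # the central 95% range of the number of heads
    x_low, x_high = rv.ppf(0.025), rv.ppf(0.975)
    print(x_low, x_high)          # 17.0 24.0
    print(x_low / n, x_high / n)  # 0.68 0.96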

confidence interval for the ML estimation of a probability. assume we toss the coin n times, with x 'heads'. cookbook recipe for computing an approximate 95% confidence interval [p_low, p_high]: first estimate p = x/n as usual. to compute the lower bound p_low:
1. let X be a binomially distributed r.v. with parameters n, p
2. find x_low = the 2.5% percentile of X
3. p_low = x_low / n
for the upper bound p_high, use the 97.5% percentile instead

in Scipy: assume we got 'heads' x times out of n. recall that we use ppf to get the percentiles!

    import scipy.stats

    # x = number of heads, n = number of tosses (assumed given)
    p_est = x / n
    rv = scipy.stats.binom(n, p_est)

    p_low = rv.ppf(0.025) / n
    p_high = rv.ppf(0.975) / n
    print(p_low, p_high)

example: political polling. I ask 38 randomly selected Gothenburgers whether they support the congestion tax in Gothenburg, and 22 of them say yes. an approximate 95% confidence interval for the popularity of the tax is [0.421, 0.737]:

    import scipy.stats

    number_yes = 22
    total_number = 38

    p_est = number_yes / total_number
    rv = scipy.stats.binom(total_number, p_est)

    p_low = rv.ppf(0.025) / total_number
    p_high = rv.ppf(0.975) / total_number
    print(p_low, p_high)

don't forget your common sense. I ask 14 MLT students whether they support the congestion tax, and 11 of them say yes. will I get a good estimate? in NLP: the WSJ fallacy

next time: exercise in the computer lab. you will estimate some probabilities and compute confidence intervals