More Smoothing, Tuning, and Evaluation

More Smoothing, Tuning, and Evaluation
Nathan Schneider (slides adapted from Henry Thompson, Alex Lascarides, Chris Dyer, Noah Smith, et al.)
ENLP 21 September 2016

Review

Naïve Bayes Classifier
Given a document x with words w_j = [words(x)]_j, return
  ŷ = arg max_{y ∈ C} p(y) · ∏_j p(w_j | y)

Naïve Bayes Learner
Given training documents X with labels Y:
- For each class y: p(y) = (# docs labeled y in Y) / (# docs in X)
- For each class y and word w: p(w | y) = (# docs of class y containing w) / (# docs of class y)
  e.g., p(horrific | y) = (# docs of class y containing horrific) / (# docs of class y)
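
To make the counting concrete, here is a minimal sketch of the document-count ("did this word appear in the document?") Naïve Bayes learner and classifier described on the last two slides. The function names (train_nb, classify_nb) are illustrative, not from the slides; note that the unsmoothed estimates assign zero probability to unseen words, which the next slides address.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels."""
    n_docs = len(docs)
    class_counts = Counter(labels)                    # # docs labeled y
    word_doc_counts = defaultdict(Counter)            # # docs labeled y that contain w
    for tokens, y in zip(docs, labels):
        for w in set(tokens):                         # count each word at most once per doc
            word_doc_counts[y][w] += 1
    prior = {y: count / n_docs for y, count in class_counts.items()}   # p(y)
    return prior, class_counts, word_doc_counts

def classify_nb(tokens, prior, class_counts, word_doc_counts):
    """Return arg max_y p(y) * prod_j p(w_j | y), computed with log probabilities."""
    best_y, best_score = None, float("-inf")
    for y in prior:
        score = math.log(prior[y])
        for w in tokens:
            p_w = word_doc_counts[y][w] / class_counts[y]   # unsmoothed estimate
            if p_w == 0.0:
                score = float("-inf")   # an unseen word zeroes out the whole document
                break
            score += math.log(p_w)
        if best_y is None or score > best_score:
            best_y, best_score = y, score
    return best_y
```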

Smoothing
p(w | y): e.g., p(horrific | y) = (# docs of class y containing horrific) / (# docs of class y)
- What if we encounter the word distraught in a test document, but it has never been seen in training?
- We can't estimate p(distraught | y) for either class: the numerator will be 0.
- Because the word probabilities are multiplied together for each document, the probability of the whole document will be 0.

Smoothing
  p(horrific | y) = (# docs of class y containing horrific + 1) / (# docs of class y + V + 1)
  p(OOV | y) = 1 / (# docs of class y + V + 1)
- Smoothing techniques adjust probabilities to avoid overfitting to the training data.
- Above: Laplace (add-1) smoothing. V is the size of the vocabulary of the training corpus.
- OOV (out-of-vocabulary/unseen) words now have a small probability, which decreases the model's confidence in the prediction without ignoring the other words.
- The probability of each seen word is reduced slightly to save probability mass for unseen words.

New

Smoothing
  p(horrific | y) = (# docs of class y containing horrific + 1) / (# docs of class y + V + 1)
  p(OOV | y) = 1 / (# docs of class y + V + 1)
- V is the size of the vocabulary of the training corpus.
- Laplace (add-1) smoothing, above, uses a pseudo-count of 1, which is kind of arbitrary. For some datasets it's overkill: better to smooth less.
- Lidstone (add-α) smoothing: tune the amount of smoothing on development data:
    p(horrific | y) = (# docs of class y containing horrific + α) / (# docs of class y + α(V + 1))
    p(OOV | y) = α / (# docs of class y + α(V + 1))
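
A small sketch of the add-α estimate above, reusing the per-class count tables from the learner sketch earlier; with alpha=1 it reduces to Laplace smoothing. The helper name is illustrative.

```python
def smoothed_p(word, y, class_counts, word_doc_counts, vocab_size, alpha=1.0):
    """p(word | y) with Lidstone (add-alpha) smoothing; alpha=1 is Laplace.
    Unseen (OOV) words get probability alpha / denom, as on the slide."""
    denom = class_counts[y] + alpha * (vocab_size + 1)
    return (word_doc_counts[y][word] + alpha) / denom
```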

The Nature of Evaluation
- Scientific method rests on making and testing hypotheses.
- Evaluation is just another name for testing.
- Evaluation is not just for public review:
  - It's how you manage internal development
  - And even how systems improve themselves (see ML courses).

What Hypotheses?
About existing linguistic objects:
- Is this text by Shakespeare or Marlowe?
About the output of a language system:
- How well does this language model predict the data?
- How accurate is this segmenter/tagger/parser?
- Is this segmenter/tagger/parser better than that one?
About human beings:
- How reliable is this person's annotation?
- To what extent do these two annotators agree? (IAA)

Gold Standard Evaluation
- In many cases we have a record of "the truth":
  - The best human judgement as to what the correct segmentation/tag/parse/reading is, or what the right documents are in response to a query.
- Gold standards are used both for training and for evaluation.
- But testing must be done on unseen data (held-out test set; train/test split). Don't ever train on data that you'll use in testing!!

Tuning
- Often, in designing a system, you'll want to tune it by trying several configuration options and choosing the one that works best empirically.
- E.g., Lidstone (add-α) smoothing; choosing features for text classification.
- If you run several experiments on the test set, you risk overfitting it; i.e., the test set is no longer a reliable proxy for new data.
- One solution is to hold out a second set for tuning, called a development ("dev") set. Save the test set for the very end.
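
A hypothetical tuning loop over α on a development set, in the spirit of the bullet points above. The helpers train_with_alpha and accuracy_on, and the candidate grid, are assumptions for illustration only.

```python
def tune_alpha(train_data, dev_data, candidates=(0.01, 0.1, 0.5, 1.0, 2.0)):
    """Pick the alpha that maximizes dev-set accuracy; save the test set for the very end."""
    best_alpha, best_acc = None, -1.0
    for alpha in candidates:
        model = train_with_alpha(train_data, alpha)   # assumed: trains a smoothed classifier
        acc = accuracy_on(model, dev_data)            # assumed: returns dev-set accuracy
        if acc > best_acc:
            best_alpha, best_acc = alpha, acc
    return best_alpha
```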

Cross-validation
What if my dataset is too small to have a nice train/test or train/dev/test split?
- k-fold cross-validation: partition the data into k pieces and treat them as mini held-out sets. Each fold is an experiment with a different held-out set, using the rest of the data for training.
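
A minimal sketch of how the k folds can be produced (no shuffling or stratification shown; the function name is illustrative):

```python
def k_fold_splits(data, k=10):
    """Yield (train, heldout) pairs; every item is held out in exactly one fold."""
    fold_size = len(data) // k
    for i in range(k):
        start = i * fold_size
        end = (i + 1) * fold_size if i < k - 1 else len(data)
        heldout = data[start:end]
        train = data[:start] + data[end:]
        yield train, heldout
```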

Measuring a Model's Performance
Accuracy: the proportion the model gets right:
  Accuracy = (# test items the model gets right) / (# items in the test set) × 100
E.g., POS tagging (state of the art 96%).
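
As a one-function illustration of the accuracy formula above (illustrative helper, not from the slides):

```python
def accuracy(predictions, gold):
    """Percentage of test items the model gets right."""
    correct = sum(1 for p, g in zip(predictions, gold) if p == g)
    return 100.0 * correct / len(gold)
```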

How should we evaluate your rhyming script?
Suppose somebody creates a gold standard of <input_word, {rhyming_words}> pairs.
Multiple desired outputs; the system's outputs could overlap only partially. How to evaluate? Accuracy over all words in the dictionary?

Rhymes for hinge
Gold standard: binge, cringe, fringe, hinge, impinge, infringe, klinge, minge, syringe, tinge, twinge, unhinge, vinje
System output: binge, cringe, fringe, hinge, impinge, infringe, syringe, tinge, twinge, unhinge, ainge
- ainge is a False Positive (Type I error): proposed by the system but not in the gold standard.
- klinge, minge, vinje are False Negatives (Type II errors): in the gold standard but missed by the system.
- Correctly predicted words are True Positives; all other words in the dictionary are True Negatives.

              Sys=Y    Sys=N
    Gold=Y     10        3
    Gold=N      1      (large)

Precision & Recall
From the table above:
  Precision = TP / (TP + FP) = 10/11 = 91%
  Recall = TP / (TP + FN) = 10/13 = 77%
  F1 = 2·P·R / (P + R) = 83%
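
The same numbers, computed from the confusion counts in the table (TP=10, FP=1, FN=3); a small illustrative helper:

```python
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp)          # proportion of system answers that are right
    recall = tp / (tp + fn)             # proportion of gold answers the system finds
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(precision_recall_f1(10, 1, 3))    # approx (0.91, 0.77, 0.83), as above
```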

Measuring a Model's Performance
Precision, Recall, F-score
- For isolating performance on a particular label in multi-label tasks, or
- For chunking, phrase structure parsing, or anything where word-by-word accuracy isn't appropriate.
- F1-score: harmonic mean of precision (proportion of the model's answers that are right) and recall (proportion of the test data that the model gets right).
- E.g., for the POS tag NN:
    P = (tokens correctly tagged NN) / (all tokens automatically tagged NN) = TP / (TP + FP)
    R = (tokens correctly tagged NN) / (all tokens gold-tagged NN) = TP / (TP + FN)
    F1 = 2·P·R / (P + R)

Upper Bounds, Lower Bounds?
Suppose your POS tagger has 95% accuracy. Is that good? Bad?
Upper Bound: Turing Test. When using a human Gold Standard, check the agreement of humans against that standard.
Lower Bound: Performance of a simpler model (baseline)
- Model always picks the most frequent class (majority baseline).
- Model assigns a class randomly according to:
  1. An even probability distribution; or
  2. A probability distribution that matches the observed one.
Suitable upper and lower bounds depend on the task.
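
A sketch of the two lower-bound baselines mentioned above (majority class, and sampling from the observed label distribution); function names are illustrative.

```python
import random
from collections import Counter

def majority_baseline(train_labels, test_size):
    """Always predict the most frequent training label."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return [most_common] * test_size

def matched_distribution_baseline(train_labels, test_size):
    """Sample labels in proportion to their observed training frequencies."""
    counts = Counter(train_labels)
    labels, weights = zip(*counts.items())
    return random.choices(labels, weights=weights, k=test_size)
```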

Measurements: What's Significant?
- We'll be measuring things, and comparing measurements.
- What and how we measure depends on the task.
- But all have one issue in common: are the differences we find significant?
- In other words, should we interpret the differences as down to pure chance? Or is something more going on?
- Is our model significantly better than the baseline model? Is it significantly worse than the upper bound?

Example: Tossing a Coin
- I tossed a coin 40 times; it came up heads 17 times.
- The expected value for a fair coin is 20. So we're comparing 17 and 20.
- If this difference is significant, then it's (probably) not a fair coin. If not, it (probably) is.
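
One way to make the coin example precise is an exact two-sided binomial test, sketched below (this particular test is an illustrative choice, not prescribed by the slide): sum the probabilities of all outcomes no more likely than the one observed.

```python
from math import comb

def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial test: total probability of outcomes no more likely than k."""
    pmf = [comb(n, i) * p**i * (1 - p)**(n - i) for i in range(n + 1)]
    observed = pmf[k]
    return sum(prob for prob in pmf if prob <= observed + 1e-12)

print(binomial_two_sided_p(17, 40))   # ~0.43, well above 0.05: no evidence the coin is unfair
```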

Which Significance Test?
- Parametric: when the underlying distribution is normal.
  - t-test, z-test, ...
  - You don't need to know the mathematical formulae; they are available in statistical libraries!
- Non-parametric: otherwise.
  - We usually do need non-parametric tests: remember Zipf's Law!
  - Can use McNemar's test or variants of it.
See Smith (2011, Appendix B) for a detailed discussion of significance testing methods for NLP.
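
As an illustration of McNemar's test in its common chi-squared form (one of several variants; b and c are the counts of items that only one of the two systems gets right):

```python
def mcnemar_statistic(b, c):
    """Chi-squared statistic with continuity correction for a paired classifier comparison.
    b = items only system A gets right, c = items only system B gets right."""
    return (abs(b - c) - 1) ** 2 / (b + c)

# Compare against chi-squared with 1 degree of freedom: values above ~3.84 mean p < 0.05.
print(mcnemar_statistic(25, 10))   # = 5.6 -> the two systems differ significantly at 0.05
```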

Error Analysis
- Summary scores are important, but don't always tell the full picture!
- Once you've built your system, it's always a good idea to dig into its output to identify patterns.
- Quantitative and qualitative (look at some examples!)
- You may find bugs (e.g., predictions are always wrong for words with accented characters).

Confusion Matrices
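
A tiny sketch of how a confusion matrix can be tallied from gold and predicted labels (names illustrative):

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Map (gold_label, predicted_label) pairs to counts."""
    return Counter(zip(gold, predicted))

matrix = confusion_matrix(["NN", "VB", "NN", "DT"], ["NN", "NN", "VB", "DT"])
# e.g. matrix[("VB", "NN")] == 1: one VB token was mistagged as NN
```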

Tasks where there is > 1 right answer
Example: A Paraphrasing Task
- Estimate that "John enjoyed the book" means "John enjoyed reading the book".
- Lots of closely related words to "read" are good too: skim through, go through, peruse, etc.