Applied Natural Language Processing
Info 256, Lecture 7: Testing (Feb 12, 2019)
David Bamman, UC Berkeley

Significance in NLP
You develop a new method for text classification; is it better than what came before? You're developing a new model; should you include feature X (when there is a cost to including it)? You're developing a new model; does feature X reliably predict outcome Y?

Evaluation
A critical part of developing new algorithms and methods, and of demonstrating that they work.

Classification
A mapping h from input data x (drawn from instance space X) to a label (or labels) y from some enumerable output space Y.
X = set of all documents
Y = {english, mandarin, greek, …}
x = a single document
y = ancient greek

[Diagram: instance space X partitioned into train, dev, and test sets]

Experiment design
training: 80% of the data; purpose: training models
development: 10%; purpose: model selection
testing: 10%; purpose: evaluation; never look at it until the very end
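
To make the split concrete, here is a minimal sketch (not lecture code) using scikit-learn's train_test_split; the documents and labels are hypothetical stand-ins. Carving off 20% and then splitting that half-and-half yields the 80/10/10 proportions above.

```python
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for a real dataset of documents and labels.
docs = [f"document {i}" for i in range(1000)]
labels = [i % 2 for i in range(1000)]

# First hold out 20% of the data, then split that portion in half,
# giving an 80/10/10 train/dev/test split.
X_train, X_rest, y_train, y_rest = train_test_split(
    docs, labels, test_size=0.20, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 800 100 100
```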

Metrics
Evaluations presuppose that you have some metric to evaluate the fitness of a model.
Text classification: accuracy, precision, recall, F1
Phrase-structure parsing: PARSEVAL (bracketing overlap)
Dependency parsing: labeled/unlabeled attachment score
Machine translation: BLEU, METEOR
Summarization: ROUGE
Language modeling: perplexity

Metrics
Downstream tasks that use NLP to predict the natural world also have metrics:
Predicting presidential approval rates from tweets.
Predicting the type of job applicants from a job description.
Conversational agents.

Binary classification
                       system: puppy   system: fried chicken
truth: puppy                 6                   3
truth: fried chicken         2                   5
Accuracy = (6 + 5)/16 = 11/16 = 68.75%
https://twitter.com/teenybiscuit/status/705232709220769792/photo/1

Multiclass confusion matrix
                  Predicted (ŷ)
                  Dem   Repub   Indep
True (y)  Dem     100     2      15
          Repub     0   104      30
          Indep    30    40      70

Accuracy
$\text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} I[\hat{y}_i = y_i]$, where $I[x]$ is 1 if x is true and 0 otherwise.
(Computed over the same confusion matrix as above.)

Precision
Precision: the proportion of instances predicted to be a class that actually are that class.
$\text{Precision}(\text{Dem}) = \frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i = \text{Dem})}{\sum_{i=1}^{N} I(\hat{y}_i = \text{Dem})}$

Recall
Recall: the proportion of instances of a true class that are predicted to be that class.
$\text{Recall}(\text{Dem}) = \frac{\sum_{i=1}^{N} I(\hat{y}_i = y_i = \text{Dem})}{\sum_{i=1}^{N} I(y_i = \text{Dem})}$

F score
$F = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$
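
As a worked example (a sketch, not lecture code), the following computes accuracy and per-class precision, recall, and F1 directly from the Dem/Repub/Indep confusion matrix shown above, using only numpy.

```python
import numpy as np

# Confusion matrix from the slides: rows = true class (y),
# columns = predicted class (ŷ); class order: Dem, Repub, Indep.
conf = np.array([[100,   2,  15],
                 [  0, 104,  30],
                 [ 30,  40,  70]])

accuracy = np.trace(conf) / conf.sum()   # 274/391 ≈ 0.701

def precision(c, k):
    # proportion of items predicted as class k that truly are class k
    return c[k, k] / c[:, k].sum()

def recall(c, k):
    # proportion of items truly of class k that are predicted as class k
    return c[k, k] / c[k, :].sum()

def f1(c, k):
    p, r = precision(c, k), recall(c, k)
    return 2 * p * r / (p + r)

print(f"accuracy = {accuracy:.3f}")
for k, name in enumerate(["Dem", "Repub", "Indep"]):
    print(f"{name}: P = {precision(conf, k):.3f}, "
          f"R = {recall(conf, k):.3f}, F1 = {f1(conf, k):.3f}")
```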

Ablation test
To test how important individual features (or components of a model) are, conduct an ablation test:
Train the full model with all features included and evaluate it.
Remove a feature, train the reduced model, and evaluate again.
(A minimal sketch of this loop follows.)
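
A sketch of that loop, where train() and evaluate() are hypothetical stand-ins for your actual training and evaluation code:

```python
import random

# Hypothetical stand-ins: train() returns a "model" and evaluate()
# returns a score; replace both with your real pipeline.
def train(features):
    return frozenset(features)

def evaluate(model):
    rng = random.Random(hash(model))  # dummy score, deterministic per feature set
    return 0.7 + 0.1 * rng.random()

all_features = {"ngrams", "pos_tags", "lexicons"}
full_score = evaluate(train(all_features))

for feature in sorted(all_features):
    reduced_score = evaluate(train(all_features - {feature}))
    # the performance drop attributable to this feature
    print(f"{feature}: {full_score - reduced_score:+.3f}")
```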

Ablation test
[Table: ablation results from Gimpel et al. 2011, "Part-of-Speech Tagging for Twitter"]

Significance
Your work: 58%. Current state of the art: 50%.
If we observe a difference in performance, what's the cause? Is it because one system is better than another, or is it a function of randomness in the data? If we had tested on other data, would we get the same result?

Hypotheses
The average income in two sub-populations is different.
Web design A leads to higher CTR than web design B.
Self-reported location on Twitter is predictive of political preference.
Your system X is better than state-of-the-art system Y.

Null hypothesis
A claim, assumed to be true, that we'd like to test (because we think it's wrong).
hypothesis: The average income in two sub-populations is different / H0: The incomes are the same.
hypothesis: Web design A leads to higher CTR than web design B / H0: The CTRs are the same.
hypothesis: Self-reported location on Twitter is predictive of political preference / H0: Location has no relationship with political preference.
hypothesis: Your system X is better than state-of-the-art system Y / H0: There is no difference between the two systems.

Hypothesis testing
If the null hypothesis were true, how likely is it that you'd see the data you see?

Hypothesis testing
Hypothesis testing measures our confidence in what we can say about the null hypothesis from a sample.

Hypothesis testing
Current state of the art = 50%; your model = 58%. Both are evaluated on the same test set of 1,000 data points.
Null hypothesis: there is no difference, so we would expect your model to get 500 of the 1,000 data points right.
If we make parametric assumptions, we can model this with a binomial distribution (the number of successes in n trials).
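
Under these assumptions, an exact binomial test asks how surprising 580 correct answers out of 1,000 would be if the true accuracy were 0.5. A sketch using scipy (binomtest requires scipy >= 1.7; not lecture code):

```python
from scipy.stats import binomtest

n, p0 = 1000, 0.5   # test-set size and the null-hypothesis accuracy

# Two-sided exact test: 580 correct out of 1000 under Binomial(1000, 0.5)
print(binomtest(580, n, p0).pvalue)   # vanishingly small (~5e-7): reject the null

# By contrast, 510 correct is entirely unsurprising under the null
print(binomtest(510, n, p0).pvalue)   # ≈ 0.55: keep the null
```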

Example
[Figure: binomial probability distribution for the number of correct predictions, with n = 1000 and p = 0.5]

Example
At what point is a sample statistic unusual enough to reject the null hypothesis?
[Figure: the same binomial distribution, with 510 and 580 correct predictions marked]

Example
The form we assume for the null hypothesis lets us quantify that level of surprise. We can do this for many parametric forms that allow us to measure P(X ≥ x) for some sample of size n; for large n, we can often make a normal approximation.

Z score
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
For normal distributions, transform into the standard normal (mean = 0, standard deviation = 1).

Z score
[Figure: standard normal density; 510 correct corresponds to a z score of 0.63, 580 correct to a z score of 5.06]
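
Those z scores follow from the normal approximation to the Binomial(1000, 0.5) null, with mean np = 500 and standard deviation √(np(1−p)) ≈ 15.81. A quick sketch to verify the slide's numbers:

```python
import math

n, p0 = 1000, 0.5
mu = n * p0                          # 500
sd = math.sqrt(n * p0 * (1 - p0))    # ≈ 15.81

for correct in (510, 580):
    z = (correct - mu) / sd
    print(f"{correct} correct -> z = {z:.2f}")   # 0.63 and 5.06
```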

Tests
We will define "unusual" to mean the most extreme areas in the tails of the distribution.

least likely 10%
[Figure: the most extreme 10% of the standard normal lies beyond z = ±1.65]

least likely 5%
[Figure: the most extreme 5% lies beyond z = ±1.96]

least likely 1%
[Figure: the most extreme 1% lies beyond z = ±2.58]

Tests
[Figure: standard normal density again; 510 correct = z score 0.63, 580 correct = z score 5.06]

Tests
Decide on the level of significance α, commonly 0.05 or 0.01. Testing evaluates whether the sample statistic falls in the rejection region defined by α.

Tails
Two-tailed tests measure whether the observed statistic is different (in either direction); one-tailed tests measure a difference in a specific direction. The tests differ only in where the rejection region is located; α = 0.05 for all.
[Figure: rejection regions for a two-tailed test, a lower-tailed test, and an upper-tailed test]

p values
A p value is the probability of observing a statistic at least as extreme as the one we did if the null hypothesis were true.
Two-tailed test: $\text{p-value}(z) = 2 \cdot P(Z \ge |z|)$
Lower-tailed test: $\text{p-value}(z) = P(Z \le z)$
Upper-tailed test: $\text{p-value}(z) = 1 - P(Z \le z)$
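
These three definitions translate directly into code. A sketch using scipy's standard normal, where norm.sf is the survival function, 1 − CDF (not lecture code):

```python
from scipy.stats import norm

def p_value(z, tail="two"):
    if tail == "two":
        return 2 * norm.sf(abs(z))   # 2 * P(Z >= |z|)
    if tail == "lower":
        return norm.cdf(z)           # P(Z <= z)
    return norm.sf(z)                # 1 - P(Z <= z)

print(p_value(0.63))   # ≈ 0.53: 510 correct is unsurprising under the null
print(p_value(5.06))   # ≈ 4e-7: 580 correct is very surprising
```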

Errors
                          test result: keep null    test result: reject null
truth: keep null          correct                   Type I error (α)
truth: reject null        Type II error (β)         correct (power)

Errors
Type I error: we reject the null hypothesis, but we shouldn't have.
Type II error: we don't reject the null, but we should have.

1 "jobs" is predictive of presidential approval rating
2 "job" is predictive of presidential approval rating
3 "war" is predictive of presidential approval rating
4 "car" is predictive of presidential approval rating
5 "the" is predictive of presidential approval rating
6 "star" is predictive of presidential approval rating
7 "book" is predictive of presidential approval rating
8 "still" is predictive of presidential approval rating
9 "glass" is predictive of presidential approval rating
…
1,000 "bottle" is predictive of presidential approval rating

Errors
For any significance level α and n hypothesis tests, we can expect α·n type I errors.
With α = 0.01 and n = 1000, that is 10 significant results simply by chance.

Multiple hypothesis corrections
Bonferroni correction: for family-wise significance level α0 with n hypothesis tests, use $\alpha = \frac{\alpha_0}{n}$ (see the sketch below). [Very strict; controls the probability of at least one type I error.]
False discovery rate
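
A minimal sketch of the Bonferroni threshold in practice; the per-test p-values here are hypothetical:

```python
alpha0 = 0.05     # family-wise significance level
n_tests = 1000    # number of hypothesis tests
threshold = alpha0 / n_tests   # each individual test must clear 5e-5

p_values = [0.04, 1e-4, 0.2, 1e-6]   # hypothetical per-test p-values
significant = [p for p in p_values if p < threshold]
print(significant)   # only 1e-6 survives the correction
```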

Confidence intervals
Even in the absence of a specific test, we want to quantify our uncertainty about any metric. Confidence intervals specify a range that is likely to contain the (unobserved) population value, based on a measurement in a sample.

Confidence intervals
Binomial confidence intervals (again using the normal approximation):
$p \pm z_\alpha \sqrt{\frac{p(1-p)}{n}}$
p = rate of success (e.g., for binary classification, the accuracy)
n = the sample size (e.g., number of data points in the test set)
zα = the critical value at significance level α
95% confidence interval: α = 0.05; zα = 1.96
99% confidence interval: α = 0.01; zα = 2.58
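
Plugging the running example into that formula (a sketch, assuming 580 correct out of 1,000 and a 95% interval):

```python
import math

p = 0.58         # observed accuracy (580 correct out of 1000)
n = 1000         # test-set size
z_alpha = 1.96   # critical value for a 95% interval

half_width = z_alpha * math.sqrt(p * (1 - p) / n)
print(f"95% CI: [{p - half_width:.3f}, {p + half_width:.3f}]")   # ≈ [0.549, 0.611]
```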

Issues
Evaluation performance may not hold across domains (e.g., WSJ vs. literary texts).
Covariates may explain performance (MT/parsing: sentences up to length n).
Multiple metrics may offer competing results.
Søgaard et al. 2014

[Figure: accuracy drops when models trained on one domain are evaluated on another]
English POS: WSJ 97.0 vs. Shakespeare 81.9
German POS: Modern 97.0 vs. Early Modern 69.6
English POS: WSJ 97.3 vs. Middle English 56.2
Italian POS: News 97.0 vs. Dante 75.0
English POS: WSJ 97.3 vs. Twitter 73.7
English NER: CoNLL 89.0 vs. Twitter 41.0
Phrase structure parsing: WSJ 89.5 vs. GENIA 79.3
Dependency parsing: WSJ 88.2 vs. Patent 79.6
Dependency parsing: WSJ 86.9 vs. Magazines 77.1

Takeaways
At a minimum, always evaluate a method on the domain you're using it on.
When comparing the performance of models, quantify your uncertainty with significance tests/confidence bounds.
Use ablation tests to identify the impact that a feature class has on performance.

Ethics
Why does a discussion about ethics need to be a part of NLP?

Conversational Agents

Question Answering http://searchengineland.com/according-google-barack-obama-king-united-states-209733

Language Modeling

Vector semantics

The decisions we make about our methods (training data, algorithm, evaluation) are often tied up with their use and impact in the world.

Scope
[Figure: dependency parse of "I saw the man with the telescope", with two possible prep attachments]
NLP often operates on text divorced from the context in which it is uttered. It's now being used more and more to reason about human behavior.

Privacy

Interventions

Exclusion
Focus on data from one domain/demographic: state-of-the-art models perform worse for young people (Hovy and Søgaard 2015) and minorities (Blodgett et al. 2016).

Exclusion
Language identification; dependency parsing.
Blodgett et al. (2016), "Demographic Dialectal Variation in Social Media: A Case Study of African-American English" (EMNLP)

Overgeneralization
Managing and communicating the uncertainty of our predictions. Is a false answer worse than no answer?

Dual Use
Authorship attribution (author of the Federalist Papers vs. author of a ransom note vs. author of political dissent)
Fake review detection vs. fake review generation
Censorship evasion vs. enabling more robust censorship