Instructions for NLP Practical (Units of Assessment) SVM-based Sentiment Detection of Reviews (Part 2)

Simone Teufel (Lead demonstrator: Guy Aglionby)
sht25@cl.cam.ac.uk; ga384@cl.cam.ac.uk

This is the second part of the NLP practical, where we will use a better document representation created by doc2vec, give you some ideas for how to perform error analysis and interpretation of the results, and introduce a more powerful statistical test. You will write up these results in Report 2, with a new word limit of 1000 words (changed from the previous information I gave you). This part of the practical is based on a practical designed by Helen Yannakoudakis.

In the first part of the practical (separate document!) you coded two baseline systems that operate with bags of words. In particular, you should have the following by now:

- code for the NB classifier
- code for feature treatment (bigrams etc.)
- code for Porter stemming
- code for performing n-fold cross-validation
- code for performing the sign test (with ties treatment as described)
- an installation of SVMlight (or some other SVM classifier)

1 Validation Corpus

We will now introduce the use of a validation corpus. This was mentioned in MLRD, but you have not done such experiments yet. The purpose of the validation corpus is to allow you to set (i.e., learn/try out) any parameters required by your ML algorithm, without danger of overfitting. The rule is that no part of the validation corpus can be used for training or testing. Even just setting a feature frequency cutoff for the SVM BOW system requires the use of a validation corpus (I really should have told you last time already to use one). Practically, please designate 10% of the data (the first fold in stratified Round-Robin cross-validation) for this purpose.

The standard way to use the validation corpus is with a 10-10-80 split: you set your parameters by training on the 80% training split, choose the best setting by comparing results on the validation split, then discard the validation split and test the best system, i.e., only once, on the test data that has been left alone until then. Because we want comparability of cross-validated results with Pang et al., however, this is not an option. You should therefore do the following (see the sketch below):

- set a parameter to one setting, train on the entire 90%
- measure on the validation corpus
- move on to the next parameter setting and measure again
- repeat until you know which parameter setting worked best on the validation corpus; set your final system to this.
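To make the protocol concrete, here is a minimal Python sketch of the parameter sweep. It assumes hypothetical helper functions (load_folds, train_system, accuracy) that are not part of the practical code; the exact data structures and names are up to you.

```python
# Minimal sketch of hyperparameter selection with a held-out validation fold.
# load_folds(), train_system() and accuracy() are hypothetical placeholders.

def select_parameter(candidate_params):
    folds = load_folds(n_folds=10)            # stratified Round-Robin folds
    validation = folds[0]                     # first fold: validation corpus
    training = [d for fold in folds[1:] for d in fold]   # remaining 90%

    best_param, best_score = None, float("-inf")
    for param in candidate_params:
        model = train_system(training, param)   # train on the entire 90%
        score = accuracy(model, validation)     # measure on the validation fold
        if score > best_score:
            best_param, best_score = param, score

    # The models trained here are discarded afterwards; only best_param is kept
    # for the subsequent cross-validation on the 90%.
    return best_param
```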

Figure 1: word2vec embeddings, trained via prediction.

Figure 2: doc2vec, DM architecture

Then discard all trained models; it would not be fair to use these, as they include the test data. But you are now allowed to run a normal cross-validation on the 90% (with slightly smaller splits), using the hyperparameters you set with the validation corpus. You still haven't tested on data you trained on with this method, but arguably some information about the test data has very indirectly sneaked into your parameters. Researchers differ as to whether they consider this overfitting or not, as it is probably a very small effect.

2 Doc2vec for Sentiment Analysis

Mikolov et al. (2013) presented the first approach to use Deep Learning (neural networks with at least one hidden layer) in NLP. They introduced the ideas of skip-grams and word2vec to create a more compact vector space representation whose dimensions no longer correspond to context words and are therefore not directly interpretable. The resulting vectors are often called neural word embeddings. The approach can be used for language modelling (predicting the next word, given a context), for classification, and for many more NLP applications.

Based on word2vec, Le and Mikolov (2014) introduced doc2vec, which is able to learn embeddings for sequences of words. It is agnostic to granularity, meaning that the vectors it creates can represent a sequence of any length: a sentence, a paragraph, or even an entire document. The output of doc2vec is a document embedding, a new type of vector that has been shown to be effective for several tasks, including sentiment analysis.

Quick recap from Lecture 9 on the distributed representation of words and prediction in word2vec (Figure 1): the CBOW model (continuous bag of words) learns to predict the target word, given some context words. There are two possible architectures in doc2vec: the distributed memory (DM) architecture (sometimes referred to as DMPV, distributed memory of paragraph vectors) and the distributed bag of words (DBOW) architecture. The DM architecture is shown in Figure 2 and the DBOW architecture in Figure 3. In the DM architecture, a paragraph token is added to the word2vec setup, and each paragraph is mapped to a unique vector. The paragraph vector now also contributes to the prediction task.

Figure 3: doc2vec, DBOW architecture

The paragraph vector and word vectors are averaged or concatenated to predict the next word in a context. The paragraph vector is shared across all contexts from the same paragraph (but not across paragraphs, whereas the word matrix W is shared across paragraphs), and acts as a memory of context/topic. The learning (constraint satisfaction) optimises the following:

Optimisation objective:

\[ \frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k}) \]

Softmax output layer:

\[ p(w_t \mid w_{t-k}, \ldots, w_{t+k}) = \frac{\exp(y_{w_t})}{\sum_i \exp(y_i)}, \qquad y = b + U\, h(w_{t-k}, \ldots, w_{t+k}; W) \]

(Images and formulas are from Le and Mikolov (2014), though note that there are inaccuracies: 1. the figure is confusing, as it does not take context from both sides into account; 2. the formula should not include the target word in the conditional; 3. in the U matrix, rows are words (our vocabulary for the softmax) and columns are features; h is a column vector, and b and y are column vectors of vocabulary size, so each cell of y is a per-word score indicating how likely that word is, before conversion to a probability.)

In the DBOW model, alternatively, paragraph vectors are trained to predict words in a window (no word order), similar to the Skip-gram model.

The relevant level of granularity for us is the document (review). Train various doc2vec models using the IMDB movie review database to learn document-level embeddings. During training, word vectors, weights and paragraph vectors (for seen paragraphs) are learned. At test time, paragraph vectors are inferred by gradient descent while everything else (word vectors, weights) is kept fixed. The paragraph vectors obtained this way can be used directly as a document representation for supervised ML (here, the SVM classifier). This is the database of 100,000 movie reviews you should use for training doc2vec: http://ai.stanford.edu/~amaas/data/sentiment/

For our task, you should use the gensim Python doc2vec library. There are a number of parameters in doc2vec that can be set (a training sketch follows below):

- training algorithm (dm, dbow)
- the size of the feature vectors (e.g., 100 dimensions is good enough for us)
- number of iterations / epochs (e.g., 10 or 20)
- context window
- hierarchical softmax (faster version)
- ...
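As a rough illustration of how these parameters map onto the gensim API, here is a minimal sketch (not a definitive recipe): train_texts, review_texts, review_labels, train_idx and test_idx are hypothetical placeholders for your own data loading and fold assignment, parameter names are those of recent gensim versions, and scikit-learn's LinearSVC stands in for SVMlight (you could instead write the inferred vectors out in SVMlight's input format).

```python
# Illustrative sketch: train a doc2vec model with gensim and feed the inferred
# paragraph vectors to an SVM. Data loading and tokenisation are placeholders.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.svm import LinearSVC

# train_texts: list of token lists from the 100,000-review IMDB dump
# review_texts / review_labels: the review corpus used for the SVM experiments
corpus = [TaggedDocument(words=toks, tags=[i]) for i, toks in enumerate(train_texts)]

model = Doc2Vec(dm=1,             # 1 = DM architecture, 0 = DBOW
                vector_size=100,  # dimensionality of the document embeddings
                window=5,         # context window
                hs=1,             # hierarchical softmax
                min_count=2,
                epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# At test time the paragraph vector of an unseen review is inferred while word
# vectors and weights stay fixed.
features = [model.infer_vector(toks) for toks in review_texts]

# train_idx / test_idx come from your own cross-validation fold assignment.
clf = LinearSVC().fit([features[i] for i in train_idx],
                      [review_labels[i] for i in train_idx])
predictions = clf.predict([features[i] for i in test_idx])
```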

Please familiarise yourself with the packages, train different doc2vec models, and implement a word embedding-based SVM classifier. Ideas for training different models include choosing the training algorithm, the way the context word vectors are combined, and the dimensionality of the resulting feature vectors. Observe the relative performance with respect to the traditional, flat BOW representation.

3 A more powerful statistical test: Permutation test

From now on, you should use the permutation test for paired samples (we always have paired samples in this setting, as two systems are run on identical data). Like any paired test, the permutation test tests whether the population mean of the two conditions is different (H_1) or the same (H_0). Like the sign test, the permutation test is non-parametric, which means that it makes no assumptions about the distribution of your underlying data. You can also use the permutation test on real-valued results (here, our results are just binary). There is nothing wrong with the sign test, but it does not have high power; the permutation test has higher power than the sign test.

For an explanation of what is meant by the power of a statistical test, consider the two values α and β:

\[ \alpha = P(\text{Type I Error}) = P(\text{Reject } H_0 \mid H_0 \text{ is true}) \]

α is the probability of a false positive (the significance level), i.e., the situation where the test declares a difference where there is really none.

\[ \beta = P(\text{Type II Error}) = P(\text{Do not reject } H_0 \mid H_1 \text{ is true}) \]

β in turn is the probability of a false negative, i.e., when a test declares no difference where there really is one. 1 - β is called the power of the test. The permutation test's probability of not discovering a real difference is therefore lower than the sign test's. This is a good property, because it also means that the test is more economical with data.

The core assumption of the permutation test is that if the measured difference d in mean M between systems A and B is a fluke, i.e., if there is no real difference between them (because they come from one and the same distribution), it should not matter how often we randomly swap the two results. Say we have n paired results of systems A and B. There are 2^n ways of flipping or not flipping the n pairs of results, i.e., 2^n permutations (hence the name of the test). These permutations operate row-wise. In other words, we create resamplings in the following way: for each paired observation in the original runs, a_i and b_i, a coin is flipped. If it comes up heads (call this "1"), then swap the score b_i with a_i; otherwise, leave the pair unchanged. Now calculate the means M of the two systems under this permutation and their difference, and do this for all 2^n permutations. Count how many of the 2^n differences are at least as high as the difference in the unpermuted version; call this number s.

There is a final twist: if, due to a high n, you cannot do all exponentially many resamplings, you should test a large enough random subset of cardinality R. This version of the test is called the Monte Carlo Permutation Test. If R = 2^n, the test is referred to as the exact permutation test. The probability of observing the difference between the original runs by chance is then approximated by:

\[ p = \frac{s+1}{R+1} \tag{1} \]

Figure 4 shows a simple example of the permutation test with real-valued results (example due to Yiannos Stathopolous; the values are MAPs, mean average precisions, where each item is a query in an Information Retrieval (IR) setting). As there are 6 items, exhaustive resampling would mean 64 permutations (one of these is shown in the table). Out of these, only 2 are equal to or larger than the observed real difference in MAP, 0.178. The resulting p-value = (2 + 1)/(64 + 1) = 0.0462 means that we can reject the null hypothesis at significance level α = 0.05.
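Below is a minimal Python sketch of the Monte Carlo version described above, assuming the two systems' outputs are given as equal-length lists of per-item scores (binary 0/1 correctness values in our setting); variable names are illustrative.

```python
import random

def monte_carlo_permutation_test(scores_a, scores_b, R=5000):
    """Monte Carlo permutation test for paired samples.

    scores_a, scores_b: equal-length lists of per-item scores
    (e.g. 0/1 correctness of systems A and B on the same test items).
    Returns the approximated p-value (s + 1) / (R + 1).
    """
    n = len(scores_a)
    observed = abs(sum(scores_a) / n - sum(scores_b) / n)

    s = 0
    for _ in range(R):
        perm_a, perm_b = [], []
        for a, b in zip(scores_a, scores_b):
            if random.random() < 0.5:      # coin flip: swap the pair
                a, b = b, a
            perm_a.append(a)
            perm_b.append(b)
        diff = abs(sum(perm_a) / n - sum(perm_b) / n)
        if diff >= observed:               # at least as extreme as observed
            s += 1

    return (s + 1) / (R + 1)

# Example usage (hypothetical result lists):
# p = monte_carlo_permutation_test(results_system_a, results_system_b, R=5000)
# if p < 0.05, the difference is significant at the 5% level
```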

                         Original                  One permutation
                System A    System B    Coin Toss    Permuted A    Permuted B
Item 1            0.01        0.1           1           0.1           0.01
Item 2            0.03        0.15          0           0.03          0.15
Item 3            0.05        0.2           0           0.05          0.2
Item 4            0.01        0.08          1           0.08          0.01
Item 5            0.04        0.3           0           0.04          0.3
Item 6            0.02        0.4           1           0.4           0.02
Observed MAP      0.0267      0.205                     0.117         0.115
Absolute Observed
Difference              0.178                                 0.0017

Figure 4: Example: one out of 2^6 possible permutations: 100101.

For this practical, please implement the Monte Carlo Permutation test and use it from now on for all statistical testing where it is applicable. You can compare its behaviour with the sign test, but you should not find a case where the sign test declares a difference and the permutation test does not. If they contradict each other, you should always believe the permutation test, because it is the stronger test. Use R = 5000.

4 Experimentation

The goal of this practical is to perform good science. This means that getting high numerical results isn't everything. You should aim for:

- an interesting research question
- sound methodology
- insightful analysis (something non-obvious)

Because doc2vec's behaviour is mathematically too complex to trace in detail, and because the representations it produces are not directly interpretable, you might want to find out what the model is really doing. You can perform some targeted/selected experimentation along the following lines:

- Perform an error analysis: which documents does the system get wrong most catastrophically (i.e., despite high confidence)? Can you see patterns in the errors? What do you think the reason for systematic errors is? Which manipulation could improve the system?
- You could also simulate a deployment test with reviews of recent films, either downloaded from IMDB or written by yourself and swapped with a colleague.
- Or you could aim to explain what the model is really doing, like Lau and Baldwin (2016) and Li et al. (2015): are meaningfully similar documents close to each other? (Tip (advanced): there is visualisation software for interpretability, typically taking the form of heat maps.)

5 Some useful resources

Doc2vec:
https://radimrehurek.com/gensim/models/doc2vec.html
https://github.com/rare-technologies/gensim/blob/develop/docs/notebooks/doc2vec-imdb.ipynb
https://github.com/jhlau/doc2vec

Scikit-learn:
http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

TensorFlow:
https://www.tensorflow.org/programmers_guide/embedding
http://projector.tensorflow.org/

t-SNE:
https://lvdmaaten.github.io/tsne/

MALLET:
http://mallet.cs.umass.edu/topics.php

6 Some relevant papers

Lau, J. H. and Baldwin, T. (2016). An empirical evaluation of doc2vec with practical insights into document embedding generation. In Proceedings of the 1st Workshop on Representation Learning for NLP.

Dai, A. M., Olah, C., and Le, Q. V. (2015). Document embedding with paragraph vectors. arXiv preprint arXiv:1507.07998.

Li, J., Chen, X., Hovy, E., and Jurafsky, D. (2015). Visualizing and understanding neural models in NLP. In Proceedings of NAACL.

7 How to write this up

Please refer to the far more detailed slides on this topic on the website. But a few hints right away:

- Pretend this is not a class assignment but your own idea/research question.
- Direct it not at the instructor (me), but at a general reader who has no knowledge of what you have done beyond what you tell them.
- Assume some knowledge of computational linguistics background (but if you use real technical terminology, define it first).
- Don't explain obvious things (or you risk looking naive), but give enough detail for an expert.
- After describing your data, dataset handling and methodology, describe your numerical results (clearly separated from methods).
- Only then do you analyse your numerical results: what is a source of errors? How interpretable is the doc2vec space?