A Bayesian Model of Diachronic Meaning Change

A Bayesian Model of Diachronic Meaning Change Lea Frermann and Mirella Lapata Institute for Language, Cognition, and Computation School of Informatics The University of Edinburgh lea@frermann.de www.frermann.de ACL, August 09, 2016 1 / 10

The Dynamic Nature of Meaning
Language is a dynamic system, constantly shaped by users and their environment.
Meaning changes smoothly (in written language, across societies).
2 / 10

Motivation
Can we understand, model, and predict change?
- aid historical sociolinguistic research
- improve historical text mining and information retrieval
Can we build task-agnostic models?
- learn time-specific meaning representations which are interpretable and useful across tasks
3 / 10

Scan: A Dynamic Model of Sense change

Model Assumptions
- target word (e.g., mouse)
- target word-specific corpus of (year, text snippet) pairs:
  1749  fortitude time woman shrieks mouse rat capable poisoning husband
  1915  rabbit lived hole small grey mouse made nest pocket coat
  1993  moved fire messages click computer mouse communications appear electronic bulletin
  2009  scooted chair clicking button wireless mouse hibernate computer stealthy exit
- number of word senses (K)
- granularity of temporal intervals (∆T), e.g., a year, decade, or century
4 / 10

Model Overview
A Bayesian and knowledge-lean model of meaning change of individual words (e.g., mouse).
5 / 10

Model Description: Generative Story
[Plate diagram: hyperparameters a, b and κψ; κφ governs the chain of sense distributions φ^(t-1), φ^t, φ^(t+1); κψ governs the chain of sense-word distributions ψ^(t-1), ψ^t, ψ^(t+1); each of the D^t documents per time slice has a sense variable z and I context words w, for each of K senses]
1. Extent of meaning change: generate the temporal sense flexibility parameter κφ ~ Gamma(a, b)
2. Time-specific representations: generate sense distributions φ^t and sense-word distributions ψ^(k,t)
3. Document generation given time t: generate a sense z ~ Mult(φ^t), then generate context words w_i ~ Mult(ψ^(t,k=z))
6 / 10
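To make the three steps concrete, here is a minimal numpy sketch of this generative story for one target word; the array shapes, the fixed κψ, and the cumulative-sum random walk standing in for the logistic-normal iGMRF prior are illustrative assumptions, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(0)
V, K, T, I = 1000, 8, 20, 10   # vocabulary, senses, time slices, context words
a, b = 7.0, 3.0                # Gamma hyperparameters for kappa_phi

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# 1. Extent of meaning change: kappa_phi ~ Gamma(a, b)  (shape a, rate b)
kappa_phi = rng.gamma(a, 1.0 / b)
kappa_psi = 10.0               # held fixed here for simplicity (an assumption)

# 2. Time-specific representations: Gaussian random walks over unnormalised
#    parameters, mapped to distributions (a stand-in for the iGMRF prior)
phi = softmax(np.cumsum(rng.normal(0.0, kappa_phi ** -0.5, (T, K)), axis=0))
psi = softmax(np.cumsum(rng.normal(0.0, kappa_psi ** -0.5, (T, K, V)), axis=0))

# 3. Document generation given time slice t: draw a sense, then its context words
def generate_snippet(t):
    z = rng.choice(K, p=phi[t])                 # z ~ Mult(phi^t)
    w = rng.choice(V, size=I, p=psi[t, z])      # w_i ~ Mult(psi^{t, k=z})
    return z, w

print(generate_snippet(5))
```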

Scan: The Prior
First-order random walk model: intrinsic Gaussian Markov Random Field (Rue, 2005; Mimno, 2009)
[Chain diagram: φ^1, ..., φ^(t-1), φ^t, φ^(t+1), ..., φ^T]
Local changes are drawn from a normal distribution whose mean is given by the temporally neighboring parameters and whose variance is controlled by the meaning flexibility parameter κφ.
7 / 10
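Written out, the first-order random-walk (iGMRF) conditionals this describes take the standard form below (following Rue and Held, 2005); this is a reconstruction from the slide's description, not an equation copied from the paper.

```latex
% Full conditionals of the unnormalised sense parameters \phi^t_k under a
% first-order random-walk iGMRF with precision (flexibility) \kappa_\phi:
% interior time slices are centred on the mean of their two neighbours,
p\left(\phi^t_k \mid \phi^{-t}_k, \kappa_\phi\right)
  = \mathcal{N}\!\left(\tfrac{1}{2}\left(\phi^{t-1}_k + \phi^{t+1}_k\right),\ \tfrac{1}{2\kappa_\phi}\right),
  \quad 1 < t < T,
% and boundary slices are tied to their single neighbour:
p\left(\phi^1_k \mid \phi^{2}_k, \kappa_\phi\right)
  = \mathcal{N}\!\left(\phi^{2}_k,\ \kappa_\phi^{-1}\right).
```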

Learning Blocked Gibbs sampling Details in the paper... 8 / 10

Related Work

Related work
Word meaning change: Gulordava (2011), Popescu (2013), Kim (2014), Kulkarni (2015)
- word-level meaning; two time intervals; representations are independent; knowledge-lean
Graph-based tracking of word sense change: Mitra (2014, 2015)
- sense-level meaning; multiple time intervals; representations are independent; knowledge-heavy
9 / 10

Evaluation

Evaluation: Overview
No gold standard test set or benchmark corpora; small-scale evaluation with hand-picked test examples.
DATE: Diachronic Text Corpus (years 1710-2010)
1. the Coha Corpus (Davies, 2010)
2. the SemEval DTE Task training data (Popescu, 2015)
3. parts of the CLMET3.0 corpus (Diller, 2011)
10 / 10

Evaluation: Overview
We evaluate on various previously proposed tasks and metrics:
1. qualitative evaluation
2. perceived word novelty (Gulordava, 2011)
3. temporal text classification, SemEval DTE (Popescu, 2015)
4. usefulness of temporal dynamics
5. novel word sense detection (Mitra, 2014)
10 / 10

1. Qualitative Evaluation
[Figure: meaning of 'power' across the years 1700-1980; representative word lists shown: 'love power life time woman heart god tell little day'; 'mind power time life friend woman nature love world reason'; 'power country government nation war increase world political people europe'; 'power time company water force line electric plant day run']
11 / 10

1. Qualitative Evaluation
[Figure: sense-word probabilities p(w | k, t) for 'power' across time slices from 1780 to 2010; prominent words include water, company, power, force, plant, nuclear, electric, steam, line, utility, run, day, time]
11 / 10

2. Human-perceived Word Meaning Change (Gulordava, 2011)
Task: rank 100 target words (e.g., baseball, network) by meaning change: how much did 'network' change between the 1960s and the 1990s? Ratings on a 4-point scale (0: no change, ..., 3: significant change).
Gulordava (2011)'s system:
- compute word vectors from time-specific corpora (shared space): w_1960, w_1990
- compute cosine(w_1960, w_1990)
- rank words by cosine: greater angle, greater meaning change
12 / 10
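A small sketch of this cosine-based ranking; get_vector and the word list are hypothetical placeholders for time-specific vectors living in one shared space.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_by_change(words, get_vector):
    """get_vector(word, period) -> np.ndarray; vectors share one space."""
    sims = {w: cosine(get_vector(w, "1960s"), get_vector(w, "1990s")) for w in words}
    # lower cosine (greater angle) = greater meaning change, so sort ascending
    return sorted(sims, key=sims.get)
```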

2. Human-perceived Word Meaning Change (Gulordava, 2011)
[Bar chart: Spearman's ρ (y-axis 0.2-0.4) against the human ranking for Scan, Scan-not, and Gulordava]
12 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Task: predict the time frame of origin of a given text snippet, e.g. 'President de Gaulle favors an independent European nuclear striking force [...]' (1962)
Prediction granularity:
- fine: 2-year intervals {1700-1702, ..., 1961-1963, ..., 2012-2014}
- medium: 6-year intervals {1699-1706, ..., 1959-1965, ..., 2008-2014}
- coarse: 12-year intervals {1696-1708, ..., 1956-1968, ..., 2008-2020}
13 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Scan temporal word representations: 883 nouns and verbs from the DTE development dataset; T = 5 years; K = 8 senses. Predict the time of a test snippet using the Scan representations.
14 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015) 0.8 accuracy 0.6 0.4 0.2 fine medium coarse Random baseline Scan-not Scan IXA AMBRA accuracy: precision measure discounted by distance from true time 15 / 10

Conclusions
A dynamic Bayesian model of diachronic meaning change:
- sense-level meaning change
- arbitrary time spans and intervals
- knowledge-lean
- explicit model of smooth temporal dynamics
Future Work:
- learn the number of word senses (non-parametric)
- model short-term opinion change from Twitter data
16 / 10

Thank you! lea@frermann.de www.frermann.de 16 / 10


Learning
Blocked Gibbs sampling with three components:
- Block 1: document sense assignments {z}_D
- Block 2: time-specific sense prevalence parameters {φ}_T and time- and sense-specific word parameters {ψ}_{T×K}
- Block 3: degree of temporal sense flexibility κφ
17 / 10
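As an illustration of Block 1, here is a minimal numpy sketch of resampling one document's sense assignment given the current φ and ψ; the shapes and names are assumptions, not the authors' implementation.

```python
import numpy as np

def resample_sense(word_ids, t, phi, psi, rng):
    """word_ids: context-word indices of one snippet; t: its time slice;
    phi: (T, K) sense distributions; psi: (T, K, V) sense-word distributions."""
    # log p(z = k | w, t) up to a constant: log phi_t[k] + sum_i log psi_{t,k}[w_i]
    log_post = np.log(phi[t]) + np.log(psi[t][:, word_ids]).sum(axis=1)  # (K,)
    post = np.exp(log_post - log_post.max())
    return rng.choice(len(post), p=post / post.sum())
```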

Learning
Block 2: word and sense parameters {φ}_T and {ψ}_{T×K}
The Logistic Normal is not conjugate to the Multinomial: ugly math!
Auxiliary variable method (Mimno et al., 2008): resample each φ^t_k (and ψ^t_{k,w}) from a weighted, bounded area.
17 / 10

DATE: Diachronic Text Corpus
1. The Coha Corpus (Davies, 2010): a large collection of text from various genres; years 1810-2009; 142,587,656 words
2. The SemEval DTE Task training data (Popescu, 2015): news text snippets; years 1700-2010; 124,771 words
3. Parts of the CLMET3.0 corpus (Diller, 2011): texts of various genres from open online archives; we use years 1710-1810; 4,531,505 words
18 / 10

2. Capturing Perceived Word Novelty (Gulordava, 2011)
Task: given a word, predict its novelty in a focus time (1990s) compared to a reference time (1960s).
A gold test set of 100 target words: how much did w's meaning change between the 1960s and 1990s? Ratings on a 4-point scale [0 = no change, ..., 3 = changed significantly].
Examples: orange 0, crisis 2, net 3, sleep 0, virus 2, program 3
19 / 10

2. Capturing Perceived Word Novelty (Gulordava, 2011)
Gulordava et al.'s system:
- vector space model; data: the Google Books bigram corpus
- compute a novelty score based on similarity of word vectors; low similarity = significant change
Scan:
- data: Date subcorpus covering 1960-1999; T = 10, K = 8
- we measure word novelty using the relevance score (Cook, 2014)
- compute sense novelty based on time-specific keyword probabilities (Kilgarriff, 2000)
- word novelty = max sense novelty
20 / 10

2. Capturing Perceived Word Novelty (Gulordava, 2011)
Performance:
  system              corpus   Spearman's ρ
  Gulordava (2011)    Google   0.386
  Scan                Date     0.377
  Scan-not            Date     0.255
  frequency baseline  Date     0.325
Scan predictions, most novel words with their most novel sense (1960s vs 1990s): environmental, users, virtual, disk; top sense words include supra, note, law, protection, id, agency, impact, policy, factor, computer, window, information, software, system, wireless, web, reality, center, experience, week, community, hard, drive, program, file, store, ram, business
21 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Task: predict the time frame of origin of a given text snippet.
Subtask 1, explicit cues: 'President de Gaulle favors an independent European nuclear striking force [...]' (1962)
Prediction granularity:
- fine: 2-year intervals {1700-1702, 1703-1705, ..., 1961-1963, ..., 2012-2014}
- medium: 6-year intervals {1699-1706, 1707-1713, ..., 1959-1965, ..., 2008-2014}
- coarse: 12-year intervals {1696-1708, 1709-1721, ..., 1956-1968, ..., 2008-2020}
22 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Task: predict the time frame of origin of a given text snippet.
Subtask 2, implicit (language) cues: 'The local wheat market was not quite so strong to-day as yesterday.' (1891)
Prediction granularity:
- fine: 6-year intervals {1699-1705, 1706-1712, ..., 1888-1894, ..., 2007-2013}
- medium: 12-year intervals {1703-1715, 1716-1728, ..., 1885-1897, ..., 2002-2014}
- coarse: 20-year intervals {1692-1712, 1713-1733, ..., 1881-1901, ..., 2007-2027}
22 / 10

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Scan: learn temporal word representations for all nouns and verbs that occur at least twice in the DTE development dataset (883 words); T = 5 years, K = 8.
Predicting the time of a test news snippet:
1. detect mentions of target words {c}; for each target:
   1.1 construct a document from c and its ±5 surrounding words w
   1.2 compute a distribution over time slices: p^(c)(t | w) ∝ p^(c)(w | t) p^(c)(t)
2. combine the target-wise predictions into a final distribution
3. predict the time t with highest probability
23 / 10
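A sketch of this prediction rule, scoring each time slice by combining the per-target posteriors in log space; the data structures (per-target Scan parameter arrays, a log prior over slices) are assumptions for illustration.

```python
import numpy as np
from scipy.special import logsumexp

def predict_time(mentions, scan_params, log_prior):
    """mentions: list of (target_word, context_word_ids) pairs from one snippet;
    scan_params[c] = (phi, psi) with phi: (T, K) and psi: (T, K, V);
    log_prior: (T,) array of log p(t). Returns the predicted time-slice index."""
    total = np.asarray(log_prior, dtype=float).copy()
    for c, word_ids in mentions:
        phi, psi = scan_params[c]
        # log p(w | t) = log sum_k phi_t[k] * prod_i psi_{t,k}[w_i]
        log_pw_t = logsumexp(
            np.log(phi) + np.log(psi[:, :, word_ids]).sum(axis=2), axis=1
        )
        total += log_pw_t   # p(t | w) ∝ p(w | t) p(t), combined over targets
    return int(np.argmax(total))
```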

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
Supervised classification with a multiclass SVM:
- SVM Scan: features are argmax_k p^(c)(k | t), the most likely sense from the Scan models
- SVM Scan+n-gram: the most likely Scan sense plus character n-grams
23 / 10
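A hedged sklearn sketch of how such an SVM Scan+n-gram classifier could be assembled (character n-grams plus a one-hot most-likely-sense feature); the helper scan_sense_of and all names are assumptions, not the paper's pipeline.

```python
import scipy.sparse as sp
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import LinearSVC

def train_dte_svm(snippets, time_labels, scan_sense_of):
    """snippets: raw text snippets; time_labels: their time-interval ids;
    scan_sense_of(snippet) -> int, most likely Scan sense of its target word."""
    char_ngrams = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X_text = char_ngrams.fit_transform(snippets)
    sense_enc = OneHotEncoder(handle_unknown="ignore")
    X_sense = sense_enc.fit_transform([[scan_sense_of(s)] for s in snippets])
    X = sp.hstack([X_text, X_sense]).tocsr()
    clf = LinearSVC().fit(X, time_labels)   # multiclass one-vs-rest SVM
    return clf, char_ngrams, sense_enc
```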

3. Diachronic Text Evaluation (DTE) (SemEval, 2015)
                   Subtask 1 (factual cues)   Subtask 2 (linguistic cues)
                   2 yr    6 yr    12 yr      6 yr    12 yr   20 yr
  Baseline         .097    .214    .383       .199    .343    .499
  Scan-not         .265    .435    .609       .259    .403    .567
  Scan             .353    .569    .748       .376    .572    .719
  IXA              .187    .375    .557       .261    .428    .622
  AMBRA            .167    .367    .554       .605    .767    .868
  UCD              -       -       -          .759    .846    .910
  SVM Scan         .192    .417    .545       .573    .667    .790
  SVM Scan+ngram   .222    .467    .627       .747    .821    .897
Scores: accuracy, a precision measure discounted by distance from the true time.
Discussion:
- did we just use more data? (no)
- our system is not application specific
- use different systems for different DTE subtasks
24 / 10