Multi-Component Word Sense Disambiguation

Massimiliano Ciaramita and Mark Johnson
Brown University
BLLIP: http://www.cog.brown.edu/research/nlp

Outline
- Pattern classification for WSD
  - Features
  - Flat multiclass averaged perceptron
- Multi-component WSD
  - Generating external training data
  - Multi-component perceptron
- Experiments and results

Pattern classification for WSD

English lexical sample: 57 test words (32 verbs, 20 nouns, 5 adjectives). For each word $w$:
1. compile a training set $S(w) = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^d$ is a vector of features and $y_i \in Y(w)$ is one of the possible senses of $w$
2. learn a classifier on $S(w)$: $H : \mathbb{R}^d \rightarrow Y(w)$
3. use the classifier to disambiguate the unseen test data
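To make the setup concrete, here is a rough Python sketch of the per-word train/disambiguate loop; extract_features, make_classifier, and the fit/predict interface are hypothetical placeholders for the pieces described on the following slides, not the authors' code.

# Sketch of the per-word WSD setup: one classifier H : R^d -> Y(w) per word.
def train_wsd_classifiers(training_data, extract_features, make_classifier):
    # training_data: {word w: [(context, sense), ...]}
    classifiers = {}
    for word, examples in training_data.items():
        X = [extract_features(context) for context, _ in examples]
        y = [sense for _, sense in examples]
        clf = make_classifier()      # e.g. an averaged multiclass perceptron
        clf.fit(X, y)                # learn on S(w)
        classifiers[word] = clf
    return classifiers

def disambiguate(classifiers, extract_features, word, context):
    # apply the classifier for `word` to an unseen test context
    return classifiers[word].predict(extract_features(context))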

Features

Standard feature set for WSD, derived from (Yoong and Hwee, 2002). Example context (the target word is bank):
A-DT newspaper-NN and-CC now-RB a-DT bank-NN have-AUX since-RB taken-VBN over-RB
- POS of neighboring words: $P_x$, $x \in \{-3, -2, -1, 0, +1, +2, +3\}$; e.g., $P_{-1}$ = DT, $P_0$ = NN, $P_{+1}$ = AUX, ...
- Surrounding words: WS; e.g., WS = take_v, WS = over_r, WS = newspaper_n
- N-grams: $NG_x$, $x \in \{-2, -1, +1, +2\}$; e.g., $NG_{-2}$ = now, $NG_{+1}$ = have, $NG_{+2}$ = take
- N-gram pairs: $NG_{x,y}$, $(x, y) \in \{(-2,-1), (-1,+1), (+1,+2)\}$; e.g., $NG_{-2,-1}$ = now_a, $NG_{+1,+2}$ = have_since
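The sketch below shows how such positional POS, surrounding-word, and n-gram features could be generated from a POS-tagged sentence; the string-valued feature names and the simplifications (no lemmatization or POS suffix on surrounding words) are assumptions for illustration, not the authors' implementation.

# Illustrative feature extraction around a target position in a
# POS-tagged sentence [(token, POS), ...]; returns a set of feature strings.
def extract_features(tagged_sentence, target_index, window=3):
    feats = set()
    n = len(tagged_sentence)
    # POS of neighboring words: P_x for x in -3..+3
    for offset in range(-window, window + 1):
        j = target_index + offset
        if 0 <= j < n:
            feats.add(f"P_{offset:+d}={tagged_sentence[j][1]}")
    # Surrounding words (simplified: lowercased tokens, no POS suffix)
    for token, _ in tagged_sentence:
        feats.add(f"WS={token.lower()}")
    # Unigram n-grams: NG_x for x in {-2, -1, +1, +2}
    for offset in (-2, -1, +1, +2):
        j = target_index + offset
        if 0 <= j < n:
            feats.add(f"NG_{offset:+d}={tagged_sentence[j][0].lower()}")
    # Bigram n-grams: NG_{x,y} for (x, y) in {(-2,-1), (-1,+1), (+1,+2)}
    for a, b in ((-2, -1), (-1, +1), (+1, +2)):
        i, j = target_index + a, target_index + b
        if 0 <= i < n and 0 <= j < n:
            feats.add(f"NG_{a:+d},{b:+d}="
                      f"{tagged_sentence[i][0].lower()}_{tagged_sentence[j][0].lower()}")
    return feats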

Syntactic features (Charniak, 2000)

- Governing elements under a phrase: G1; e.g., G1 = take_S
- Governed elements under a phrase: G2; e.g., G2 = a_NP, G2 = now_NP
- Coordinates: OO; e.g., OO = newspaper_S1

[Parse tree of "A newspaper and now a bank have since taken over", with nodes marked to illustrate the G2, OO, and G1 features]

Multiclass Perceptron (Crammer and Singer, 2003)

Discriminant function: $H(x; V) = \arg\max_{r=1,\ldots,k} \langle v_r, x \rangle$

Input: $V \in \mathbb{R}^{|Y(w)| \times d}$, $d \approx 200{,}000$, initialized as $V = 0$. Repeat T times (passes over the training data, or epochs):

MulticlassPerceptron($(x_i, y_i)^n$, $V$):
  for i = 1 to n:
    $E = \{r : \langle v_r, x_i \rangle > \langle v_{y_i}, x_i \rangle\}$
    if $E \neq \emptyset$ then
      $\tau_{y_i} = 1$;  $\tau_r = -\frac{1}{|E|}$ for $r \in E$;  $\tau_r = 0$ otherwise
      for r = 1 to k: $v_r \leftarrow v_r + \tau_r x_i$
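A compact Python reading of this update, assuming sparse binary feature sets and one weight dictionary per sense (these data structures are an assumption, not the authors' implementation); ties with the gold score are counted as errors in this sketch so that learning can start from zero-initialized weights.

def score(weights, label, features):
    # inner product <v_label, x> for a sparse binary feature set
    return sum(weights[label].get(f, 0.0) for f in features)

def perceptron_update(weights, features, gold, labels):
    # Error set E: senses scoring at least as high as the gold sense.
    gold_score = score(weights, gold, features)
    errors = [r for r in labels
              if r != gold and score(weights, r, features) >= gold_score]
    if errors:
        for f in features:
            weights[gold][f] = weights[gold].get(f, 0.0) + 1.0        # tau_gold = 1
            for r in errors:                                          # tau_r = -1/|E|
                weights[r][f] = weights[r].get(f, 0.0) - 1.0 / len(errors)

def train_perceptron(examples, labels, epochs):
    # examples: list of (feature_set, gold_sense) pairs; `epochs` passes (T)
    weights = {r: {} for r in labels}
    for _ in range(epochs):
        for features, gold in examples:
            perceptron_update(weights, features, gold, labels)
    return weights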

Averaged perceptron classifier

- Perceptron's output: $V^{(0)}, \ldots, V^{(n)}$, where $V^{(i)}$ is the weight matrix after the first $i$ training items
- Final model: $V = V^{(n)}$
- Averaged perceptron (Collins, 2002): final model $V = \frac{1}{n} \sum_{i=1}^{n} V^{(i)}$
- Averaging reduces the effect of over-training
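Averaging can be implemented without storing all intermediate weight matrices by keeping a running sum after each training item; the naive sketch below (reusing perceptron_update from the previous sketch) is one common way to do it, not necessarily the authors' implementation.

def train_averaged_perceptron(examples, labels, epochs):
    weights = {r: {} for r in labels}
    totals = {r: {} for r in labels}
    steps = 0
    for _ in range(epochs):
        for features, gold in examples:
            perceptron_update(weights, features, gold, labels)
            for r in labels:                       # accumulate V(i) into the running sum
                for f, v in weights[r].items():
                    totals[r][f] = totals[r].get(f, 0.0) + v
            steps += 1
    # final model: average of the intermediate weight matrices
    return {r: {f: v / steps for f, v in totals[r].items()} for r in labels}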

Outline
- Pattern classification for WSD
  - Features
  - Flat multiclass perceptron
- Multi-component WSD
  - Generating external training data
  - Multi-component perceptron
- Experiments and results

Sparse data problem in WSD

- Thousands of word senses: about 120,000 in WordNet 2.0
- Very specific classes: 50% of noun synsets contain a single noun
- Problem: training instances are often too few for such fine-grained semantic distinctions

Solution:
1. use the WordNet hierarchy to find similar word senses and generate external training data for these senses
2. integrate the task-specific and external data with the perceptron

Intuition: to classify an instance of the noun disk, additional knowledge about concepts such as other audio or computer memory devices could be helpful.

Finding neighbor senses

disc_1 = memory device for storing information
disc_2 = phonograph record

[WordNet hierarchy fragment: disc_1 sits under MEMORY_DEVICE via MAGNETIC_DISC (magnetic_disk, disk), near synsets such as FLOPPY (floppy_disk, diskette, floppy) and HARD_DISK (hard_disk, fixed_disk); disc_2 sits under RECORDING via AUDIO_RECORDING, near synsets such as DISC (disc, record, platter), LP (l.p., lp), AUDIOTAPE (audiotape), and DIGITAL_AUDIOTAPE (digital_audiotape, dat)]

Finding neighbor senses

neighbors(disc_1) = floppy disk, hard disk, ...
neighbors(disc_2) = audio recording, lp, soundtrack, audiotape, talking book, digital audio tape, ...

[Same WordNet hierarchy fragment as on the previous slide]

External training data

- Find neighbors: for each sense $y$ of a noun or verb in the task, a set $\hat{y}$ of k = 100 neighbor senses is generated from the WordNet hierarchy
- Generate new instances: for each synset in $\hat{y}$, a training instance $(x_i, \hat{y}_i)$ is compiled from the corresponding WordNet glosses (definitions and example sentences), using the same set of features
- Result: for each noun/verb
  1. task-specific training data $(x_i, y_i)_{i=1}^{n}$
  2. external training data $(x_i, \hat{y}_i)_{i=1}^{m}$
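A rough sketch of how neighbor senses and gloss-based instances could be collected with NLTK's WordNet interface (requires nltk and nltk.download('wordnet')); the hypernym/hyponym walk in neighbor_synsets is only an approximation of the procedure summarized above, extract_features stands for the same feature pipeline applied to gloss text, and the chosen sense of "disk" is illustrative.

from nltk.corpus import wordnet as wn

def neighbor_synsets(synset, k=100):
    # Approximate neighbor search: walk up to hypernyms and collect
    # their other hyponyms, repeating until k senses are found.
    neighbors, seen, frontier = [], {synset}, [synset]
    while frontier and len(neighbors) < k:
        hypernyms = [h for s in frontier for h in s.hypernyms()]
        for h in hypernyms:
            for hypo in h.hyponyms():
                if hypo not in seen:
                    seen.add(hypo)
                    neighbors.append(hypo)
                    if len(neighbors) >= k:
                        return neighbors
        frontier = hypernyms
    return neighbors

def external_instances(synset, extract_features, k=100):
    # one (features, neighbor_sense) instance per neighbor synset,
    # built from its WordNet gloss and example sentences
    instances = []
    for nb in neighbor_synsets(synset, k):
        gloss = nb.definition() + " " + " ".join(nb.examples())
        instances.append((extract_features(gloss), nb.name()))
    return instances

# e.g. some neighbors of one sense of the noun "disk" (sense index illustrative)
disk_sense = wn.synsets('disk', pos=wn.NOUN)[0]
print([s.name() for s in neighbor_synsets(disk_sense, k=10)])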

Multi-component perceptron

- A simplification of the hierarchical perceptron (Ciaramita et al., 2003)
- A weight matrix $V$ is trained on the task-specific data
- A weight matrix $M$ is trained on the external data
- Discriminant function: $H(x; V, M) = \arg\max_{y \in Y(w)} \lambda_y \langle v_y, x \rangle + \lambda_{\hat{y}} \langle m_{\hat{y}}, x \rangle$
- $\lambda_y$ is an adjustable parameter that weights each component's contribution, with $\lambda_{\hat{y}} = 1 - \lambda_y$

Multi-Component Perceptron

The algorithm learns $V$ and $M$ independently.

MultiComponentPerceptron($(x_i, y_i)^n$, $(x_i, \hat{y}_i)^m$, $V$, $M$):
  $V \leftarrow 0$; $M \leftarrow 0$
  for t = 1 to T:
    MulticlassPerceptron($(x_i, y_i)^n$, $V$)
    MulticlassPerceptron($(x_i, \hat{y}_i)^n$, $M$)
    MulticlassPerceptron($(x_i, \hat{y}_i)^m$, $M$)
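A minimal sketch of the two-component model, reusing the hypothetical score and perceptron_update helpers from the earlier perceptron sketch; neighbor_of is an assumed mapping from each sense y to its neighbor label ŷ.

def train_multicomponent(task_data, external_data, labels, ext_labels,
                         neighbor_of, epochs):
    # V sees only the task-specific data; M sees the task-specific data
    # relabeled with neighbor senses plus the external (gloss) data.
    V = {r: {} for r in labels}
    M = {r: {} for r in ext_labels}
    for _ in range(epochs):
        for features, gold in task_data:
            perceptron_update(V, features, gold, labels)
            perceptron_update(M, features, neighbor_of[gold], ext_labels)
        for features, gold in external_data:
            perceptron_update(M, features, gold, ext_labels)
    return V, M

def predict(V, M, features, labels, neighbor_of, lam):
    # H(x; V, M) = argmax_y  lam * <v_y, x> + (1 - lam) * <m_{y-hat}, x>
    def combined(y):
        return (lam * score(V, y, features)
                + (1 - lam) * score(M, neighbor_of[y], features))
    return max(labels, key=combined)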

Outline
- Pattern classification for WSD
  - Features
  - Flat multiclass averaged perceptron
- Multi-component WSD
  - Generating external training data
  - Multi-component perceptron
- Experiments and results

Experiments and results

- One classifier is trained for each test word
- Adjectives: standard flat perceptron; only T is tuned
- Verbs/nouns: multi-component perceptron; both T and $\lambda_y$ are tuned
- Cross-validation experiments on the training data of each test word:
  1. choose the value of $\lambda_y$: $\lambda_y = 1$ uses only the flat perceptron, $\lambda_y = 0.5$ weights both components equally
  2. choose the number of iterations T (average selected value: T = 13.9)
- For 37 out of 52 nouns/verbs $\lambda_y = 0.5$ was selected, i.e., the two-component model is more accurate than the flat perceptron
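A minimal sketch of this per-word model selection; evaluate is a hypothetical helper that trains with the given settings on the cross-validation folds and returns held-out accuracy, and the search grids shown are illustrative.

def select_hyperparameters(word_training_data, evaluate,
                           lambdas=(1.0, 0.5), epoch_grid=range(1, 41)):
    # pick lambda_y and T by cross-validation on this word's training data
    best_lam, best_T, best_acc = None, None, -1.0
    for lam in lambdas:          # 1.0 = flat perceptron only, 0.5 = equal weights
        for T in epoch_grid:
            acc = evaluate(word_training_data, lam=lam, epochs=T)
            if acc > best_acc:
                best_lam, best_T, best_acc = lam, T, acc
    return best_lam, best_T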

English Lexical Sample Results

Measure                 Precision   Recall   Attempted (%)
Fine, all POS              71.1      71.1        100
Coarse, all POS            78.1      78.1        100
Fine, verbs                72.5      72.5        100
Coarse, verbs              80.0      80.0        100
Fine, nouns                71.3      71.3        100
Coarse, nouns              77.4      77.4        100
Fine, adjectives           49.7      49.7        100
Coarse, adjectives         63.5      63.5        100

Flat vs. multi-component: cross-validation on the training data

[Figure: accuracy (roughly 69 to 72.5) vs. training epoch (0 to 40) for the flat perceptron ($\lambda_y = 1.0$) and the two-component perceptron ($\lambda_y = 0.5$), shown in three panels: all words, verbs, and nouns]

Conclusion

Advantages of the multi-component perceptron trained on neighbor data:
- Neighbors: one supersense for each sense, and the same amount of additional data per sense
- Simpler model: smaller variance and more homogeneous external data
- Efficiency: fast and efficient training
- Architecture: simple, and easy to add any number of (weighted) components