Determining Word Sense Dominance Using a Thesaurus

Determining Word Sense Dominance Using a Thesaurus Saif Mohammad and Graeme Hirst Department of Computer Science University of Toronto EACL, Trento, Italy (5th April, 2006) Copyright cfl2006, Saif Mohammad and Graeme Hirst Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 1

Word Sense Dominance star (CELESTIAL BODY) star (CELEBRITY) The degree of dominance of a sense of a word is the proportion of occurrences of that sense in text. Applications: Ξ Sense disambiguation, document clustering,... Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 2

McCarthy et al. s Method SIMILARLY SENSE DISTRIBUTED TARGET TEXT D = dominance method Requires WordNet. WORDNET D DOMINANCE VALUES LIN S THESAURUS A U X I L I A R Y C O R P U S Needs auxiliary text with similar sense distribution. Requires retraining (Lin s thesaurus). Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 3

Our Method TARGET TEXT PUBLISHED THESAURUS D DOMINANCE VALUES WCCM WCCM = word category co-occurrence matrix A U X I L I A R Y C O R P U S We use a published thesaurus. Auxiliary text need not have similar sense distribution. No retraining is needed (WCCM created just once). Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 4

Published Thesauri E.g., Roget s (English), Macquarie (English), Cilin (Chinese), Bunrui Goi Hyou (Japanese) Vocabulary divided into about 1000 categories Words in a category (category terms or c-terms) are Ξ closely related. A category very roughly corresponds to a sense Ξ (Yarowsky, 1992). One word, more than one category Ξ bark in ANIMAL NOISES and MEMBRANE. Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 5

Why a Thesaurus? Coarse senses: WordNet is much too fine grained. Computational ease: With just 1000 categories, the word category co-occurrence matrix is of manageable size. Availability: Thesauri are available in many languages. Words for a sense: Each sense can be represented unambiguously with a set of (possibly ambiguous) words. Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 6

Word Category Matrix c 1 c 2 ::: c j ::: w 1 m 11 m 12 ::: m 1 j ::: 2 ::: m 2 j :::.. ::: :::.... w m 21 m 22 i ::: :::........ w m i1 m i2 m ij WCCM: categories (thesaurus) vs. words (vocabulary) Cell m ij : number of times word w i co-occurs with a c-term listed in category c j Text: most of the BNC Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 7

Example CELESTIAL BODY space star CELEBRITY cell (space, CELESTIAL BODY) incremented by 1 cell (space, CELEBRITY) incremented by 1 Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 8

Example (continued) CELESTIAL BODY space X... X: star, nova, constellation, sun, empty, web Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 9

Word Category Matrix CELESTIAL c 1 c 2 ::: ::: BODY w 1 m 11 m 12 ::: m 1 j ::: w 2 m 21 m 22 ::: m 2 j :::..... :::. ::: space m i1 m i2 "" ::: m :::........ Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 10

Contingency Table for w and c w: :: c :c w n wc n :w n :c n Applying a statistic gives the strength of association (SoA) cosine Dice odds ratio pointwise mutual information Yule s coefficient of colligation Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 11

Evidence for the Senses SoA CELESTIAL BODY space SoA star CELEBRITY Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 12

Base WCCM Matrix created after the first pass of unannotated text noisy Ξ Ξ captures strong associations Words that occur close to a target word Ξ Good indicators of intended sense Ξ Co-occurrence frequency used as evidence Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 13

Bootstrapping the WCCM Second pass of the auxiliary corpus Word sense disambiguation: using co-occurring words Ξ and evidence from base WCCM New, more accurate, WCCM Cell m Ξ ij : number of times word used in sense c j co-occurs with w i Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 14

Four Methods Implicit sense disambiguation Explicit sense disambiguation Weighted voting D I,W D E,W Unweighted voting D I,U D E,U The stronger the association of a sense with its co-occurring words, the higher is its dominance. Weighted vote (SoA) to each sense or unweighted vote to sense with the highest SoA Explicit word sense disambiguation or not Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 15

Method: D I;W Each word that co-occurs with the target word t gives a weighted vote (SoA) to each sense. Dominance of a sense c is the proportion of votes it gets. D I;W(t ;c) = w2t SoA(w;c) 2senses(t) c0 w2t SoA(w;c ) 0 T is the set of all words that co-occur with t. Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 16

Method: D I;U Each word that co-occurs with the target word gives an unweighted, equal vote to a winner sense. Sense with highest strength of association with cooccurring Ξ words Dominance of a sense is the proportion of votes it gets. D jfw T : argmax c ) 2senses(t) = cgj I;U(t 0 2 SoA(w;c0 = ;c) j jt Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 17

Methods: D E ;W and D E ;U Explicit sense disambiguation Votes from co-occurring words Ξ Votes can be weighted or unweighted Ξ Dominance of a sense Ξ Proportion of occurrences pertaining to that sense Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 18

Experimental Setup Naïve sense disambiguation system Ξ Gives predominant sense as output Test datasets Different sense distributions of the two most dominant Ξ senses of each target word Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 19

Sense-tagged Data We created pseudo-thesaurus-sense-tagged data for the 27 head words in SENSEVAL-1 English Sample Space using the held out subset of BNC. Non-monosemous target word: brilliant Category: INTELLIGENCE Monosemous c-term: clever Sentence from auxiliary text: Hermione had a clever plan. Sense annotated sentence: Hermione had a brilliant//intelligence plan. Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 20

Best Results Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 21

Best Results Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Upper bound 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 22

Best Results Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Upper bound Random baseline 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 23

Best Results Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Upper bound Random baseline Lower bound 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 24

Best Results Accuracy 1 0.9 0.8 D E,W Mean distance below upper bound 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 25

Best Results Accuracy 1 0.9 0.8 D E,W Mean distance below upper bound 0.7 0.6 0.5 D E ;W :.02 pmi, odds, Yule 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 26

Best Results Accuracy 1 0.9 0.8 D E,W D I,W Mean distance below upper bound 0.7 0.6 0.5 D E ;W :.02 pmi, odds, Yule D I;W :.03 0.4 0.3 pmi 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 27

Best Results Accuracy 1 0.9 0.8 0.7 0.6 0.5 D E,W D I,W D I,U D E,U Mean distance below upper bound D E ;W :.02 pmi, odds, Yule D I;W :.03 0.4 0.3 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) pmi D I;U:.11 phi, pmi, odds, Yule D E ;U:.16 phi, pmi, odds, Yule Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 28

Observations Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 D E,W D I,W D I,U D E,U Weighted methods are better Explicit or implicit disambiguation does not matter 0.3 0.2 0.1 Odds, pmi, and Yule are better 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 29

Effect of Bootstrapping Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 DE,W (odds) bootstrapped 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 30

Effect of Bootstrapping Accuracy 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 DE,W (odds) bootstrapped D E,W (odds) 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 31

Effect of Bootstrapping Accuracy 1 0.9 0.8 DE,W (odds) bootstrapped D E,W (odds) Most of the gain given by the first iteration. 0.7 0.6 0.5 0.4 0.3 Relative behavior of measures more or less the same. 0.2 0.1 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Sense distribution (alpha) Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 32

In Summary New methods for determining sense dominance Raw text and a published thesaurus Ξ No similarly-sense-distributed text or re-training Ξ Extensive experiments Ξ Synthetically created thesaurus-sense-tagged data Results are close to the upper bound Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 33

Future Work WCCM has applications beyond sense dominance. Linguistic distances Distributional distance of concepts Word sense disambiguation Unsupervised naïve Bayes classifier Machine translation Domain-specific translational dominance Document clustering Represent document in concept space Determining Word Sense Dominance using a Thesaurus. Saif Mohammad and Graeme Hirst. 34