Semantic Similarity and Relatedness
(Based on Budanitsky and Hirst 2006 and Chapter 20 of Jurafsky/Martin, 2nd ed. Most figures are taken from one of these sources.)

Many applications require a measure of the relation between words, even when they are not collocated: WSD, Information Retrieval, Query Translation.

Identifying Relatedness
- Synonyms: thesaurus, WordNet, e.g. {favored, popular, preferred}
- Same-category terms: thesaurus categories, other ontologies, the WordNet hierarchy; e.g. a Roget's Thesaurus category such as {admire, delight, attract, unwelcome}
- Context-based similarity

WordNet Path-Based Similarity
Simplest measure: path length, the number of edges in the shortest path between sense nodes c1 and c2:
  sim(c1, c2) = -log pathlen(c1, c2)
  wordsim(w1, w2) = max over c1 in senses(w1), c2 in senses(w2) of sim(c1, c2)
(Figure from Budanitsky and Hirst 2006.)

Path Length: Limitations
Edges at different heights carry different information (Wu and Palmer 1994).
lso: lowest super-ordinate, the most specific common ancestor of two concepts.
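The path-length measure above can be sketched in a few lines of Python. The toy hierarchy and the sense inventory below are invented for illustration; real systems would walk the WordNet hypernym graph instead.

```python
import math

# A hypothetical is-a hierarchy: child -> parent edges (toy example only).
parent = {
    "nickel": "coin", "dime": "coin", "coin": "money",
    "money": "medium_of_exchange", "credit_card": "medium_of_exchange",
}

def hypernym_path(c):
    """Concepts from c up to the root, following parent links."""
    path = [c]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path

def pathlen(c1, c2):
    """Number of edges on the shortest path through the lowest super-ordinate."""
    p1, p2 = hypernym_path(c1), hypernym_path(c2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node)
    raise ValueError("no common ancestor")

def sim(c1, c2):
    # sim(c1, c2) = -log pathlen(c1, c2); identical concepts give pathlen 0,
    # so in practice an offset such as pathlen + 1 is often used.
    return -math.log(pathlen(c1, c2))

# Toy word-to-sense inventory (invented for this example).
senses = {"nickel": ["nickel"], "money": ["money", "medium_of_exchange"]}

def wordsim(w1, w2):
    """wordsim(w1, w2) = max over sense pairs (c1, c2) of sim(c1, c2)."""
    return max(sim(c1, c2) for c1 in senses[w1] for c2 in senses[w2])
```

Taking the max over sense pairs means word similarity is judged by the closest pair of senses, which is why path-based word similarity needs no prior disambiguation.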
Path Length: Further Limitations
Even at the same height, all edges are not equal: compare <credit, credit card> with <money, cash>. We need a weight on the edges.

Measuring P(c): Information Content
  IC(c) = -log P(c) = log(1/P(c))
Note that c refers to a category: all words subsumed by c are also counted. The more common a word, the lower its IC.

IC-Based Edge Weighting (Jiang and Conrath 1997)
Key idea: the semantic distance of the link connecting a child-concept c to its parent-concept par(c) is proportional to the conditional probability P(c | par(c)):
  EdgeWeight(c, par(c)) = -log P(c | par(c)) = IC(c) - IC(par(c))
Lin (1998) derived the same formula slightly differently: word similarity is about commonality vs. differences.

Simplifying
With this edge-weight definition, the weights along the path from c to the root telescope:
  PathWeight(c, Root) = IC(c)
Putting PathWeight in place of depth in the Wu-Palmer measure gives Lin's similarity (the worked example on the slide gives 0.59):
  sim_Lin(c1, c2) = 2 * IC(lso(c1, c2)) / (IC(c1) + IC(c2))
Jiang-Conrath distance:
  dist_JC(c1, c2) = {IC(c1) - IC(lso(c1, c2))} + {IC(c2) - IC(lso(c1, c2))}

WordNet-Based Similarity: Key Ideas
- All edges are not equal; edges at different heights carry different information.
- Even at the same height, all edges are not equal: we need a weight on the edges.
- Edge weight as conditional probability P(c | par(c)); information content IC(c) = -log P(c).
- EdgeWeight(c, par(c)) = IC(c) - IC(par(c)), so PathWeight(c, Root) = IC(c).
- Integrating Wu-Palmer with PathWeight gives Lin's measure; the same can be derived by viewing similarity as commonality vs. differences.
- Replacing raw edge counts in sim(c1, c2) = -log pathlen(c1, c2) with EdgeWeight gives the Jiang-Conrath distance, which gives the best results on several applications with the English WordNet.
- Compared to Wu-Palmer, ignoring depth is acceptable because it is already taken care of by IC.
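The IC-based measures above can be sketched directly from the formulas. The concept counts below are invented for illustration; in practice P(c) is estimated from a corpus, counting every word subsumed by c.

```python
import math

# Hypothetical subsumption counts per concept (numbers invented).
# Counts grow monotonically toward the root, since a category's count
# includes all words it subsumes.
count = {"coin": 20, "cash": 30, "money": 80, "entity": 1000}
TOTAL = count["entity"]  # the root subsumes everything

def IC(c):
    # Information content: IC(c) = -log P(c); rarer concepts carry more information.
    return -math.log(count[c] / TOTAL)

def edge_weight(c, par):
    # EdgeWeight(c, par(c)) = -log P(c | par(c)) = IC(c) - IC(par(c))
    return IC(c) - IC(par)

def jiang_conrath_dist(c1, c2, lso):
    # dist_JC(c1, c2) = {IC(c1) - IC(lso)} + {IC(c2) - IC(lso)}
    return (IC(c1) - IC(lso)) + (IC(c2) - IC(lso))

def lin_sim(c1, c2, lso):
    # Wu-Palmer with depth replaced by PathWeight(c, Root) = IC(c):
    # sim_Lin(c1, c2) = 2 * IC(lso) / (IC(c1) + IC(c2))
    return 2 * IC(lso) / (IC(c1) + IC(c2))
```

Note how the two measures relate: Lin normalizes the shared IC by the total IC (a ratio in [0, 1]), while Jiang-Conrath sums the IC each concept has beyond the shared ancestor (a distance, 0 for identical concepts).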
Similarity vs. Relatedness
The previous methods may not work for words belonging to different classes: car and petrol are related but not similar.

Gloss Overlap as a Similarity Measure
Extended Gloss Overlap (concepts and figures from Banerjee and Pedersen 2003). Instead of the WordNet relations, some other method can be used for extending the definitions.

Distributional Hypothesis: Similarity Using Co-occurrence Vectors
- "Words that occur in the same contexts tend to have similar meanings" (Harris, 1954).
- "A word is characterized by the company it keeps" (Firth, 1957).

Commonality vs. Differences
All these measures change value with scaling.
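The gloss-overlap idea can be sketched as follows. In Banerjee-Pedersen-style scoring, a shared phrase of n consecutive words contributes n squared, rewarding longer matches. This is a simplified illustration (greedy matching, no stop-word handling), not their exact algorithm:

```python
def gloss_overlap(g1, g2):
    """Score the overlap between two glosses: repeatedly find the longest
    common phrase, add n**2 for a phrase of n words, and remove it."""
    w1, w2 = g1.lower().split(), g2.lower().split()
    score = 0
    while True:
        best = None  # (start in w1, start in w2, length) of longest match
        for i in range(len(w1)):
            for j in range(len(w2)):
                n = 0
                while i + n < len(w1) and j + n < len(w2) and w1[i + n] == w2[j + n]:
                    n += 1
                if n and (best is None or n > best[2]):
                    best = (i, j, n)
        if best is None:
            return score
        i, j, n = best
        score += n * n        # a phrase of n words scores n squared
        del w1[i:i + n]       # remove so the same words are not counted twice
        del w2[j:j + n]
```

In the extended variant, the glosses of concepts related to c1 and c2 (hypernyms, meronyms, etc.) are concatenated in before scoring, which is what lets the measure capture relatedness across classes.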
Normalized Measure: Cosine
(Figures from FSNLP and from Patwardhan and Pedersen 2006.)
The right measure depends on the application:
- What happens when we scale the values?
- What happens when we add dimensions/features? Consider a vector with 4 entries: what happens if we add 20 more entries to it? How similar is it to the original vector according to the different measures?
The cosine measure is the de facto standard for term-vector comparison.

Applications of Cosine Similarity
- In IR: query vs. document similarity.
- Comparing documents.
- Finding obfuscated code, for virus detection or assignment copy detection.

Tuning the Context Vectors
- Remove the very high frequency and very low frequency terms.
- Weigh the terms using tf*idf (or inverse definition/context frequency in this case):
    term weight = 1 + log(tf),  IDF = log(N/df)
- Use PMI instead of raw counts:
    PMI(w1, w2) = log { P(w1, w2) / (P(w1) P(w2)) }

Using Dependency Relations
- Only use context words that have a dependency relation with the target word, and use PMI conditional on the dependency relation. The final similarity measure (formulae and figure from Lin 1998):
    I(w, w') = log { C(w, w') C(*, *) / (C(w, *) C(*, w')) }
  (Exercise: show that this is equivalent to PMI conditional on the relation r.)
- T(w): the set of pairs (r, w') such that I(w, r, w') is positive.
- Full parsing is expensive: use shallow parsing only (subject, object, modifier, etc.).

Using Probabilistic Measures
When counts are replaced by probabilities, cosine distance may not be the best metric.
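A minimal sketch of the three ingredients above: cosine (note that it is invariant to scaling, unlike a raw dot product), the 1 + log(tf) / log(N/df) weighting, and PMI from counts. The function names are mine:

```python
import math

def cosine(u, v):
    """Cosine similarity: the dot product of length-normalized vectors,
    so multiplying a vector by a constant does not change the result."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def tf_idf(tf, df, n_docs):
    # term weight = (1 + log tf) * log(N / df); dampens raw frequency and
    # downweights terms that appear in many documents/contexts.
    return (1 + math.log(tf)) * math.log(n_docs / df)

def pmi(c_joint, c_1, c_2, total):
    # PMI(w1, w2) = log { P(w1, w2) / (P(w1) P(w2)) }, estimated from counts.
    p12 = c_joint / total
    p1, p2 = c_1 / total, c_2 / total
    return math.log(p12 / (p1 * p2))
```

For example, if two words each occur 100 times in 10,000 contexts and co-occur 10 times, they co-occur ten times more often than chance, so their PMI is log 10.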
Different vector entries are not co-operating but competing: we need to compare probability distributions.

KL-Divergence
Measures how well distribution Q approximates distribution P, i.e. how much information is lost if we assume Q instead of P. It is asymmetric, and undefined where Q is 0; the Jensen-Shannon divergence is a symmetrized, always-defined alternative.

Second Order Context Vectors
- Standard approach: compare two words using the cosine similarity of their co-occurrence vectors.
- Knowledge-based approach: overlaps based on definitions.
- Marrying the two, second-order context vectors: extend the definitions with the centroid of the context vectors of the words in the definition (Patwardhan and Pedersen 2006).
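The two divergences can be sketched directly from their definitions (distributions as lists of probabilities; a simplified illustration without smoothing):

```python
import math

def kl(p, q):
    """D(P || Q): expected extra information when Q is assumed instead of P.
    Asymmetric; raises an error where Q is 0 but P is not (KL is undefined there).
    Terms with p_i = 0 contribute nothing, by the convention 0 log 0 = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their midpoint M.
    Symmetric, and always defined, since M is nonzero wherever P or Q is."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Because JS compares both distributions to their mixture, it sidesteps both problems noted above, which is why it is the usual choice for comparing word co-occurrence distributions.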
Similarity of Short Contexts: Applications

Reducing Dimensions: Singular Value Decomposition (SVD)
(Figures from lion.cs.uiuc.edu, Pedersen 2009, and www.mathworks.com; see FSNLP Chapter 15.4.)
Note the similarity between d2 and d3: SVD discovers that cosmonaut and astronaut are related, as they both co-occur with moon.
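The cosmonaut/astronaut effect can be reproduced on the word-by-document matrix from the FSNLP example. A sketch assuming NumPy is available; the variable names are mine:

```python
import numpy as np

# Word-by-document matrix from the FSNLP Chapter 15.4 example
# (rows: cosmonaut, astronaut, moon, car, truck; columns: d1..d6).
A = np.array([
    [1, 0, 1, 0, 0, 0],  # cosmonaut
    [0, 1, 0, 0, 0, 0],  # astronaut
    [1, 1, 0, 0, 0, 0],  # moon
    [1, 0, 0, 1, 1, 0],  # car
    [0, 0, 0, 1, 0, 1],  # truck
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Decompose A = U S V^T and keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
words_k = U[:, :k] * s[:k]        # reduced word vectors
docs_k = Vt[:k, :].T * s[:k]      # reduced document vectors

# cosmonaut and astronaut never co-occur, so their raw cosine is 0 ...
raw = cosine(A[0], A[1])
# ... but in the reduced space they become similar, via their shared
# co-occurrence with moon; likewise documents d2 and d3.
reduced_words = cosine(words_k[0], words_k[1])
reduced_docs = cosine(docs_k[1], docs_k[2])
```

Truncating to the top singular dimensions forces words with similar co-occurrence patterns onto the same axes, which is exactly the second-order similarity the slide describes.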