Semantic Distances & LSA
Stefan Trausan-Matu
"Politehnica" University of Bucharest and Romanian Academy Research Institute for Artificial Intelligence
Bucharest, Romania
stefan.trausan@cs.pub.ro
http://www.racai.ro/~trausan
Lexical chains
  Venture capitalists have become culture heroes in the New Economy. They're the visionary investors who sink millions of dollars into risky start-up companies, take seats on their boards to mentor the tech nerds--and then cash out at a huge profit. The name "Sand Hill Road," the street in Menlo Park where Kleiner Perkins and some of the other famous venture capitalists have their offices, has assumed an almost magical quality, as if the partners were dispensing fairy dust rather than cash.
  29-Apr-11 S. Trausan-Matu
Applications of lexical chains
  Verification of text cohesion
  Text segmentation
  Summarization
  Word sense disambiguation
  Determination of discourse structure
  Automatic hypertext generation
  Intelligent spelling checking
  Information retrieval
Building lexical chains
  Text scanning and detection of semantically related words (small distance)
  Constructing a set of potential lexical chains
  Computational problems
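The two steps above can be sketched as a greedy scan. This is only an illustration under stated assumptions: the `RELATED` pairs, the `window` parameter and the first-match policy are hypothetical choices, not the procedure of any particular chaining algorithm.

```python
# Hypothetical relatedness judgments; a real chainer would query a thesaurus
# or WordNet instead of a hand-written set of pairs.
RELATED = {frozenset(pair) for pair in [
    ("investor", "capitalist"),
    ("investor", "profit"),
    ("cash", "profit"),
    ("cash", "dollars"),
    ("street", "road"),
]}

def related(w1, w2):
    """True for repeated words or pairs listed in RELATED."""
    return w1 == w2 or frozenset((w1, w2)) in RELATED

def build_chains(words, window=4):
    """Greedy chaining: attach each word to the first chain that holds a
    related word at most `window` positions back, else open a new chain."""
    chains = []  # each chain is a list of (position, word)
    for pos, word in enumerate(words):
        for chain in chains:
            if any(pos - p <= window and related(word, w) for p, w in chain):
                chain.append((pos, word))
                break
        else:
            chains.append([(pos, word)])
    return [[w for _, w in chain] for chain in chains]

candidates = ["capitalist", "investor", "dollars", "profit", "road", "street", "cash"]
for chain in build_chains(candidates):
    print(chain)
```

With this first-match policy "cash" joins the investor chain and "dollars" stays alone, which hints at why a set of potential chains is usually kept before committing to one.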
Low semantic distance = high similarity
  bank-money, bank-river, apple-fruit, orange-fruit, pen-paper, pen-pencil, men-crowd, red-blue, man-human, woman-human, man-men, men-women, man-bike, wheel-bike, sweet-bitter, sweet-dessert, desert-storm, desert-defect, singer-song, hacker-soft
Closeness relations
  Synonym
  Hyponym/hypernym
  Meronym/holonym
  Antonym
  Entailment
  Typicality
High semantic distance
  bike-cat, software-dog, dog-amoeba, idea-sleep, runner-paint
Semantic distance according to:
  Dictionaries (Kozima and Furugori; Kozima and Ito)
  Thesauri (e.g. Roget: Morris and Hirst; Bunrui Goihyo Japanese thesaurus: Okumura and Honda)
  Semantic networks (e.g. MeSH, Medical Subject Headings: Rada)
  Ontologies: WordNet, FrameNet
Chains of WordNet senses
  [venture capitalist(1), hero(1), visionary(1), investor(1), mentor(1), nerd(1), profit(2), venture capitalist(1), quality(1), partner(1), fairy(1)]
Rada et al.
  $\mathrm{dist}_R(c_1, c_2)$ = minimum number of edges between $c_1$ and $c_2$
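A minimal sketch of the edge-counting distance, using breadth-first search. The `PARENT` taxonomy is a hypothetical toy hierarchy (not WordNet or MeSH), and links are treated as undirected.

```python
from collections import deque

# Hypothetical toy is-a taxonomy (child -> parent).
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "amoeba": "animal",
    "animal": "entity",
    "bike": "vehicle",
    "vehicle": "entity",
}

def neighbours(concept):
    """Concepts one edge away, treating is-a links as undirected."""
    result = set()
    if concept in PARENT:
        result.add(PARENT[concept])
    result.update(child for child, parent in PARENT.items() if parent == concept)
    return result

def dist_rada(c1, c2):
    """Minimum number of edges between c1 and c2, or None if unconnected."""
    queue = deque([(c1, 0)])
    seen = {c1}
    while queue:
        node, d = queue.popleft()
        if node == c2:
            return d
        for nxt in neighbours(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None

print(dist_rada("dog", "cat"))     # 2: dog - mammal - cat
print(dist_rada("dog", "amoeba"))  # 3: dog - mammal - animal - amoeba
print(dist_rada("bike", "cat"))    # 5: the path goes through the root
```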
Hirst and St-Onge
  The approach of Morris and Hirst, applied to WordNet
  Directions:
    downward (cause, hyponym, holonym, entailment)
    upward (hypernym, meronym)
    horizontal (similar, participle_of, see_also, antonym, attribute)
Hirst and St-Onge
  Relations:
    extra-strong: word repetition
    strong: same synset (desert, defect); antonyms (cold, hot); sub-phrase (school, elementary school)
    medium-strength
Medium strength
  $$\mathrm{rel}_{HS} = \begin{cases} 3C & \text{for extra-strong relations} \\ 2C & \text{for strong relations} \\ C - \text{path\_length} - k \cdot \#\text{changes\_in\_direction} & \text{for medium-strength relations} \\ 0 & \text{otherwise} \end{cases}$$
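The piecewise definition translates directly into a small function. The defaults `C=8` and `k=1` are the constants usually reported for this measure; they are an assumption here, since the slide does not give values.

```python
def rel_hs(kind, path_length=0, direction_changes=0, C=8, k=1):
    """Hirst and St-Onge relatedness for a pair whose relation kind is known.

    kind: "extra-strong", "strong", "medium-strength", or anything else.
    path_length and direction_changes are used only for medium-strength pairs.
    C and k are constants (the defaults are assumed, not taken from the slide).
    """
    if kind == "extra-strong":
        return 3 * C
    if kind == "strong":
        return 2 * C
    if kind == "medium-strength":
        return C - path_length - k * direction_changes
    return 0

print(rel_hs("extra-strong"))                                          # 24
print(rel_hs("strong"))                                                # 16
print(rel_hs("medium-strength", path_length=3, direction_changes=1))   # 4
print(rel_hs("unrelated"))                                             # 0
```

Longer paths and more changes of direction both lower the score of a medium-strength relation.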
Forbidden sequences
  No other direction can precede an upward direction
  Only one direction change is allowed
  Exception: upward -> horizontal -> downward
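The rules above can be checked mechanically. In this sketch a path is a string over the hypothetical letters `U` (upward), `D` (downward) and `H` (horizontal), one letter per link; the encoding is an assumption made for illustration.

```python
from itertools import groupby

def collapse(path):
    """Collapse runs of the same direction: 'UUHD' -> ('U', 'H', 'D')."""
    return tuple(direction for direction, _ in groupby(path))

def is_allowed(path):
    """Apply the two rules and the single exception to a non-empty path."""
    pattern = collapse(path)
    # Rule 1: no other direction can precede an upward direction,
    # so 'U' may only appear as the first segment.
    if "U" in pattern[1:]:
        return False
    # Rule 2: at most one change of direction ...
    if len(pattern) <= 2:
        return True
    # ... except a horizontal segment bridging upward and downward.
    return pattern == ("U", "H", "D")

for path in ["UUD", "UHD", "DH", "DU", "HU", "DHD"]:
    print(path, is_allowed(path))
```

Under these rules exactly eight collapsed patterns survive: U, D, H, UD, UH, DH, HD and UHD.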
Forbidden sequences (figure)
Allowed sequences (figure)
Sussna
  The weight of an edge depends on the fan-out $n_r(c)$: the number of arcs leaving $c$.
  $w(c_1 \to c_2) = 2 - \frac{1}{n_r(c_1)}$
  If $c_1$ and $c_2$ are adjacent:
  $\mathrm{dist}_S(c_1, c_2) = \frac{w(c_1 \to c_2) + w(c_2 \to c_1)}{2d}$, where $d$ is the depth of the edge in the ontology
  If they are not adjacent, add the distances along the shortest path.
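A small numeric sketch of the adjacent-concept case, assuming the edge weight 2 - 1/n and a division by 2d with d the edge depth. The fan-out and depth values in the example are made up.

```python
def edge_weight(fanout):
    """Weight of an arc leaving a concept with `fanout` outgoing arcs:
    2 - 1/fanout, so it ranges from 1 (a single arc) towards 2."""
    return 2.0 - 1.0 / fanout

def dist_sussna(fanout_c1, fanout_c2, depth):
    """Distance between two adjacent concepts joined by an edge at `depth`."""
    return (edge_weight(fanout_c1) + edge_weight(fanout_c2)) / (2.0 * depth)

# Hypothetical values: c1 has 4 outgoing arcs, c2 has 1.
print(dist_sussna(4, 1, depth=3))
print(dist_sussna(4, 1, depth=6))  # same fan-outs, deeper edge: half the distance
```

Dividing by the depth makes edges low in the hierarchy, where concepts are more specific, count as shorter than edges near the root.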
Wu and Palmer
  $c = \mathrm{lso}(c_1, c_2)$, lso = lowest superordinate
  $\mathrm{sim}_{WP}(c_1, c_2) = \frac{2n}{n_1 + n_2 + 2n}$
  $n_i$: length of the path from $c_i$ to $c$
  $n$: length of the path from $c$ to the root
  $\mathrm{dist}_{WP}(c_1, c_2) = \frac{n_1 + n_2}{n_1 + n_2 + 2n}$
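A minimal sketch of both formulas on a hypothetical toy taxonomy (`PARENT` is made up). Path lengths are counted in edges, as in the definitions above; the guard for two root concepts is an added convention.

```python
# Hypothetical toy is-a taxonomy (child -> parent); "entity" is the root.
PARENT = {
    "dog": "mammal",
    "cat": "mammal",
    "mammal": "animal",
    "amoeba": "animal",
    "animal": "entity",
}

def ancestors(c):
    """c followed by its ancestors up to the root."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lso(c1, c2):
    """Lowest superordinate: the first ancestor of c1 that also subsumes c2."""
    above_c2 = set(ancestors(c2))
    return next(node for node in ancestors(c1) if node in above_c2)

def sim_wp(c1, c2):
    c = lso(c1, c2)
    n1 = ancestors(c1).index(c)   # edges from c1 up to c
    n2 = ancestors(c2).index(c)   # edges from c2 up to c
    n = len(ancestors(c)) - 1     # edges from c up to the root
    total = n1 + n2 + 2 * n
    return 1.0 if total == 0 else 2 * n / total  # total == 0 only for root vs root

def dist_wp(c1, c2):
    return 1.0 - sim_wp(c1, c2)

print(round(sim_wp("dog", "cat"), 3))     # 0.667: lso is "mammal", n1 = n2 = 1, n = 2
print(round(sim_wp("dog", "amoeba"), 3))  # 0.4:   lso is "animal", n1 = 2, n2 = 1, n = 1
```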
Leacock and Chodorow
  $\mathrm{sim}_{LC}(c_1, c_2) = -\log \frac{1 + \mathrm{length}(c_1, c_2)}{2D}$
  $\mathrm{length}(c_1, c_2)$ is Rada's shortest path
  $D$ is the height of the ontology
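A short numeric sketch, assuming the formula reads -log((1 + length) / (2D)). The height `D = 3` is a made-up value for a toy ontology, and natural logarithms are an arbitrary choice.

```python
import math

def sim_lc(length, height):
    """Leacock-Chodorow similarity from a shortest-path length (in edges)
    and the height D of the ontology."""
    return -math.log((1 + length) / (2 * height))

D = 3  # hypothetical height of a toy ontology
for length in (0, 2, 4):
    print(length, round(sim_lc(length, D), 4))
```

The similarity falls as the path grows: identical concepts (length 0) score log(2D), the maximum for a given height.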
Resnik
  The similarity of two concepts is their shared information: the information content of their lowest superordinate.
  The information content of $c$ is given by $-\log p(c)$: the less frequent a concept is, the more information it contains.
  $\mathrm{sim}_R(c_1, c_2) = -\log p(\mathrm{lso}(c_1, c_2))$
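A sketch of how the probabilities might be estimated and used. The taxonomy and the counts in `OWN_COUNTS` are invented; each occurrence of a concept is also counted for all of its ancestors, so p(root) = 1.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "amoeba": "animal"}
OWN_COUNTS = {"dog": 40, "cat": 30, "mammal": 10, "amoeba": 5, "animal": 15}

def ancestors(c):
    """c followed by its ancestors up to the root."""
    chain = [c]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def probabilities():
    """p(c): share of all occurrences that fall under concept c."""
    totals = {c: 0 for c in OWN_COUNTS}
    for concept, count in OWN_COUNTS.items():
        for node in ancestors(concept):
            totals[node] += count
    n = sum(OWN_COUNTS.values())
    return {c: totals[c] / n for c in totals}

P = probabilities()

def information_content(c):
    return math.log(1 / P[c])  # equals -log p(c)

def lso(c1, c2):
    above_c2 = set(ancestors(c2))
    return next(node for node in ancestors(c1) if node in above_c2)

def sim_resnik(c1, c2):
    return information_content(lso(c1, c2))

print(round(sim_resnik("dog", "cat"), 4))     # IC of "mammal", p = 0.8
print(round(sim_resnik("dog", "amoeba"), 4))  # IC of the root "animal", p = 1
```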
Lin
  The similarity between arbitrary objects A and B is measured by the ratio between the amount of information needed to state their commonality and that needed to fully describe what they are.
Lin
  $\mathrm{sim}_L(c_1, c_2) = \frac{2 \log p(\mathrm{lso}(c_1, c_2))}{\log p(c_1) + \log p(c_2)}$
Jiang and Conrath
  $\mathrm{dist}_{JC}(c_1, c_2) = 2 \log p(\mathrm{lso}(c_1, c_2)) - (\log p(c_1) + \log p(c_2))$
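The Lin and Jiang-Conrath formulas use the same three probabilities, so they can be compared side by side. In this sketch the probabilities in `P` are invented and the lowest superordinate is passed in explicitly rather than computed.

```python
import math

# Hypothetical concept probabilities; "animal" plays the role of the root.
P = {"dog": 0.4, "cat": 0.3, "mammal": 0.8, "animal": 1.0}

def sim_lin(c1, c2, lso):
    """Lin: shared information relative to the total information of c1 and c2."""
    return 2 * math.log(P[lso]) / (math.log(P[c1]) + math.log(P[c2]))

def dist_jc(c1, c2, lso):
    """Jiang and Conrath: information of c1 and c2 not shared through lso."""
    return 2 * math.log(P[lso]) - (math.log(P[c1]) + math.log(P[c2]))

print(round(sim_lin("dog", "cat", "mammal"), 4))
print(round(dist_jc("dog", "cat", "mammal"), 4))
print(sim_lin("dog", "dog", "dog"), dist_jc("dog", "dog", "dog"))  # 1.0 0.0
```

Lin's measure is a ratio bounded by 1, while the Jiang-Conrath distance is a difference of the same quantities and is 0 for identical concepts.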
Problems in detecting lexical chains
  Under-chaining: the difficulty of finding semantically related words
  Over-chaining: wrongly linking semantically unrelated words
Under-chaining: named entity recognition
  In the example passage from the "Lexical chains" slide, names such as "New Economy", "Sand Hill Road", "Menlo Park" and "Kleiner Perkins" are named entities that must be recognized as single units.
Under-chaining: coreference resolution
  In the example passage from the "Lexical chains" slide, pronouns such as "They" (the venture capitalists) must be resolved before their referents can join a chain.
Over-chaining
  (Same example passage as on the "Lexical chains" slide.)
Semantic spaces in Latent Semantic Indexing (LSI)
Vector space model
The LSI idea
  Reducing the dimensionality of the vector space, similarly to the least squares method
  The effect is the creation of semantic spaces containing semantically related words
  http://lsa.colorado.edu
Terms-documents array (ex. from Manning and Schütze, 1999)
  A =
              d1  d2  d3  d4  d5  d6
  cosmonaut    1   0   1   0   0   0
  astronaut    0   1   0   0   0   0
  moon         1   1   0   0   0   0
  car          1   0   0   1   1   0
  truck        0   0   0   1   0   1
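Before any dimensionality reduction, terms can be compared directly through their rows in the terms-documents array. The sketch below uses cosine similarity, a common choice for the vector space model rather than something stated on the slide.

```python
import math

TERMS = ["cosmonaut", "astronaut", "moon", "car", "truck"]
# Rows are terms, columns are documents d1..d6 (example matrix A).
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def term_similarity(t1, t2):
    return cosine(A[TERMS.index(t1)], A[TERMS.index(t2)])

print(term_similarity("cosmonaut", "astronaut"))        # 0.0: no shared document
print(round(term_similarity("cosmonaut", "moon"), 2))   # 0.5
print(round(term_similarity("car", "truck"), 2))        # 0.41
```

The near-synonyms "cosmonaut" and "astronaut" never co-occur, so their raw similarity is zero; this is the gap the reduced semantic space is meant to close.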
Singular value decomposition (SVD)
  $A_{t \times d} = T_{t \times n} \, S_{n \times n} \, (D_{d \times n})^T$, $n = \min(t, d)$
  T =
              Dim 1  Dim 2  Dim 3  Dim 4  Dim 5
  cosmonaut   -0.44  -0.30   0.57   0.58   0.25
  astronaut   -0.13  -0.33  -0.59   0.00   0.73
  moon        -0.48  -0.51  -0.37   0.00  -0.61
  car         -0.70   0.35   0.15  -0.58   0.16
  truck       -0.26   0.65  -0.41   0.58  -0.09
  S = diag(2.16, 1.59, 1.28, 1.00, 0.39)
  D^T =
           d1     d2     d3     d4     d5     d6
  Dim 1  -0.75  -0.28  -0.20  -0.45  -0.33  -0.12
  Dim 2  -0.29  -0.53  -0.19   0.63   0.22   0.41
  Dim 3   0.28  -0.75   0.45  -0.20   0.12  -0.33
  Dim 4   0.00   0.00   0.58   0.00  -0.58   0.58
  Dim 5  -0.53   0.29   0.63   0.19   0.41  -0.22
Properties of SVD
  SVD is unique (up to the signs of matching columns of $T$ and $D$ when the singular values are distinct)
  $T$ and $D$ are orthonormal: $T^T T = D^T D = I$
  The values of $S$ are sorted
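These properties can be checked numerically on the example. The sketch below uses plain Python and the two-decimal values of T, S and D^T from the example, with signs chosen so that the product T S D^T reproduces A; the tolerances only absorb that rounding.

```python
# Rows: cosmonaut, astronaut, moon, car, truck; columns: Dim 1..Dim 5.
T = [
    [-0.44, -0.30,  0.57,  0.58,  0.25],
    [-0.13, -0.33, -0.59,  0.00,  0.73],
    [-0.48, -0.51, -0.37,  0.00, -0.61],
    [-0.70,  0.35,  0.15, -0.58,  0.16],
    [-0.26,  0.65, -0.41,  0.58, -0.09],
]
S = [2.16, 1.59, 1.28, 1.00, 0.39]  # diagonal of S
# Rows: Dim 1..Dim 5; columns: d1..d6 (this is D transposed).
DT = [
    [-0.75, -0.28, -0.20, -0.45, -0.33, -0.12],
    [-0.29, -0.53, -0.19,  0.63,  0.22,  0.41],
    [ 0.28, -0.75,  0.45, -0.20,  0.12, -0.33],
    [ 0.00,  0.00,  0.58,  0.00, -0.58,  0.58],
    [-0.53,  0.29,  0.63,  0.19,  0.41, -0.22],
]
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 0, 0, 1, 0, 1],
]

def reconstruct(T, S, DT):
    """Entry (i, j) of the product T * diag(S) * D^T."""
    return [[sum(T[i][k] * S[k] * DT[k][j] for k in range(len(S)))
             for j in range(len(DT[0]))] for i in range(len(T))]

def max_deviation_from_identity(vectors):
    """Largest |<v_i, v_j> - delta_ij| over all pairs of the given vectors."""
    worst = 0.0
    for i, u in enumerate(vectors):
        for j, v in enumerate(vectors):
            dot = sum(x * y for x, y in zip(u, v))
            worst = max(worst, abs(dot - (1.0 if i == j else 0.0)))
    return worst

recon = reconstruct(T, S, DT)
max_err = max(abs(recon[i][j] - A[i][j]) for i in range(5) for j in range(6))
T_columns = [list(col) for col in zip(*T)]

print(max_err < 0.05)                                 # T S D^T reproduces A
print(max_deviation_from_identity(T_columns) < 0.02)  # T^T T = I
print(max_deviation_from_identity(DT) < 0.02)         # D^T D = I
print(S == sorted(S, reverse=True))                   # singular values are sorted
```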
Reduced matrix
  $\hat{A}$ is the rank-$k$ approximation of $A$ that minimizes $\lVert A - \hat{A} \rVert_2$
  SVD maps the $n$-dimensional space onto a $k$-dimensional one, with $n \gg k$
  Common values for $k$ are 100 and 150.
  $B = S_{2 \times 2} \, (D_{d \times 2})^T$
  B =
           d1     d2     d3     d4     d5     d6
  Dim 1  -1.62  -0.60  -0.04  -0.97  -0.71  -0.26
  Dim 2  -0.46  -0.84  -0.30   1.00   0.35   0.65
Document correlation (Manning and Schütze, 1999)
  $A^T A = (T S D^T)^T (T S D^T) = D S T^T T S D^T = (D S)(D S)^T = (S D^T)^T (S D^T) = B^T B$
         d1     d2     d3     d4     d5     d6
  d1    1.00
  d2    0.78   1.00
  d3    0.40   0.88   1.00
  d4    0.47  -0.18  -0.62   1.00
  d5    0.74   0.16  -0.32   0.94   1.00
  d6    0.10  -0.54  -0.87   0.93   0.74   1.00
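The correlation values can be reproduced from the two-dimensional document vectors, that is, the columns of B. This sketch assumes the entries are normalized correlations (cosines) and uses the B values of the example with their signs; only the relative signs within each dimension matter for the result.

```python
import math

DOCS = ["d1", "d2", "d3", "d4", "d5", "d6"]
# Rows: Dim 1, Dim 2; columns: d1..d6 (reduced matrix B).
B = [
    [-1.62, -0.60, -0.04, -0.97, -0.71, -0.26],
    [-0.46, -0.84, -0.30,  1.00,  0.35,  0.65],
]

def doc_vector(name):
    """Column of B for the given document."""
    j = DOCS.index(name)
    return [row[j] for row in B]

def correlation(a, b):
    """Cosine between two documents in the reduced space."""
    u, v = doc_vector(a), doc_vector(b)
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

print(round(correlation("d1", "d2"), 2))  # 0.78
print(round(correlation("d4", "d5"), 2))  # 0.94
print(round(correlation("d3", "d6"), 2))  # -0.87
```

Documents d1 to d3 and d4 to d6 form two groups in the reduced space, with high correlations inside each group and low or negative ones across groups.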
Term correlation
  $A A^T = T S D^T (T S D^T)^T = T S D^T D S T^T = (T S)(T S)^T$