Cross Document Entity Disambiguation. August 22, 2007 Johns Hopkins Summer Workshop


1 Cross Document Entity Disambiguation August 22, 2007 Johns Hopkins Summer Workshop

2 Person Entities for CDC. Corpora: David Day, Janet Hitzeman. Relations: Su Jian, Stanley Yong, Claudio Guiliano, Gideon Mann. CDC Features: Jason Duncan, Paul McNamee, Rob Hall, Mike Wick. Clustering/Machine Learning: Mike Wick, Rob Hall.

3 Problem Description. Disambiguate entities across a corpus, at the document, entity, or mention level. Document-level disambiguation / Web People corpora: the SPOCK corpus (description, challenge page, discussion forum), which has a high ambiguity level (e.g. The Jim Smith Society, James Smith).

4 Google Search for James Smith. Results include: James Smith Cree Nation (P.O. Box 1059, Melfort, Saskatchewan S0E 1A0); James Smith (political figure), Wikipedia: a signer of the United States Declaration of Independence (about 1719 to July 11, 1806) as a representative of Pennsylvania; Band Details: official name James Smith, band number 370, Melfort, SK; Comox Valley Real Estate: James Smith, your realtor for the Comox Valley area (Comox, Courtenay, Cumberland, Union Bay, Royston, Black Creek); Watercolor Snapshots by James Smith: your portrait custom painted in watercolor.

5 Problem Description. Entity-level disambiguation (ACE CDC annotation): PER, GPE, LOC, and ORG entities that have a NAME string on the coreference chain AND are +SPECIFIC. Low ambiguity level (B-cubed baseline of 0.80 F, versus 0.09 F for the Spock corpus, under the shatter-all condition, one node per cluster).
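
For reference, a minimal sketch of the B-cubed precision/recall/F1 computation behind these numbers; the mention ids and gold clustering below are a toy example, not ACE or Spock data:

```python
def b_cubed(system, gold):
    """B-cubed precision, recall, and F1 over mentions.

    system, gold: dicts mapping mention id -> cluster id.
    """
    precision = recall = 0.0
    for m in system:
        same_sys = [o for o in system if system[o] == system[m]]
        same_gold = [o for o in gold if gold[o] == gold[m]]
        overlap = len(set(same_sys) & set(same_gold))
        precision += overlap / len(same_sys)
        recall += overlap / len(same_gold)
    n = len(system)
    p, r = precision / n, recall / n
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# "Shatter-all" baseline: every mention in its own singleton cluster.
gold = {"m1": "A", "m2": "A", "m3": "B"}
shattered = {m: m for m in gold}
print(b_cubed(shattered, gold))  # (1.0, 0.666..., 0.8) on this toy gold clustering
```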

6 Features from previous work. Document-level bag-of-words / NER-entity features (basically all previous systems). Local contextual information (bags of words / NER entities in local context). Syntactic features (base NPs in document / local contexts; Chen & Martin). Basic relational information (Mann & Yarowsky): DOB, POB, etc.
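
As a small illustration of the document-level bag-of-words comparison these feature sets build on, here is a sketch of a plain cosine similarity between two documents; the tokenization and weighting are deliberately simplified (no TFIDF term selection, no NER):

```python
import math
from collections import Counter

def cosine_bow(doc_a, doc_b):
    """Cosine similarity between two documents' bags of words."""
    a, b = Counter(doc_a.lower().split()), Counter(doc_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(cosine_bow("John Smith plays jazz saxophone",
                 "the saxophone solo by John Smith"))
```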

7 Three areas where the workshop can make a contribution: 1. More and better features. 2. More use of relation information from varying sources (ground-truth and system-generated relations, supervised and unsupervised). 3. Different clustering procedures than standard greedy single-link agglomerative clustering.

8 Experiments. Document-level information (BoW, BoE). Mention-level information (BoW, BoE). Topic models (SPOCK). Relations: ACE (supervised), ORG-AFF 82 F, PER-SOC 91 F; SPOCK (unsupervised). Different clustering techniques (SPOCK).

9 Entity Disambiguation Michael Wick

10 Which J. Smith? Figure: web docs to be matched to candidate identities (Jazz Musician, Politician, Student, Investor, CEO, Historian, ...?).

11 First Order Model. Classifier/evaluator SCORE: $P(Y \mid X) = \frac{1}{Z} \prod_i f(y_i, x_i) \prod_{ij} \bigl(1 - f(y_{ij}, x_{ij})\bigr)$. Find the configuration that maximizes this objective function. Greedy agglomerative approximation.
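
A minimal sketch of the greedy agglomerative approximation, assuming a pairwise compatibility function f that returns a score in [0, 1]; the scoring function, threshold, and toy data below are illustrative stand-ins, not the workshop's actual feature model:

```python
def greedy_agglomerative(items, f, threshold=0.5):
    """Greedy single-link agglomerative clustering.

    items: mention or document representations.
    f(a, b): pairwise compatibility score in [0, 1].
    Repeatedly merges the best-scoring pair of clusters until no
    cross-cluster pair scores above the threshold.
    """
    clusters = [[x] for x in items]
    while True:
        best_score, best_pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single link: take the best score across the two clusters
                score = max(f(a, b) for a in clusters[i] for b in clusters[j])
                if score > best_score:
                    best_score, best_pair = score, (i, j)
        if best_pair is None:
            return clusters
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))

# toy usage: nearby numbers are "compatible"; yields [[1, 2], [10, 11]]
print(greedy_agglomerative([1, 2, 10, 11],
                           lambda a, b: 1.0 / (1 + abs(a - b)),
                           threshold=0.4))
```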

12 ? = (Traditional Features): bags of words, title overlaps, named entities, chunking, TFIDF term selection.

13 Two articles on John Smith, the jazz musician: "...and John Smith on saxophone..." and "...his melodic improvisation...". NO WORDS IN THE EXCERPTS OVERLAP!

14 Which J. Smith? Learned topics (top words): 1 university program learning students education; 2 ashley love ash miss hey his; 3 tx la christi corpus san; 4 company insurance financial income investor; 5 contact city state fullname; 6 masters music jazz folk orchestra; 7 registered status category read japan; 8 museum county historical society kansas; 9 photography times braunfels jp rail; 10 film alex films kill night; 11 senate senator section court company. (Figure links the excerpts "...and John Smith on saxophone..." and "...melodic improvisation..." to the candidate identities Politician, Student, Investor, CEO, Historian, Jazz Musician.)

15 Results With Topics. Table of B-Cubed, Pairwise, and MUC precision/recall/F1 for the topic-augmented system (scores omitted). Feature set: Chunks + TitleOverlap + TFIDF + NER + Relations + TOPICS!

16 Metropolis-Hastings Sampler. $P(Y \mid X) = \frac{1}{Z} \prod_i f(y_i, x_i) \prod_{ij} \bigl(1 - f(y_{ij}, x_{ij})\bigr)$ requires summing over all possible configurations (to compute Z). Probability of accepting a jump $y \to y'$: $P(y \to y') = \min\!\left[\frac{p(y')}{p(y)} \cdot \frac{p(y' \to y)}{p(y \to y')},\ 1\right]$, i.e. a likelihood ratio (the partition function cancels!) times the inverse ratio of making the jump versus reversing it.

17 Metropolis-Hastings. 1. Initialize with greedy agglomerative clustering. 2. Pick a block from the cohesion distribution. 3. Pick a new assignment for the block uniformly at random. 4. Accept with probability $P(y \to y') = \min\!\left[\frac{p(y')}{p(y)} \cdot \frac{p(y' \to y)}{p(y \to y')},\ 1\right]$ (in the slide's example the proposal is rejected).
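
A minimal sketch of this sampling loop, assuming helper functions score (the unnormalized model score of a clustering), propose (pick a block and a new assignment), and proposal_prob (the forward and reverse jump probabilities); all three are hypothetical stand-ins for the workshop's components:

```python
import random

def metropolis_hastings(initial, score, propose, proposal_prob, n_steps=1000):
    """Sample clusterings y with probability proportional to score(y).

    score(y): unnormalized model score (the partition function cancels
              in the ratio, so it is never computed).
    propose(y): draws a candidate clustering y_new from the proposal.
    proposal_prob(a, b): probability of proposing b when the chain is at a.
    """
    y = initial
    for _ in range(n_steps):
        y_new = propose(y)
        ratio = (score(y_new) / score(y)) * \
                (proposal_prob(y_new, y) / proposal_prob(y, y_new))
        if random.random() < min(ratio, 1.0):  # accept the jump, else keep y
            y = y_new
    return y
```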

18 Deriving the Cohesion Distribution. Figure: nodes A-F with pairwise cohesion scores (0.97, 0.91, 0.84, 0.51) induce a block distribution: BE 0.97, AC 0.91, BED 0.84, BEDF 0.84.
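
A small sketch of drawing a block to move from such a cohesion-weighted distribution; the block scores are copied from the slide's example, and the normalize-and-sample scheme is an assumption:

```python
import random

# Block distribution from the slide's example (block -> cohesion score).
blocks = {("B", "E"): 0.97, ("A", "C"): 0.91,
          ("B", "E", "D"): 0.84, ("B", "E", "D", "F"): 0.84}

def sample_block(block_scores):
    """Draw a block with probability proportional to its cohesion score."""
    r = random.random() * sum(block_scores.values())
    for block, score in block_scores.items():
        r -= score
        if r <= 0:
            return block
    return block  # guard against floating-point rounding

print(sample_block(blocks))
```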

19 Results. Table of B-Cubed, Pairwise, and MUC precision/recall/F1 comparing Metropolis-Hastings (w/MH) against greedy agglomerative (scores omitted).

20 CDC results. Table of B-Cubed, Pairwise, and MUC precision/recall/F1 for the DOC, MEN, CHAIN, LEX, and REL conditions (scores omitted). B-Cubed (shattered): pr = 1.0, re = 0.67, F1 = 0.80.

21 Conclusions. Topics enable additional and powerful features. Metropolis-Hastings improves upon the greedy method.

22 Generative, Unsupervised Models for Web People Disambiguation Rob Hall (UMass Amherst)

23 A Simple Generative Process. Each document belongs to some hidden entity e. A sequence of observations $w_0, w_1, \ldots, w_n$ is generated, conditioned on the entity. Cluster by determining the mode of the posterior distribution of e given the words.

24 Word Model. The entity in a document, e, is drawn from a Dirichlet process G with concentration α. Words w form a bag of words (ignoring their sequential nature) and are drawn from a per-entity multinomial Θ with a (symmetric) Dirichlet prior β. (Plate diagram: w replicated over W words and D documents.)

25 Approximate Inference: Collapsed Gibbs. Start with some initialization of all e, then resample each $e_k$ in turn from the distribution
$$p(e_k \mid e^{-k}, w) \propto p(w^k \mid e)\, p(e_k \mid e^{-k}).$$
With $e^{-k}$ fixed we can use the CRP:
$$p(e_k = i \mid e^{-k}) = \frac{n_i}{n - 1 + \alpha}, \qquad p(e_k = \mathrm{new} \mid e^{-k}) = \frac{\alpha}{n - 1 + \alpha}.$$
Then Θ can be integrated out using the Polya-urn scheme:
$$p(w^k_0, \ldots, w^k_m \mid e) = \prod_{i=0}^{m} \frac{\beta + C^{w_i}_{e} + C^{w_i}_{w^k_{0..i-1}}}{\beta V + C_{e} + C_{w^k_{0..i-1}}},$$
where $C^{w}_{e}$ counts occurrences of word w in documents currently assigned to e, $C^{w}_{w^k_{0..i-1}}$ counts its occurrences among the earlier words of document k, and V is the vocabulary size.
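
A compact sketch of one such resampling step, assuming the per-entity word counts are kept in plain dictionaries; the data structures, variable names, and "<new>" marker are illustrative, not the original implementation:

```python
import random
from collections import Counter

def resample_entity(doc_words, entity_counts, entity_sizes, alpha, beta, V):
    """Collapsed Gibbs step: resample the entity of one document.

    doc_words: word tokens of document k (its counts already removed).
    entity_counts: dict entity -> Counter of word counts for that entity.
    entity_sizes: dict entity -> number of documents assigned to it.
    alpha: CRP concentration; beta: symmetric Dirichlet parameter; V: vocab size.
    """
    candidates, weights = [], []
    for e in list(entity_counts) + ["<new>"]:
        counts = entity_counts.get(e, Counter())
        # CRP prior, up to the shared 1 / (n - 1 + alpha) factor
        prior = alpha if e == "<new>" else entity_sizes[e]
        lik, local, total = 1.0, Counter(), sum(counts.values())
        for i, w in enumerate(doc_words):
            # Polya-urn predictive probability of the next word
            lik *= (beta + counts[w] + local[w]) / (beta * V + total + i)
            local[w] += 1
        candidates.append(e)
        weights.append(prior * lik)
    r = random.random() * sum(weights)  # sample proportional to prior * likelihood
    for e, weight in zip(candidates, weights):
        r -= weight
        if r <= 0:
            return e
    return candidates[-1]
```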

26 Self-stopping Word Model. Same as before, except the per-entity word distribution Θ now has an asymmetric Dirichlet prior β (learned), and each word has a Bernoulli (binary) variable z that determines whether w is drawn from the local (per-entity) distribution or from a global distribution $\Theta_G$ over words (with prior γ); z is Gibbs sampled.

27 Approximate Inference: Collapsed Gibbs. The new probability model is
$$p(e_k \mid e^{-k}, w) \propto p(w^k \mid z^k, e, w^{-k})\, p(z^k \mid e, w^{-k})\, p(e_k \mid e^{-k}).$$
This requires sampling z:
$$p(z_i = 0 \mid e, w, z^{-i}) \propto \frac{\beta_{w_i} + C^{w_i}_{e_k}}{\sum_{w} \beta_{w} + C_{e_k}}, \qquad p(z_i = 1 \mid e, w, z^{-i}) \propto \frac{\gamma + C^{w_i}_{G}}{\gamma V + C_{G}}.$$
Then, when calculating $p(e \mid w)$, only use the w which correspond to z = 0 (the probability of words from the global topic is absorbed into the normalizing constant).
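
A minimal sketch of the z resampling step under these two predictive distributions; for simplicity it assumes symmetric beta and gamma priors and dictionary-based counts:

```python
import random

def resample_z(word, entity_counts, entity_total, beta,
               global_counts, global_total, gamma, vocab_size):
    """Resample the local/global indicator for one word token.

    z = 0: the word is drawn from the entity's local distribution;
    z = 1: it is drawn from the shared global distribution.
    Simplification: symmetric beta and gamma priors (the slide's beta
    is an asymmetric, learned Dirichlet prior).
    """
    p_local = (beta + entity_counts.get(word, 0)) / \
              (beta * vocab_size + entity_total)
    p_global = (gamma + global_counts.get(word, 0)) / \
               (gamma * vocab_size + global_total)
    return 0 if random.random() < p_local / (p_local + p_global) else 1
```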

28 Incorporating Other Evidence. Duplicate the bag-of-words probability model for each other class of observation: a bag of observations for each evidence class. (Plate diagram: the w, z, Θ plate is replicated over C evidence classes, sharing e, G, α and the global distribution $\Theta_G$, γ.)

29 Gibbs Sampling in this Model. The new probability model is
$$p(e_k \mid e^{-k}, w, z) \propto p(e_k \mid e^{-k}) \prod_{c} p(w^k_c \mid z_c, e, w^{-k}_c)\, p(z^k_c \mid e, z^{-k}_c, w_c).$$
Start with an initialization of e. Iterate over each document k: for each type of observation c, resample $z^k_c$; then resample $e_k$.
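
The outer loop might look like the following sketch, where the state object (with remove_document/add_document bookkeeping) and the two resampling callbacks are hypothetical placeholders for the steps described above:

```python
def gibbs_sweep(documents, state, resample_z_for_class, resample_entity,
                n_iterations=100):
    """Outer loop of the multi-evidence sampler.

    documents: list of dicts mapping evidence class -> list of observations.
    state: holds entity assignments e, indicators z, and all count tables,
           with remove_document / add_document bookkeeping.
    resample_z_for_class, resample_entity: callbacks implementing the
    per-class z step and the entity step.
    """
    for _ in range(n_iterations):
        for k, doc in enumerate(documents):
            state.remove_document(k)          # take document k out of the counts
            for c, observations in doc.items():
                resample_z_for_class(k, c, observations, state)
            new_entity = resample_entity(k, doc, state)  # uses only z = 0 words
            state.add_document(k, new_entity)
    return state
```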

30 Spock Results. Table of B-Cubed and Pairwise precision, recall, and F1 for the models Words, Words-Stop, Words-URL, Words-URL-Stop, Words-URL-WN, and Words-URL-WN-Stop (scores omitted).
