Unsupervised Coreference of Publication Venues


Robert Hall, Charles Sutton, and Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA

December 21, 2007

Abstract

Information about the venues of research papers is useful for information retrieval and for automatic mining of the literature. Important to processing venue information is venue coreference, the task of determining which possibly dissimilar mentions of venues refer to the same underlying venue. A natural unsupervised technique for this problem is generative mixture modeling, and indeed such models have been successfully applied to paper and author coreference. But standard models perform poorly on venue strings, because venue strings exhibit greater variance than title or author strings. In this paper, we exploit the fact that venues have characteristic distributions over titles. We do this using a generative model that explicitly models a venue-specific distribution over title words. The model uses a single set of latent variables to control two disparate clustering models: a Dirichlet-multinomial model over titles, and a nonexchangeable string-edit model over venues. Incorporating title information yields a substantial improvement in performance: a 58% reduction in error over a standard Dirichlet process mixture. The model successfully disambiguates several venues that have string-identical abbreviations.

1 Introduction

An important aspect of any paper is where it is published. Accurate information about a paper's venue is important for the user interface of search engines such as Google Scholar, Rexa, and Citeseer, and for performing data mining over the research literature. Particularly useful is venue coreference: given a set of strings that name venues, such as those listed in the bibliographies of research papers, determine which strings name the same underlying venue. Coreference, also known as deduplication, record linkage, and identity uncertainty, is a problem that arises in natural language processing and knowledge discovery. The coreference problem is difficult because a single entity can be named by dissimilar strings (for example, "AAAI" and "Proceedings of the Fourteenth National Conference on Artificial Intelligence") and different entities can be denoted by identical strings (for example, "ISWC" is a commonly used abbreviation both for the International Semantic Web Conference and for the International Symposium on Wearable Computers).

Unsupervised methods for coreference problems have attracted much recent interest [Carbonetto et al., 2005, Pasula et al., 2003, Haghighi and Klein, 2007], including models that have been applied to coreference among research papers. However, less attention has been paid to observed venue strings, which have very different properties than observed title and author strings. For observed title strings, we expect that many citations will list the canonical title, while others have small, weakly correlated typographical errors.

[Figure 1: Left, the finite mixture model over venues; the DP mixture is the infinite limit of this model. Middle, the venue-title DP mixture. Right, the venue-title DP mixture with a latent-Dirichlet model over titles.]

For observed venue strings, on the other hand, the edit distance between coreferent strings is much larger: several words may be added or deleted. Furthermore, while it is reasonable to model typos in title strings as independent, in venue strings several variants often appear equally commonly. These two aspects of venue coreference tend to severely hurt the performance of the standard mixture models that have been used for title and author coreference, because on venues they do not perform enough merges.

In this paper, we address these difficulties in venue coreference by exploiting information from the papers' titles. Venues tend to focus on specific research areas, and those areas are reflected in the titles of the papers that they publish. Our model of the venue strings is a Dirichlet process mixture, following previous work [Carbonetto et al., 2005, Haghighi and Klein, 2007]. The main novelty of our model is that it uses a single set of mixture components to combine two disparate clustering models: a Dirichlet-multinomial mixture, and a non-conjugate string-edit distortion model. In the current application, this means that each underlying venue generates titles according to a venue-specific language model, which encourages merging clusters with similar title distributions, even if their distributions over venue strings are somewhat different. On real-world citation data, this model performs substantially better than a standard DP mixture, yielding a 58% reduction in error.

The focus of this paper, however, is the application of recent nonparametric Bayesian methods to a new, rich task to which they are particularly well suited. One of the chief advantages of graphical modeling is its modularity; standard modules such as the DP can be flexibly mixed with domain-specific modeling. While earlier work, such as that of Carbonetto et al., shows how standard nonparametric Bayesian methods can be applied to related coreference problems, the current work provides a case study in how improved accuracy can be obtained by tailoring standard modeling techniques to the characteristics of a particular problem.

2 Model

In this section, we describe our model of venue and title mentions. Each mention $(v, t)$ is a rendering of a paper's title $t$ and venue $v$, such as from the bibliography of a citing research paper. These strings may contain typographic and other errors. The data set as a whole is a set of mentions $\{(v_m, t_m)\}_{m=1}^M$. Each venue mention $v_m$ is a sequence of words $(v_{m1}, v_{m2}, \ldots, v_{m,N(m)})$ and each title mention a sequence of words $(t_{m1}, t_{m2}, \ldots, t_{m,N(m)})$.
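To make the notation concrete, the following sketch shows one way a mention $(v_m, t_m)$ could be represented. This is our own illustration: the class and function names are hypothetical, and the tokenizer is an assumption (the text does not specify one).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Mention:
    """One citation mention: the tokenized venue and title strings (v_m, t_m)."""
    venue: List[str]   # (v_m1, ..., v_m,N(m))
    title: List[str]   # (t_m1, ..., t_m,N(m))

def tokenize(s: str) -> List[str]:
    """Lowercased whitespace tokenization (an assumption, not the paper's)."""
    return s.lower().split()

mention = Mention(
    venue=tokenize("Proc. of the National Conf. on Artificial Intelligence"),
    title=tokenize("An Example Paper Title"),
)
```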

In the remainder of this section, we describe our model by incrementally augmenting a simple finite mixture model, including an MCMC algorithm at each step. All of our models are mixture models in which each mixture component is interpreted as an underlying venue. First, we describe a finite mixture model of the venue mentions only, using a string-edit model customized for this task (Section 2.1). Second, we modify this model to allow an infinite number of components by using a Dirichlet process mixture (Section 2.2). Then, we augment this model with title mentions that are drawn from a per-venue unigram model (Section 2.3). Finally, we describe a venue-title model in which the titles are drawn from a latent Dirichlet allocation (LDA) model [Blei et al., 2003] (Section 2.4).

2.1 Finite Mixture Model over Venues

First we describe a finite mixture model, where the number of venues $C$ is chosen in advance. The mixture proportions $\beta$ are sampled from a symmetric Dirichlet with concentration parameter $\alpha$. Each cluster $c \in \{1, \ldots, C\}$ is associated with a canonical venue string $x_c$, which is sampled from a bigram language model with uniform transition and emission probabilities. For each mention, the model selects a venue assignment $c_m$ according to the venue proportions $\beta$. Finally, we generate the venue mention $v_m = v_{m,0} \cdots v_{m,a}$ by distorting the venue's canonical string $x_{c_m} = x_{c_m,0} \cdots x_{c_m,b}$.

The distortion probability $p(v_m \mid x_{c_m})$ is modeled by an HMM string-edit model with three edit operations: substitute, in which a token $x_{c_m,i}$ is replaced by either an abbreviation or a lengthening of itself; insert, which generates a token of $v_m$; and delete, which removes a token of $x_{c_m}$. The distortion model is a finite state machine with three states corresponding to the three edit actions, and the emissions from each state are conditioned on the corresponding token of $x_{c_m}$. We assume transition probabilities $p(s_i = \text{insert} \mid s_{i-1}) = p(s_i = \text{delete} \mid s_{i-1}) = 0.3$ and $p(s_i = \text{substitute} \mid s_{i-1}) = 0.4$, so the model favors substituting abbreviations for words and vice versa. The emission probability of the delete state is 1, because we condition on the token that was deleted from the canonical string. From the insert state we assume a uniform emission probability over the vocabulary of venue tokens. When substituting $v_{m,j}$ for $x_{c_m,i}$, the emission probability is

$$p(v_{m,j} \mid s_{i,j} = \text{substitute}, x_{c_m,i}) = \begin{cases} 1/a(x_{c_m,i}) & \text{if } x_{c_m,i} \text{ starts with } v_{m,j} \\ 1/l(v_{m,j}) & \text{if } v_{m,j} \text{ starts with } x_{c_m,i}, \end{cases}$$

in which $a(w)$ is the number of words in the vocabulary that are prefixes of $w$, and $l(w)$ is the number of words for which $w$ is a prefix. Calculating the probability of a canonical string generating a venue mention requires summing over all sequences of edit states. This is done by the analog of the forward algorithm for HMMs in this finite state machine.

In summary, the finite-mixture model is

$$\begin{aligned}
\beta &\sim \text{Dirichlet}(\alpha \mathbf{1}) && (1) \\
x_c &\sim \text{Bigram}(1, 1) && (2) \\
c_m \mid \beta &\sim \text{Discrete}(\beta) \\
v_m \mid x, c_m &\sim \text{StringEdit}(x_{c_m})
\end{aligned}$$

The graphical model is shown in Figure 1 (left). We can sample from the distribution $p(c_1, \ldots, c_M, x_1, \ldots, x_C \mid v)$ using Gibbs sampling; we integrate out $\beta$, so that the state of the sampler is $(c_1, \ldots, c_M, x_1, \ldots, x_C)$. This model requires choosing the number of venues in advance, which is obviously unrealistic; in the next section, we remove this requirement.
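Before moving on, here is a minimal sketch of the forward-style dynamic program for the distortion probability $p(v \mid x)$ described above. It is our own illustration, not the authors' code: the prefix-count functions $a$ and $l$ and the vocabulary size are supplied by the caller, the transition probabilities are treated as state-independent constants, and the initial-state distribution of the edit HMM is ignored for simplicity.

```python
from typing import Callable, List

def string_edit_prob(
    v: List[str],                 # observed venue mention tokens
    x: List[str],                 # canonical venue string tokens
    a: Callable[[str], int],      # a(w): # vocabulary words that are prefixes of w
    l: Callable[[str], int],      # l(w): # vocabulary words of which w is a prefix
    vocab_size: int,
    p_ins: float = 0.3,
    p_del: float = 0.3,
    p_sub: float = 0.4,
) -> float:
    """Sum over all edit sequences that turn x into v (a forward-style DP sketch)."""

    def sub_emit(vw: str, xw: str) -> float:
        if xw.startswith(vw):           # mention word is an abbreviation (prefix) of the canonical word
            return 1.0 / a(xw)
        if vw.startswith(xw):           # mention word is a lengthening of the canonical word
            return 1.0 / l(vw)
        return 0.0

    B, A = len(x), len(v)
    # F[i][j] = probability of generating v[:j] from x[:i]
    F = [[0.0] * (A + 1) for _ in range(B + 1)]
    F[0][0] = 1.0
    for i in range(B + 1):
        for j in range(A + 1):
            if i > 0:                   # delete x[i-1]; emission probability 1
                F[i][j] += F[i - 1][j] * p_del
            if j > 0:                   # insert v[j-1]; uniform emission over the vocabulary
                F[i][j] += F[i][j - 1] * p_ins / vocab_size
            if i > 0 and j > 0:         # substitute v[j-1] for x[i-1]
                F[i][j] += F[i - 1][j - 1] * p_sub * sub_emit(v[j - 1], x[i - 1])
    return F[B][A]
```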

2.2 Dirichlet Process Mixture over Venues

The finite mixture model requires specifying the number of clusters a priori, which is unrealistic. For this reason, recent work in unsupervised coreference [Carbonetto et al., 2005, Haghighi and Klein, 2007] has focused on nonparametric models, and in particular on the infinite limit of the finite mixture model of Section 2.1, which is the Dirichlet process (DP) mixture. A Dirichlet process is a particular class of distributions over distributions. In a DP mixture, this distribution over distributions is used as the distribution over the mixing proportions. A DP is attractive for two reasons: first, the number of components is unbounded, but the number that appear in a single sample is always finite; and second, samples from the induced cluster identities display a rich-get-richer property that is natural in many domains. For a review of modeling and inference using DPs and DP mixtures, see Teh et al. [2006]. The resulting model is

$$\begin{aligned}
\beta &\sim \text{DP}(\alpha, 1) \\
x_c &\sim \text{Bigram}(1, 1) \quad c \in \{1, 2, \ldots\} && (3) \\
c_m \mid \beta &\sim \text{Discrete}(\beta) \\
v_m \mid x, c_m &\sim \text{StringEdit}(x_{c_m})
\end{aligned}$$

This is the infinite limit of the graphical model shown in Figure 1 (left). This model follows the basic outline of the model of Carbonetto et al. [2005] for coreference of author and title mentions, except with a distortion model that is tuned for this task.

Now we discuss sampling from this model. Inference is complicated by the fact that the distribution $p(v_m \mid x_{c_m})$ is not conjugate to the prior over canonical strings $p(x_{c_m})$; our general algorithm follows Neal [2000], but with a slightly different proposal. The state of the sampler is the set of all cluster indices $\{c_m\}$ and the set of canonical strings $\{x_c\}$. Let $C$ be the number of clusters in the current state of the sampler; we assume that the clusters are numbered $1 \ldots C$. Each iteration of sampling has two steps.

In the first step, we resample the cluster identities using a Metropolis-Hastings update. For every mention $m \in \{1 \ldots M\}$, we propose a new cluster $c'_m$ from the distribution

$$p(c'_m \mid c_{-m}, v) \propto \begin{cases} p(c'_m \mid c_{-m})\, p(v_m \mid c'_m) & \text{if } c'_m = c_j \text{ for some } j \neq m \\ p(c'_m \mid c_{-m})\, p(v_m \mid c'_m, x_{c'_m} = v_m) & \text{if } c'_m \notin \{1, \ldots, C\}. \end{cases}$$

That is, if the proposed cluster is one that already has mentions in it, the proposal is proportional to the Gibbs sampling proposal. If the proposed cluster is a new cluster, then we assume that its canonical string is the new venue mention. Ideally, we would sample $x_{c'_m}$ in this case; Neal [2000] suggests using the prior $p(x)$, but this would lead to a string that would hardly ever match $v_m$. In the current setting, we are mainly interested in finding a high-probability configuration, which makes our choice seem reasonable.

In the second step, for each cluster we resample the canonical string $x_c$ conditioned on all of the cluster identities by Gibbs sampling, but with the restriction that $x_c$ must be identical to one of the observed venue strings in the cluster. This restriction is a slight abuse, but seems to work well in practice.
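As an illustration of the first step of this sampler, the sketch below resamples the cluster identities under stated simplifying assumptions of ours: the prior $p(c'_m \mid c_{-m})$ is the Chinese-restaurant-process prior implied by the DP, a proposed new cluster takes $x = v_m$ as its canonical string, and because the proposal is proportional to this approximate conditional posterior we simply adopt it, omitting the Metropolis-Hastings accept/reject bookkeeping used in the paper. All names are ours; the venue likelihood is passed in.

```python
import math
import random
from collections import Counter
from typing import Callable, Dict, List

def resample_clusters(
    assignments: List[int],                  # current c_m for each mention
    canonical: Dict[int, List[str]],         # canonical string x_c for each active cluster
    venues: List[List[str]],                 # observed venue mentions v_m (token lists)
    venue_loglik: Callable[[List[str], List[str]], float],  # log p(v | x) under the distortion model
    alpha: float,
    rng: random.Random,
) -> None:
    """One sweep over mentions of the cluster-identity update (a simplified sketch)."""
    for m, v_m in enumerate(venues):
        counts = Counter(assignments)
        counts[assignments[m]] -= 1                      # hold mention m out
        new_label = max(canonical) + 1                   # label for a brand-new cluster

        candidates, logw = [], []
        for c, n in counts.items():
            if n > 0:                                    # existing cluster: CRP mass n times venue likelihood
                candidates.append(c)
                logw.append(math.log(n) + venue_loglik(v_m, canonical[c]))
        candidates.append(new_label)                     # new cluster, with x set to v_m itself
        logw.append(math.log(alpha) + venue_loglik(v_m, v_m))

        mx = max(logw)
        weights = [math.exp(w - mx) for w in logw]
        choice = rng.choices(candidates, weights=weights)[0]

        old = assignments[m]
        assignments[m] = choice
        if choice == new_label:
            canonical[new_label] = list(v_m)
        if old != choice and all(c != old for c in assignments):
            del canonical[old]                           # drop a now-empty cluster
```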

2.3 DP Mixture over Venues and Titles

Now we augment the model to include title mentions. This model jointly clusters venues and titles using a single set of latent variables that control a string-edit model for the venues and a Dirichlet-multinomial model for the titles. Each venue $c$ generates a distribution $\theta_c$ over title words, and every mention $m$ now generates each of its title words $t_{mi}$ from a discrete distribution with parameters $\theta_{c_m}$. This model contains all of the factors in (3), and in addition

$$\begin{aligned}
\theta_c &\sim \text{Dirichlet}(\lambda \mathbf{1}) \\
t_{mi} \mid c_m, \{\theta_c\} &\sim \text{Discrete}(\theta_{c_m}) && (4)
\end{aligned}$$

The graphical model for this is shown in Figure 1 (middle).

Sampling for this model proceeds exactly as for the venue-only DP model, except that the proposal distribution incorporates the distribution over title words, integrating out $\theta_c$. This results in the proposal

$$p(c'_m \mid c_{-m}, v, t) \propto \begin{cases} p(c'_m \mid c_{-m})\, p(v_m \mid c'_m)\, p(t_m \mid c'_m, t) & \text{if } c'_m = c_j \text{ for some } j \neq m \\ p(c'_m \mid c_{-m})\, p(v_m \mid c'_m, x_{c'_m} = v_m)\, p(t_m \mid c'_m) & \text{if } c'_m \notin \{1, \ldots, C\}, \end{cases} \qquad (5)$$

where $p(t_m \mid c'_m, t)$ is the probability of the title mention $t_m$ being generated by the proposed cluster, conditioned on the titles already in that cluster. As with the venue strings, the probability of a title string $p(t_m \mid c'_m)$ can be computed with $\theta_c$ integrated out, using a Polya urn scheme:

$$p(t_m \mid c'_m) = \prod_{i=1}^{N(m)} p(t_{mi} \mid t_{m1}, \ldots, t_{m,i-1}) = \prod_{i=1}^{N(m)} \frac{N_{\{t_{mi} = t_{mj};\, j < i\}} + 1/\lambda}{i - 1 + W/\lambda}, \qquad (6)$$

where $N_{\{t_{mi} = t_{mj};\, j < i\}}$ is the number of words in $t_m$ that precede word $i$ and are identical to it, and $W$ is the size of the title vocabulary.

2.4 DP Mixture over Venues, with Latent-Dirichlet Title Model

In the first venue-title model, every venue has a single distribution over title words. A more flexible model may be desired, both in order to discover title words that are strongly associated with particular venues, and in the hope that a more flexible title model will yield better performance. Therefore, in addition to the unigram title model, we also consider a version in which the titles are generated by latent Dirichlet allocation [Blei et al., 2003]. Currently, we dedicate exactly one topic to each venue, and then add one topic that is shared across all venues; more flexible choices are certainly possible. This model includes all of the factors of the venue-only DP model (3), and in addition

$$\begin{aligned}
\phi_g &\sim \text{Dirichlet}(\lambda_0 \mathbf{1}) \\
\phi_c &\sim \text{Dirichlet}(\lambda_1 \mathbf{1}) \\
\theta_c &\sim \text{Dirichlet}(\lambda \mathbf{1}) && (7) \\
z_{mi} \mid \theta, c &\sim \text{Discrete}(\theta_{c_m}) \\
t_{mi} \mid \phi_g, \phi_{c_m}, z_{mi} &\sim \begin{cases} \text{Discrete}(\phi_g) & \text{if } z_{mi} = 0 \\ \text{Discrete}(\phi_{c_m}) & \text{if } z_{mi} = 1 \end{cases}
\end{aligned}$$

The corresponding graphical model is shown in Figure 1 (right).

To sample from this model, we make use of the fact that the semantics of the $z_m$ variables do not depend on the venue identity, so it is reasonable to propose a move that changes $c_m$ but leaves $z_m$ unchanged. The state of this sampler contains the venue assignments $\{c_m\}$, the canonical strings $\{x_c\}$, and the title word topics $z = \{z_{mi}\}$. The sampling algorithm is:

1. Initialize $c_m = m$ for all $m$. Initialize $z$ by resampling all $z_{mi}$ for $Z$ iterations of Gibbs sampling, leaving all $c_m$ and $x_{c_m}$ fixed. Then repeat:

2. Resample all $c_m$ using Metropolis-Hastings with the proposal distribution

$$p(c'_m \mid c_{-m}, v, z) \propto \begin{cases} p(c'_m \mid c_{-m})\, p(v_m \mid c'_m)\, p(t_m \mid c'_m, t_{-m}, z_m) & \text{if } c'_m = c_j \text{ for some } j \neq m \\ p(c'_m \mid c_{-m})\, p(v_m \mid c'_m, x_{c'_m} = v_m)\, p(t_m \mid c'_m, z_m) & \text{if } c'_m \notin \{1, \ldots, C\}. \end{cases} \qquad (8)$$

3. Resample all $z_{mi}$ using Gibbs sampling.
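The title factor used in proposals (5) and (8) never needs $\theta_c$ to be instantiated. Below is a minimal sketch of the Polya-urn computation of equation (6), generalized to condition on the word counts already in the proposed cluster (for an empty cluster it reduces to the new-cluster case). The function and variable names are ours; the per-word pseudo-count $1/\lambda$ and vocabulary size $W$ follow the notation above.

```python
import math
from collections import Counter
from typing import Dict, List

def title_log_prob(
    title: List[str],                 # title mention t_m as a list of words
    cluster_counts: Dict[str, int],   # word counts from titles already in the proposed cluster
    lam: float,                       # Dirichlet concentration lambda (per-word pseudo-count 1/lambda)
    vocab_size: int,                  # W, size of the title vocabulary
) -> float:
    """log p(t_m | c'_m, t) via the Polya urn scheme, with theta_c integrated out."""
    counts = Counter(cluster_counts)
    total = sum(counts.values())
    logp = 0.0
    for word in title:
        num = counts[word] + 1.0 / lam
        den = total + vocab_size / lam
        logp += math.log(num) - math.log(den)
        counts[word] += 1             # the urn: put the observed word back with an extra copy
        total += 1
    return logp
```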

3 Related Work

A large amount of recent work has focused on DP mixture models, generative models of coreference, and their application to modeling the scientific literature. The general framework of using a per-cluster mixture model for coreference of research papers was introduced by Pasula et al. [2003]. A more detailed description of a similar model is given by Milch [2006]. These models generate the number of venues from a log-normal distribution. A variant of this model, which models the venue assignments with a hierarchical DP, was reported by Carbonetto et al. [2005], although they do not report a comparison with the log-normal model. All of these approaches model title coreference and author coreference jointly, but they do not consider venue coreference, and they generate canonical titles and author names independently. The key contribution of our work is to explicitly model how the distribution over titles depends on the venue, and to show that this leads to better performance on venue coreference.

Another related model is that of Haghighi and Klein [2007], which applies DP mixtures to noun-phrase coreference, the problem of determining which noun phrases in a document refer to the same entity, such as "George W. Bush" and "he". This work is similar in spirit to ours, in that it augments the basic DP mixture with additional variables tailored to a specific coreference task. However, noun-phrase coreference involves different phenomena from research-paper coreference, such as pronouns and anaphoric references like "the president". Thus the models of Haghighi and Klein are substantially different from the ones we propose here; for example, they include models of pronoun gender, which are not relevant here.

Within coreference, another class of state-of-the-art models is the pairwise conditional random field [McCallum and Wellner, 2005]. This model has been used for large-scale paper and author coreference in the Rexa digital library. In addition, Culotta and McCallum [2005] have applied these models to venue coreference, finding that jointly modeling coreference of other field types substantially improves performance. These models are all supervised, so our approaches have the advantage of not requiring labeled training data, although they can readily exploit labeled coreference data if it is available.

4 Experiments

In order to evaluate the benefit of modeling the correlation between venues and the titles of the papers that appear therein, we compared the performance of the various DP mixture models on a set of citations. We first obtained a list of automatically extracted citations from the Rexa database. These citations had been generated via a conditional random field used to segment the output of a program that converts PostScript documents into plain text. This process is inherently imperfect, and therefore the fields in the extracted strings occasionally contain noise such as typographical errors or extraneous tokens. The citations were mapped onto venue-title pairs, and duplicate citations (those that were string-identical in both fields) were collapsed. We chose a dozen venues on which to test the models, and assembled a corpus consisting of about 200 citations per venue. The venues covered a range of topics including artificial intelligence, machine learning, computational physics, biology, the semantic web, and wearable computers. Each venue was represented in the corpus by several diverse mention strings.
Additionally, two venues represented in the corpus shared a mention string. After removal of certain stop-words and punctuation symbols from the venue fields, we were left with 262 unique venue strings. We then expanded any mention that consisted entirely of a string of capital letters, so that each capital letter would be treated as a word. This allowed the distortion model to more easily align acronyms with their full names.
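A minimal sketch of the kind of venue-field preprocessing described above is given below. The exact stop-word list and tokenizer are not specified in the text, so those used here are our own assumptions.

```python
import re
from typing import List

# A hypothetical stop-word list; the paper does not give its exact list.
VENUE_STOPWORDS = {"of", "the", "on", "in", "and", "for"}

def preprocess_venue(venue: str) -> List[str]:
    """Strip punctuation and stop-words, then split an all-capital acronym
    mention into single-letter tokens so the string-edit model can align
    each letter with a word of the full venue name."""
    tokens = re.findall(r"[A-Za-z]+", venue)
    tokens = [t for t in tokens if t.lower() not in VENUE_STOPWORDS]
    if tokens and all(t.isupper() for t in tokens):
        return [ch for t in tokens for ch in t]   # e.g. "ISWC" -> ["I", "S", "W", "C"]
    return tokens

print(preprocess_venue("ISWC"))
# ['I', 'S', 'W', 'C']
print(preprocess_venue("Proc. of the National Conference on Artificial Intelligence"))
# ['Proc', 'National', 'Conference', 'Artificial', 'Intelligence']
```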

We then compared the performance of four models on this data set:

DPV: the Dirichlet process mixture model over distorted venue strings (see Section 2.2).

DPVT: the DPV model augmented to model distributions over titles (see Section 2.3).

DPVL: the DPVT model extended to include a global unigram distribution, so that titles are modeled as mixtures of unigrams from the global distribution and the venue-specific one (see Section 2.4).

STR: a simple heuristic intended to ascertain the difficulty of coreference on this data set. This model predicts coreference between mentions only when the venue strings are identical (after preprocessing).

For each of these models we report the best performance we could obtain after a scan through the range of parameter settings. For each of the generative models we performed 1000 iterations of the Metropolis-Hastings sampling algorithm, where an iteration consisted of re-sampling each $c_m$, $x_c$, and $z_m$ (where applicable) in turn.

4.1 B³ Evaluation Metric

We employ the B³ metric of Bagga and Baldwin [1998] to evaluate the performance of the systems. For each mention $m_i$, let $c_i$ be the set of predicted coreferent mentions and $t_i$ be the set of truly coreferent mentions. The precision for $m_i$ is the number of mentions that appear in both $c_i$ and $t_i$ divided by the size of $c_i$; the recall is the same count divided by the size of $t_i$. These are averaged over all mentions in the corpus to obtain a single pair of precision and recall numbers. The F1 is the harmonic mean of the precision and recall.

4.2 Results

Coreference performance for each of the four systems is shown in Table 1. The baseline STR heuristic demonstrates the difficulty of performing coreference on this dataset: string-identical mentions are not necessarily coreferent, and different strings often refer to the same venue. The former is evidenced by the heuristic not obtaining 100% precision. The best performance overall is obtained by the DPVT system, where we set the concentration over title unigram distributions to $\lambda = 0.9$. This setting has the effect of favoring more peaked unigram distributions over title words, where the peaks correspond to words particular to that cluster. These results demonstrate a marked improvement in coreference performance from modeling the titles of the papers: F1 increases from 63.8% to 84.9% when title modeling is added to the DP mixture, a 58% error reduction.

The DPVL model has slightly higher precision than the DPVT model, but at a high cost to recall. It is plausible that adding more globally shared topics to the mixture would mitigate this effect. The data set contained two venues that shared the name ISWC (the International Semantic Web Conference and the International Symposium on Wearable Computers), which the DPVL model is able to disambiguate more accurately due to its more particular distributions over title words, as shown in Table 3. The standard Dirichlet process mixture, on the other hand, will almost always merge identical venue strings. Shown in Table 2 are some example per-venue distributions over words generated by the DPVL model. While common words such as "a" and "for" are highly weighted in these clusters, so are less frequent but more topical words.
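For reference, here is a minimal sketch of the B³ precision, recall, and F1 described in Section 4.1. It is our own illustration of the metric, not the evaluation code used for the experiments.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def b_cubed(predicted: List[int], gold: List[int]) -> Tuple[float, float, float]:
    """Mention-level B-cubed scores: predicted[m] and gold[m] are the predicted
    and true cluster ids of mention m. For each mention, precision is
    |c & t| / |c| and recall is |c & t| / |t|, averaged over mentions."""
    by_pred: Dict[int, set] = defaultdict(set)
    by_gold: Dict[int, set] = defaultdict(set)
    for m, (p, g) in enumerate(zip(predicted, gold)):
        by_pred[p].add(m)
        by_gold[g].add(m)

    precisions, recalls = [], []
    for m, (p, g) in enumerate(zip(predicted, gold)):
        c, t = by_pred[p], by_gold[g]
        overlap = len(c & t)
        precisions.append(overlap / len(c))
        recalls.append(overlap / len(t))

    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example: three mentions of one true venue split across two predicted clusters.
print(b_cubed(predicted=[0, 0, 1], gold=[7, 7, 7]))
```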

Model   Precision   Recall   F1
DPV
DPVT
DPVL
STR

Table 1: Percent B³ venue coreference performance for the four systems.

ICDAR         J. Comput. Phys.   Intl. Symp. Wearable Comp.
a             equations          realtime
document      for                positioning
recognition   numerical          a
handwritten   and                system
on            in                 wearable
and           method             novel
using         with               sensing
of            of                 for
for           the                computers
line          a                  personal

Table 2: The highest-weighted words in three of the per-cluster multinomial unigram distributions. Note that ICDAR is the International Conference on Document Analysis and Recognition.

Intl. Symp. on Wearable Comp.
    Realtime Personal Positioning System for Wearable Computers.
    Acceleration Sensing Glove (ASG).
Intl. Semantic Web Conf.
    Benchmarking DAML+OIL Repositories
    TRIPLE - A Query, Inference, and Transformation Language for the Semantic Web.

Table 3: Examples of ambiguous acronyms that are correctly disambiguated by the DPVL model. The first line of each group is the canonical string for the cluster. All of these mentions have the venue string ISWC.

5 Conclusions and Future Work

We present an unsupervised nonparametric Bayesian model for coreference of research venues. Although related models have been applied to coreference of paper titles and authors, research venues have several unique characteristics that warrant special modeling. By exploiting the fact that research venues have a characteristic distribution over titles, we obtain a dramatic increase in performance on venue coreference. In particular, the model is even able to accurately split apart venues that have string-identical abbreviations.

Several directions are available for future work. First, if labeled training data is available, then this model readily lends itself to semi-supervised prediction. This could be necessary to match the performance of discriminative coreference systems. Second, Culotta and McCallum [2005] show in the citations domain that joint coreference of multiple field types can improve performance. Our current model performs language modeling of titles, but does not include coreference of titles. It is possible that extending this model to include paper and author coreference would further improve performance.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, in part by U.S. Government contract #NBCH through a subcontract with BBNT Solutions LLC, and in part by the Central Intelligence Agency, the National Security Agency, and the National Science Foundation under NSF grant #IIS and NSF grant #IIS. Any opinions, findings, and conclusions or recommendations expressed in this material are the authors' and do not necessarily reflect those of the sponsor.

References

A. Bagga and B. Baldwin. Algorithms for scoring coreference chains. In Proceedings of MUC-7, 1998.

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993-1022, 2003.

Peter Carbonetto, Jacek Kisynski, Nando de Freitas, and David Poole. Nonparametric Bayesian logic. In UAI, 2005.

Aron Culotta and Andrew McCallum. Joint deduplication of multiple record types in relational data. In CIKM, 2005.

Aria Haghighi and Dan Klein. Unsupervised coreference resolution in a nonparametric Bayesian model. In ACL, 2007.

Andrew McCallum and Ben Wellner. Conditional models of identity uncertainty with application to noun coreference. In Lawrence K. Saul, Yair Weiss, and Léon Bottou, editors, Advances in Neural Information Processing Systems 17. MIT Press, Cambridge, MA, 2005.

Brian Milch. Probabilistic Models with Unknown Objects. PhD thesis, University of California, Berkeley, 2006.

R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249-265, 2000.

Hanna M. Pasula, Bhaskara Marthi, Brian Milch, Stuart Russell, and Ilya Shpitser. Identity uncertainty and citation matching. In NIPS, 2003.

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.
