Abstract
Previous research in cross-document entity coreference has generally been restricted to the offline scenario where the set of documents is provided in advance. As a consequence, the dominant approach is based on greedy agglomerative clustering techniques that utilize pairwise vector comparisons and thus require O(n²) space and time. In this paper we explore identifying coreferent entity mentions across documents in high-volume streaming text, including methods for utilizing orthographic and contextual information. We test our methods using several corpora to quantitatively measure both the efficacy and scalability of our streaming approach. We show that our approach scales to at least an order of magnitude larger data than previously reported methods.
Original language | English (US) |
---|---|
Pages | 1050-1058 |
Number of pages | 9 |
State | Published - 2010 |
Externally published | Yes |
Event | 23rd International Conference on Computational Linguistics, Coling 2010 - Beijing, China; Duration: Aug 23 2010 → Aug 27 2010 |
Conference
Conference | 23rd International Conference on Computational Linguistics, Coling 2010 |
---|---|
Country/Territory | China |
City | Beijing |
Period | 8/23/10 → 8/27/10 |
ASJC Scopus subject areas
- Language and Linguistics
- Computational Theory and Mathematics
- Linguistics and Language