Name phylogeny: A generative model of string variation

Nicholas Andrews, Jason Eisner, Mark Dredze

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Many linguistic and textual processes involve transduction of strings. We show how to learn a stochastic transducer from an unorganized collection of strings (rather than string pairs). The role of the transducer is to organize the collection. Our generative model explains similarities among the strings by supposing that some strings in the collection were not generated ab initio, but were instead derived by transduction from other, "similar" strings in the collection. Our variational EM learning algorithm alternately reestimates this phylogeny and the transducer parameters. The final learned transducer can quickly link any test name into the final phylogeny, thereby locating variants of the test name. We find that our method can effectively find name variants in a corpus of web strings used to refer to persons in Wikipedia, improving over standard untrained distances such as Jaro-Winkler and Levenshtein distance.
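
For orientation, the untrained Levenshtein baseline that the abstract compares against can be sketched as below. This is the standard unit-cost dynamic-programming edit distance, not the paper's learned transducer or its phylogeny model; the function names `levenshtein` and `nearest_variants` and the example names are illustrative assumptions, not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance: insertions, deletions, substitutions."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at cost 0
            ))
        prev = curr
    return prev[-1]


def nearest_variants(query: str, names: list[str], k: int = 3) -> list[str]:
    """Hypothetical helper: rank candidate names by distance to the query."""
    return sorted(names, key=lambda n: levenshtein(query, n))[:k]


if __name__ == "__main__":
    candidates = ["Barak Obama", "George Bush", "Barack Obama Jr."]
    print(nearest_variants("Barack Obama", candidates, k=2))
```

Because every edit costs the same here, such a distance cannot learn that some changes (say, dropping a middle initial) are more plausible than others, which is the gap a trained transducer is meant to fill.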

Original language: English (US)
Title of host publication: EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference
Pages: 344-355
Number of pages: 12
State: Published - 2012
Externally published: Yes
Event: 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012 - Jeju Island, Korea, Republic of
Duration: Jul 12 2012 to Jul 14 2012

Publication series

Name: EMNLP-CoNLL 2012 - 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Proceedings of the Conference

Conference

Conference: 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL 2012
Country/Territory: Korea, Republic of
City: Jeju Island
Period: 7/12/12 to 7/14/12

ASJC Scopus subject areas

  • Software
