Learning sub-word units for open vocabulary speech recognition

Carolina Parada, Mark Dredze, Abhinav Sethy, Ariya Rastrow

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Large vocabulary speech recognition systems fail to recognize words beyond their vocabulary, many of which are information rich terms, like named entities or foreign words. Hybrid word/sub-word systems solve this problem by adding sub-word units to large vocabulary word based systems; new words can then be represented by combinations of subword units. Previous work heuristically created the sub-word lexicon from phonetic representations of text using simple statistics to select common phone sequences. We propose a probabilistic model to learn the subword lexicon optimized for a given task. We consider the task of out of vocabulary (OOV) word detection, which relies on output from a hybrid model. A hybrid model with our learned sub-word lexicon reduces error by 6.3% and 7.6% (absolute) at a 5% false alarm rate on an English Broadcast News and MIT Lectures task respectively.

Original languageEnglish (US)
Title of host publicationACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics
Subtitle of host publicationHuman Language Technologies
Pages712-721
Number of pages10
StatePublished - 2011
Externally publishedYes
Event49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011 - Portland, OR, United States
Duration: Jun 19 2011Jun 24 2011

Publication series

NameACL-HLT 2011 - Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies
Volume1

Conference

Conference49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL-HLT 2011
Country/TerritoryUnited States
CityPortland, OR
Period6/19/116/24/11

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

Fingerprint

Dive into the research topics of 'Learning sub-word units for open vocabulary speech recognition'. Together they form a unique fingerprint.

Cite this