A spoken term detection framework for recovering out-of-vocabulary words using the web

Carolina Parada, Abhinav Sethy, Mark Dredze, Frederick Jelinek

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Vocabulary restrictions in large vocabulary continuous speech recognition (LVCSR) systems mean that out-of-vocabulary (OOV) words are lost in the output. However, OOV words tend to be information rich terms (often named entities) and their omission from the transcript negatively affects both usability and downstream NLP technologies, such as machine translation or knowledge distillation. We propose a novel approach to OOV recovery that uses a spoken term detection (STD) framework. Given an identified OOV region in the LVCSR output, we recover the uttered OOVs by utilizing contextual information and the vast and constantly updated vocabulary on the Web. Discovered words are integrated into system output, recovering up to 40% of OOVs and resulting in a reduction in system error.

Original languageEnglish (US)
Title of host publicationProceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010
PublisherInternational Speech Communication Association
Pages1269-1272
Number of pages4
StatePublished - 2010
Externally publishedYes

Publication series

NameProceedings of the 11th Annual Conference of the International Speech Communication Association, INTERSPEECH 2010

Keywords

  • Data selection
  • Language modeling
  • Oov detection
  • Spoken term detection

ASJC Scopus subject areas

  • Language and Linguistics
  • Speech and Hearing
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Fingerprint

Dive into the research topics of 'A spoken term detection framework for recovering out-of-vocabulary words using the web'. Together they form a unique fingerprint.

Cite this