Estimating document frequencies in a speech corpus

Damianos Karakos, Mark Dredze, Ken Church, Aren Jansen, Sanjeev Khudanpur

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
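
The definition of document frequency in the abstract can be sketched in a few lines of Python. This is only a minimal illustration, not the paper's estimator: it assumes each document is already a list of tokens (for example, the 1-best ASR output), and it uses IDF(w) = -log2(df(w) / N), a common convention that the abstract does not spell out. The function names and the toy transcripts are hypothetical.

```python
import math
from collections import Counter


def document_frequency(docs):
    """df(w): the number of documents that mention w at least once."""
    df = Counter()
    for tokens in docs:
        # set() ensures a word counts once per document, however often it repeats
        df.update(set(tokens))
    return df


def inverse_document_frequency(df, num_docs):
    """IDF(w) = -log2(df(w) / N); this convention is an assumption, not from the paper."""
    return {w: -math.log2(count / num_docs) for w, count in df.items()}


# Made-up 1-best transcripts, one token list per document.
docs = [
    ["the", "dog", "chased", "the", "dog"],
    ["the", "cat", "slept"],
    ["a", "dog", "barked"],
    ["the", "weather", "is", "nice"],
]

df = document_frequency(docs)
idf = inverse_document_frequency(df, len(docs))
```

With these toy transcripts, "dog" appears in two of four documents and "cat" in one, so "cat" receives the higher IDF. On real ASR output, recognition errors would perturb these counts, which is the estimation problem the abstract describes.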

Original language: English (US)
Title of host publication: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings
Pages: 407-412
Number of pages: 6
State: Published - 2011
Externally published: Yes
Event: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011 - Waikoloa, HI, United States
Duration: Dec 11 2011 to Dec 15 2011

Publication series

Name: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings

Conference

Conference: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011
Country/Territory: United States
City: Waikoloa, HI
Period: 12/11/11 to 12/15/11

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
  • Human-Computer Interaction
