TY - GEN
T1 - Estimating document frequencies in a speech corpus
AU - Karakos, Damianos
AU - Dredze, Mark
AU - Church, Ken
AU - Jansen, Aren
AU - Khudanpur, Sanjeev
PY - 2011
Y1 - 2011
N2 - Inverse Document Frequency (IDF) is an important quantity in many applications, including Information Retrieval. IDF is defined in terms of document frequency, df(w), the number of documents that mention w at least once. This quantity is relatively easy to compute over textual documents, but spoken documents are more challenging. This paper considers two baselines: (1) an estimate based on the 1-best ASR output and (2) an estimate based on expected term frequencies computed from the lattice. We improve over these baselines by taking advantage of repetition. Whatever the document is about is likely to be repeated, unlike ASR errors, which tend to be more random (Poisson). In addition, we find it helpful to consider an ensemble of language models. There is an opportunity for the ensemble to reduce noise, assuming that the errors across language models are relatively uncorrelated. The opportunity for improvement is larger when WER is high. This paper considers a pairing task application that could benefit from improved estimates of df. The pairing task inputs conversational sides from the English Fisher corpus and outputs estimates of which sides were from the same conversation. Better estimates of df lead to better performance on this task.
UR - http://www.scopus.com/inward/record.url?scp=84858983151&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84858983151&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2011.6163966
DO - 10.1109/ASRU.2011.6163966
M3 - Conference contribution
AN - SCOPUS:84858983151
SN - 9781467303675
T3 - 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings
SP - 407
EP - 412
BT - 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011, Proceedings
T2 - 2011 IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2011
Y2 - 11 December 2011 through 15 December 2011
ER -