Developing a corpus of clinical notes manually annotated for part-of-speech

Serguei V. Pakhomov, Anni Coden, Christopher G. Chute

Research output: Contribution to journal › Article › peer-review

28 Scopus citations


Purpose: This paper presents a project whose main goal is to construct a corpus of clinical text manually annotated for part-of-speech (POS) information. We describe and discuss the process of training three domain experts to perform linguistic annotation.

Methods: Three domain experts were trained to perform manual annotation of a corpus of clinical notes. One part of this corpus was combined with the Penn Treebank corpus of general-purpose English text, and another part was set aside for testing. The corpora were then used for training and testing statistical part-of-speech taggers. We list some of the challenges as well as encouraging results pertaining to inter-rater agreement and consistency of annotation.

Results: The Trigrams'n'Tags (TnT) tagger [T. Brants, TnT - a statistical part-of-speech tagger, In: Proceedings of NAACL/ANLP-2000 Symposium, 2000], trained on general English data, achieved 89.79% correctness. The same tagger trained on a portion of the medical data annotated for this project improved performance to 94.69%. Furthermore, we find that discriminating between the different types of discourse represented by different sections of clinical text may substantially improve the correctness of POS tagging.

Conclusion: Our preliminary experimental results indicate the necessity of adapting state-of-the-art POS taggers to the sublanguage domain of clinical text.
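To make the reported figures concrete, the following minimal Python sketch shows one way such a comparison could be scored. It assumes that "correctness" means token-level accuracy (the share of tokens whose predicted tag matches the manual annotation), which the abstract does not state explicitly. The function names and the toy tag sequences are hypothetical illustrations and are not taken from the paper or from the TnT tagger.

```python
from typing import Sequence


def tagging_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of tokens whose predicted tag equals the gold (manual) tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def relative_error_reduction(baseline_acc: float, adapted_acc: float) -> float:
    """Share of the baseline tagger's errors removed by the adapted tagger."""
    baseline_err = 1.0 - baseline_acc
    adapted_err = 1.0 - adapted_acc
    return (baseline_err - adapted_err) / baseline_err


# Toy example with invented tags (not data from the annotated corpus).
gold = ["NN", "VBZ", "JJ", "NN"]
pred = ["NN", "VBZ", "NN", "NN"]
toy_accuracy = tagging_accuracy(gold, pred)
print(toy_accuracy)  # 0.75

# Accuracies reported in the abstract: 89.79% (general English training data)
# and 94.69% (training data that includes annotated clinical notes).
reduction = relative_error_reduction(0.8979, 0.9469)
print(f"{reduction:.1%}")
```

Under that assumption, moving from 89.79% to 94.69% lowers the error rate from 10.21% to 5.31%, which removes roughly 48% of the baseline tagger's errors.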

Original language: English (US)
Pages (from-to): 418-429
Number of pages: 12
Journal: International Journal of Medical Informatics
Issue number: 6
State: Published - Jun 2006
Externally published: Yes

Keywords


  • Domain adaptation
  • Manual text annotation
  • Medical domain
  • Natural language processing
  • Statistical part-of-speech tagging
  • Text analysis

ASJC Scopus subject areas

  • Health Informatics

