Abstract
Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve the desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a relatively small domain-specific corpus to a large general-English one raises accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics to quantify the similarity between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that achieves satisfactory accuracy on a new domain at relatively small cost.
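The abstract does not name the characteristics used to compare a training corpus with test data. As an illustration only, the sketch below computes one widely used measure of that kind, the out-of-vocabulary (OOV) rate: the fraction of test tokens that never occur in the training corpus. The function name `oov_rate` and the toy sentences are hypothetical and are not taken from the paper.

```python
def oov_rate(train_tokens, test_tokens):
    """Return the fraction of test tokens unseen in training (case-insensitive).

    This is an assumed, illustrative similarity measure, not necessarily
    one of the characteristics proposed in the paper.
    """
    if not test_tokens:
        return 0.0
    vocab = {tok.lower() for tok in train_tokens}
    unseen = sum(1 for tok in test_tokens if tok.lower() not in vocab)
    return unseen / len(test_tokens)


# Toy data (hypothetical): a "general English" corpus, a small
# domain-specific supplement, and a clinical test sentence.
general = "the patient was seen by the doctor".split()
domain = "patient denies chest pain".split()
test = "the patient denies dyspnea".split()

rate_general = oov_rate(general, test)
rate_combined = oov_rate(general + domain, test)
print(rate_general, rate_combined)
```

On this toy data, adding the small domain-specific supplement lowers the OOV rate of the test sentence, which mirrors the intuition behind combining a large general corpus with a small in-domain one.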
Original language | English (US) |
---|---|
Pages (from-to) | 422-430 |
Number of pages | 9 |
Journal | Journal of Biomedical Informatics |
Volume | 38 |
Issue number | 6 |
State | Published - Dec 2005 |
Externally published | Yes |
Keywords
- Biomedical domain
- Clinical information systems
- Clinical report analysis
- Corpus linguistics
- Domain adaptation
- Hidden Markov Model
- Part-of-speech tagging accuracy
- Statistical part-of-speech tagging
ASJC Scopus subject areas
- Computer Science Applications
- Health Informatics