Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems

Kory Kreimeyer; David Menschik; Scott Winiecki; Wendy Paul; Faith Barash; Emily Jane Woo; Meghna Alimchandani; Deepa Arya; Craig Zinderman; Richard Forshee; Taxiarchis Botsis

doi:10.1007/s40264-017-0523-4

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Introduction: Duplicate case reports in spontaneous adverse event reporting systems make it harder for medical reviewers to perform individual and aggregate safety analyses efficiently. Duplicate cases can also bias data mining by generating spurious signals of disproportional reporting of product-adverse event pairs.

Objective: We have developed a probabilistic record linkage algorithm for identifying duplicate cases in the US Vaccine Adverse Event Reporting System (VAERS) and the US Food and Drug Administration Adverse Event Reporting System (FAERS).

Methods: In addition to using structured field data, the algorithm incorporates the unstructured narrative text of adverse event reports by examining clinical and temporal information extracted by the Event-based Text-mining of Health Electronic Records system, a natural language processing tool. The final component of the algorithm is a novel duplicate confidence value, calculated by a rule-based empirical approach that looks for similarities between two case reports across a number of criteria.

Results: For VAERS, the algorithm identified 77% of known duplicate pairs with a precision (or positive predictive value) of 95%. For FAERS, it identified 13% of known duplicate pairs with a precision of 100%. The textual information did not improve the algorithm’s automated classification for VAERS or FAERS. The empirical duplicate confidence value increased performance on both VAERS and FAERS, mainly by reducing the occurrence of false positives.

Conclusions: The algorithm was shown to be effective at identifying pre-linked duplicate VAERS reports. The narrative text was not shown to be a key component in the automated detection evaluation; however, it is essential for supporting the semi-automated approach that is likely to be deployed at the Food and Drug Administration, where medical reviewers will perform some manual review of the most highly ranked reports identified by the algorithm.
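The abstract does not spell out how pairs of reports are scored or how the reported percentages are computed. As a rough illustration only, the sketch below uses the classic Fellegi-Sunter log-likelihood weighting that is commonly meant by "probabilistic record linkage", together with the standard definitions of precision and recall (the share of known duplicate pairs identified). The field names, the m/u probabilities, and both function names are hypothetical and are not taken from the paper.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from the paper):
#   m = P(field agrees | the two reports are duplicates)
#   u = P(field agrees | the two reports are not duplicates)
FIELD_PROBS = {
    "age": (0.95, 0.05),
    "sex": (0.98, 0.50),
    "onset_date": (0.90, 0.01),
    "product": (0.97, 0.10),
}


def match_weight(report_a, report_b, field_probs=FIELD_PROBS):
    """Sum Fellegi-Sunter log2 likelihood ratios over the compared fields."""
    total = 0.0
    for field, (m, u) in field_probs.items():
        a, b = report_a.get(field), report_b.get(field)
        if a is None or b is None:
            continue  # a missing value contributes no evidence either way
        if a == b:
            total += math.log2(m / u)              # agreement weight (positive)
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight (negative)
    return total


def precision_recall(predicted_pairs, known_pairs):
    """Precision (positive predictive value) and recall against known duplicates."""
    predicted, known = set(predicted_pairs), set(known_pairs)
    true_positives = len(predicted & known)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall
```

In a setup like this, pairs whose weight exceeds a chosen threshold would be flagged as candidate duplicates; a second filtering step, such as the rule-based duplicate confidence value the abstract describes, could then discard false positives before precision and recall are measured.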

Original language	English (US)
Pages (from-to)	571-582
Number of pages	12
Journal	Drug Safety
Volume	40
Issue number	7
DOIs	https://doi.org/10.1007/s40264-017-0523-4
State	Published - Jul 1 2017
Externally published	Yes

ASJC Scopus subject areas

Toxicology
Pharmacology
Pharmacology (medical)

Access to Document

10.1007/s40264-017-0523-4

Cite this

Kreimeyer, K., Menschik, D., Winiecki, S., Paul, W., Barash, F., Woo, E. J., Alimchandani, M., Arya, D., Zinderman, C., Forshee, R., & Botsis, T. (2017). Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems. Drug Safety, 40(7), 571-582. https://doi.org/10.1007/s40264-017-0523-4

Kreimeyer, K, Menschik, D, Winiecki, S, Paul, W, Barash, F, Woo, EJ, Alimchandani, M, Arya, D, Zinderman, C, Forshee, R & Botsis, T 2017, 'Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems', Drug Safety, vol. 40, no. 7, pp. 571-582. https://doi.org/10.1007/s40264-017-0523-4

@article{146347071fdd456085047525351198e5,

title = "Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems",

abstract = "Introduction: Duplicate case reports in spontaneous adverse event reporting systems pose a challenge for medical reviewers to efficiently perform individual and aggregate safety analyses. Duplicate cases can bias data mining by generating spurious signals of disproportional reporting of product-adverse event pairs. Objective: We have developed a probabilistic record linkage algorithm for identifying duplicate cases in the US Vaccine Adverse Event Reporting System (VAERS) and the US Food and Drug Administration Adverse Event Reporting System (FAERS). Methods: In addition to using structured field data, the algorithm incorporates the non-structured narrative text of adverse event reports by examining clinical and temporal information extracted by the Event-based Text-mining of Health Electronic Records system, a natural language processing tool. The final component of the algorithm is a novel duplicate confidence value that is calculated by a rule-based empirical approach that looks for similarities in a number of criteria between two case reports. Results: For VAERS, the algorithm identified 77% of known duplicate pairs with a precision (or positive predictive value) of 95%. For FAERS, it identified 13% of known duplicate pairs with a precision of 100%. The textual information did not improve the algorithm{\textquoteright}s automated classification for VAERS or FAERS. The empirical duplicate confidence value increased performance on both VAERS and FAERS, mainly by reducing the occurrence of false-positives. Conclusions: The algorithm was shown to be effective at identifying pre-linked duplicate VAERS reports. The narrative text was not shown to be a key component in the automated detection evaluation; however, it is essential for supporting the semi-automated approach that is likely to be deployed at the Food and Drug Administration, where medical reviewers will perform some manual review of the most highly ranked reports identified by the algorithm.",

author = "Kory Kreimeyer and David Menschik and Scott Winiecki and Wendy Paul and Faith Barash and Woo, {Emily Jane} and Meghna Alimchandani and Deepa Arya and Craig Zinderman and Richard Forshee and Taxiarchis Botsis",

note = "Funding Information: The authors thank Ezekiel Maier for several conversations and suggestions that have enhanced the technical aspects of this work. This work was supported in part by the appointment of Kory Kreimeyer to the Research Participation Program administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and the US Food and Drug Administration. Kory Kreimeyer, David Menschik, Scott Winiecki, Wendy Paul, Faith Barash, Emily Jane Woo, Meghna Alimchandani, Deepa Arya, Craig Zinderman, Richard Forshee, and Taxiarchis Botsis have no conflicts of interest directly relevant to the content of this article. Publisher Copyright: {\textcopyright} 2017, Springer International Publishing Switzerland 2017(outside the USA).",

year = "2017",

month = jul,

day = "1",

doi = "10.1007/s40264-017-0523-4",

language = "English (US)",

volume = "40",

pages = "571--582",

journal = "Drug Safety",

issn = "0114-5916",

publisher = "Adis International Ltd",

number = "7",

}

TY - JOUR

T1 - Using Probabilistic Record Linkage of Structured and Unstructured Data to Identify Duplicate Cases in Spontaneous Adverse Event Reporting Systems

AU - Kreimeyer, Kory

AU - Menschik, David

AU - Winiecki, Scott

AU - Paul, Wendy

AU - Barash, Faith

AU - Woo, Emily Jane

AU - Alimchandani, Meghna

AU - Arya, Deepa

AU - Zinderman, Craig

AU - Forshee, Richard

AU - Botsis, Taxiarchis

N1 - Funding Information: The authors thank Ezekiel Maier for several conversations and suggestions that have enhanced the technical aspects of this work. This work was supported in part by the appointment of Kory Kreimeyer to the Research Participation Program administered by the Oak Ridge Institute for Science and Education through an interagency agreement between the US Department of Energy and the US Food and Drug Administration. Kory Kreimeyer, David Menschik, Scott Winiecki, Wendy Paul, Faith Barash, Emily Jane Woo, Meghna Alimchandani, Deepa Arya, Craig Zinderman, Richard Forshee, and Taxiarchis Botsis have no conflicts of interest directly relevant to the content of this article. Publisher Copyright: © 2017, Springer International Publishing Switzerland 2017(outside the USA).

PY - 2017/7/1

Y1 - 2017/7/1

AB - Introduction: Duplicate case reports in spontaneous adverse event reporting systems pose a challenge for medical reviewers to efficiently perform individual and aggregate safety analyses. Duplicate cases can bias data mining by generating spurious signals of disproportional reporting of product-adverse event pairs. Objective: We have developed a probabilistic record linkage algorithm for identifying duplicate cases in the US Vaccine Adverse Event Reporting System (VAERS) and the US Food and Drug Administration Adverse Event Reporting System (FAERS). Methods: In addition to using structured field data, the algorithm incorporates the non-structured narrative text of adverse event reports by examining clinical and temporal information extracted by the Event-based Text-mining of Health Electronic Records system, a natural language processing tool. The final component of the algorithm is a novel duplicate confidence value that is calculated by a rule-based empirical approach that looks for similarities in a number of criteria between two case reports. Results: For VAERS, the algorithm identified 77% of known duplicate pairs with a precision (or positive predictive value) of 95%. For FAERS, it identified 13% of known duplicate pairs with a precision of 100%. The textual information did not improve the algorithm’s automated classification for VAERS or FAERS. The empirical duplicate confidence value increased performance on both VAERS and FAERS, mainly by reducing the occurrence of false-positives. Conclusions: The algorithm was shown to be effective at identifying pre-linked duplicate VAERS reports. The narrative text was not shown to be a key component in the automated detection evaluation; however, it is essential for supporting the semi-automated approach that is likely to be deployed at the Food and Drug Administration, where medical reviewers will perform some manual review of the most highly ranked reports identified by the algorithm.

UR - http://www.scopus.com/inward/record.url?scp=85015203844&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85015203844&partnerID=8YFLogxK

U2 - 10.1007/s40264-017-0523-4

DO - 10.1007/s40264-017-0523-4

M3 - Article

C2 - 28293864

AN - SCOPUS:85015203844

SN - 0114-5916

VL - 40

SP - 571

EP - 582

JO - Drug Safety

JF - Drug Safety

IS - 7

ER -
