Evaluating the harmonisation potential of diverse cohort datasets

Sarah Bauermeister; Mukta Phatak; Kelly Sparks; Lana Sargent; Michael Griswold; Caitlin McHugh; Mike Nalls; Simon Young; Joshua Bauermeister; Paul Elliott; Andrew Steptoe; David Porteous; Carole Dufouil; John Gallacher

doi:10.1007/s10654-023-00997-3

Research output: Contribution to journal › Article › peer-review

Abstract

Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.

Original language	English (US)
Pages (from-to)	605-615
Number of pages	11
Journal	European Journal of Epidemiology
Volume	38
Issue number	6
DOIs	https://doi.org/10.1007/s10654-023-00997-3
State	Published - Jun 2023
Externally published	Yes

Keywords

C-Surv data model
Cohort
Data discovery
Data harmonisation
Data visualisation
Datasets

ASJC Scopus subject areas

Epidemiology

Access to Document

10.1007/s10654-023-00997-3

Cite this

@article{2282fbd0787f4438a6712150d9aea27b,

title = "Evaluating the harmonisation potential of diverse cohort datasets",

abstract = "Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.",

keywords = "C-Surv data model, Cohort, Data discovery, Data harmonisation, Data visualisation, Datasets",

author = "Sarah Bauermeister and Mukta Phatak and Kelly Sparks and Lana Sargent and Michael Griswold and Caitlin McHugh and Mike Nalls and Simon Young and Joshua Bauermeister and Paul Elliott and Andrew Steptoe and David Porteous and Carole Dufouil and John Gallacher",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",

year = "2023",

month = jun,

doi = "10.1007/s10654-023-00997-3",

language = "English (US)",

volume = "38",

pages = "605--615",

journal = "European Journal of Epidemiology",

issn = "0393-2990",

publisher = "Springer Netherlands",

number = "6",

}

TY - JOUR

T1 - Evaluating the harmonisation potential of diverse cohort datasets

AU - Bauermeister, Sarah

AU - Phatak, Mukta

AU - Sparks, Kelly

AU - Sargent, Lana

AU - Griswold, Michael

AU - McHugh, Caitlin

AU - Nalls, Mike

AU - Young, Simon

AU - Bauermeister, Joshua

AU - Elliott, Paul

AU - Steptoe, Andrew

AU - Porteous, David

AU - Dufouil, Carole

AU - Gallacher, John

PY - 2023/6

Y1 - 2023/6

N2 - Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.

AB - Data discovery, the ability to find datasets relevant to an analysis, increases scientific opportunity, improves rigour and accelerates activity. Rapid growth in the depth, breadth, quantity and availability of data provides unprecedented opportunities and challenges for data discovery. A potential tool for increasing the efficiency of data discovery, particularly across multiple datasets, is data harmonisation. A set of 124 variables, identified as being of broad interest to neurodegeneration, were harmonised using the C-Surv data model. Harmonisation strategies used were simple calibration, algorithmic transformation and standardisation to the Z-distribution. Widely used data conventions, optimised for inclusiveness rather than aetiological precision, were used as harmonisation rules. The harmonisation scheme was applied to data from four diverse population cohorts. Of the 120 variables that were found in the datasets, correspondence between the harmonised data schema and cohort-specific data models was complete or close for 111 (93%). For the remainder, harmonisation was possible with a marginal loss of granularity. Although harmonisation is not an exact science, sufficient comparability across datasets was achieved to enable data discovery with relatively little loss of informativeness. This provides a basis for further work extending harmonisation to a larger variable list, applying the harmonisation to further datasets, and incentivising the development of data discovery tools.

KW - C-Surv data model

KW - Cohort

KW - Data discovery

KW - Data harmonisation

KW - Data visualisation

KW - Datasets

UR - http://www.scopus.com/inward/record.url?scp=85153611238&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85153611238&partnerID=8YFLogxK

U2 - 10.1007/s10654-023-00997-3

DO - 10.1007/s10654-023-00997-3

M3 - Article

C2 - 37099244

AN - SCOPUS:85153611238

SN - 0393-2990

VL - 38

SP - 605

EP - 615

JO - European Journal of Epidemiology

JF - European Journal of Epidemiology

IS - 6

ER -
