Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study

Hieu T. Nguyen; Henrique D. Vasconcellos; Kimberley Keck; Jared P. Reis; Cora E. Lewis; Steven Sidney; Donald M. Lloyd-Jones; Pamela J. Schreiner; Eliseo Guallar; Colin O. Wu; João A.C. Lima; Bharath Ambale-Venkatesh

doi:10.1186/s12874-023-01845-4

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Multivariate longitudinal data are under-utilized for survival analysis compared to cross-sectional data (CS; data collected once across the cohort). In cardiovascular risk prediction in particular, despite the availability of methods for longitudinal data analysis, the value of longitudinal information has not been established in terms of improved predictive accuracy and clinical applicability.

Methods: We investigated the value of longitudinal data over and above cross-sectional data using 6 distinct modeling strategies from statistics, machine learning, and deep learning that incorporate repeated measures for survival analysis of time to cardiovascular event in the Coronary Artery Risk Development in Young Adults (CARDIA) cohort. We then examined and compared model-specific interpretability methods (Random Survival Forest Variable Importance) and model-agnostic methods (SHapley Additive exPlanation (SHAP) and Temporal Importance Model Explanation (TIME)) in cardiovascular risk prediction using the top-performing models.

Results: In a cohort of 3539 participants, longitudinal information from 35 variables collected repeatedly at 6 exam visits over 15 years improved subsequent long-term (17 years after) risk prediction by up to 8.3% in C-index compared to using baseline data (0.78 vs. 0.72), and by up to approximately 4% compared to using the last observed CS data (0.75). Time-varying AUC was also higher in models using longitudinal data (0.86–0.87 at 5 years, 0.79–0.81 at 10 years) than in models using baseline or last observed CS data (0.80–0.86 at 5 years, 0.73–0.77 at 10 years). Comparative model interpretability analysis revealed the impact of longitudinal variables on model prediction at both the individual and global scales across modeling strategies, and identified the best time windows and the best timing within those windows for event prediction. The most accurate strategy for incorporating longitudinal data was time series massive feature extraction, and the most easily interpretable strategy was trajectory clustering.

Conclusion: Our analysis demonstrates the added value of longitudinal data for predictive accuracy and epidemiological utility in cardiovascular risk survival analysis in young adults, via a unified, scalable framework that compares model performance and explainability. The framework can be extended to a larger number of variables and other longitudinal modeling methods.

Trial registration: ClinicalTrials.gov Identifier: NCT00005130, Registration Date: 26/05/2000.
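For readers unfamiliar with the C-index reported in the Results, the following is a minimal sketch of Harrell's concordance index for right-censored data, followed by the arithmetic behind the percentage gains. It is not the authors' code: the function names, the toy data, and the simplified handling of tied event times are assumptions made for illustration. Reading the reported gains as relative differences is also an assumption, although it is consistent with the figures quoted (0.78 vs. 0.72 and 0.78 vs. 0.75).

```python
from itertools import combinations


def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    A pair is comparable when the subject with the shorter follow-up
    time had an observed event. It is concordant when that subject also
    has the higher predicted risk; ties in risk count as 0.5.
    Pairs with tied follow-up times are skipped (a simplification).
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        if times[i] == times[j]:
            continue
        first, second = (i, j) if times[i] < times[j] else (j, i)
        if not events[first]:
            continue  # earlier subject was censored, so the pair is not comparable
        comparable += 1
        if risk_scores[first] > risk_scores[second]:
            concordant += 1.0
        elif risk_scores[first] == risk_scores[second]:
            concordant += 0.5
    return concordant / comparable


def relative_gain(new, reference):
    """Relative improvement of `new` over `reference`, in percent."""
    return 100.0 * (new - reference) / reference


# Hypothetical toy data (not from CARDIA): follow-up time in years,
# event indicator (1 = event observed, 0 = censored), predicted risk.
toy_times = [2, 4, 6, 8, 10]
toy_events = [1, 1, 0, 1, 0]
toy_risks = [0.9, 0.4, 0.5, 0.3, 0.1]

toy_c = harrell_c_index(toy_times, toy_events, toy_risks)
print(f"toy C-index: {toy_c:.3f}")

# C-index values quoted in the abstract.
print(f"vs. baseline data: {relative_gain(0.78, 0.72):.1f}%")
print(f"vs. last observed CS data: {relative_gain(0.78, 0.75):.1f}%")
```

In the toy data, 8 pairs are comparable and 7 of them are concordant, so the index is 0.875. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking of event times by predicted risk.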

Original language	English (US)
Article number	23
Journal	BMC medical research methodology
Volume	23
Issue number	1
DOIs	https://doi.org/10.1186/s12874-023-01845-4
State	Published - Dec 2023

Keywords

CARDIA
Explainable AI
Longitudinal data
Personalized medicine
Repeated measures
Risk prediction
SHAP
Survival analysis
TIME
Time-varying covariates

ASJC Scopus subject areas

Health Informatics
Epidemiology

Cite this

Nguyen, H. T., Vasconcellos, H. D., Keck, K., Reis, J. P., Lewis, C. E., Sidney, S., Lloyd-Jones, D. M., Schreiner, P. J., Guallar, E., Wu, C. O., Lima, J. A. C., & Ambale-Venkatesh, B. (2023). Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study. BMC medical research methodology, 23(1), Article 23. https://doi.org/10.1186/s12874-023-01845-4

Nguyen, HT, Vasconcellos, HD, Keck, K, Reis, JP, Lewis, CE, Sidney, S, Lloyd-Jones, DM, Schreiner, PJ, Guallar, E, Wu, CO, Lima, JAC & Ambale-Venkatesh, B 2023, 'Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study', BMC medical research methodology, vol. 23, no. 1, 23. https://doi.org/10.1186/s12874-023-01845-4

@article{6d3d2b923ab345388627e41350b5b13b,

title = "Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults: insights from a comparative explainable study",

keywords = "CARDIA, Explainable AI, Longitudinal data, Personalized medicine, Repeated measures, Risk prediction, SHAP, Survival analysis, TIME, Time-varying covariates",

author = "Nguyen, {Hieu T.} and Vasconcellos, {Henrique D.} and Kimberley Keck and Reis, {Jared P.} and Lewis, {Cora E.} and Steven Sidney and Lloyd-Jones, {Donald M.} and Schreiner, {Pamela J.} and Eliseo Guallar and Wu, {Colin O.} and Lima, {Jo{\~a}o A.C.} and Bharath Ambale-Venkatesh",

note = "Publisher Copyright: {\textcopyright} 2023, The Author(s).",

year = "2023",

month = dec,

doi = "10.1186/s12874-023-01845-4",

language = "English (US)",

volume = "23",

journal = "BMC medical research methodology",

issn = "1471-2288",

publisher = "BioMed Central",

number = "1",

}

TY - JOUR

T1 - Multivariate longitudinal data for survival analysis of cardiovascular event prediction in young adults

T2 - insights from a comparative explainable study

AU - Nguyen, Hieu T.

AU - Vasconcellos, Henrique D.

AU - Keck, Kimberley

AU - Reis, Jared P.

AU - Lewis, Cora E.

AU - Sidney, Steven

AU - Lloyd-Jones, Donald M.

AU - Schreiner, Pamela J.

AU - Guallar, Eliseo

AU - Wu, Colin O.

AU - Lima, João A.C.

AU - Ambale-Venkatesh, Bharath

PY - 2023/12

Y1 - 2023/12

KW - CARDIA

KW - Explainable AI

KW - Longitudinal data

KW - Personalized medicine

KW - Repeated measures

KW - Risk prediction

KW - SHAP

KW - Survival analysis

KW - TIME

KW - Time-varying covariates

UR - http://www.scopus.com/inward/record.url?scp=85146815172&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85146815172&partnerID=8YFLogxK

U2 - 10.1186/s12874-023-01845-4

DO - 10.1186/s12874-023-01845-4

M3 - Article

C2 - 36698064

AN - SCOPUS:85146815172

SN - 1471-2288

VL - 23

JO - BMC medical research methodology

JF - BMC medical research methodology

IS - 1

M1 - 23

ER -
