TY - JOUR
T1 - Generalizability and Bias in a Deep Learning Pediatric Bone Age Prediction Model Using Hand Radiographs
AU - Beheshtian, Elham
AU - Putman, Kristin
AU - Santomartino, Samantha M.
AU - Parekh, Vishwa S.
AU - Yi, Paul H.
N1 - Funding Information:
Author contributions: Guarantors of integrity of entire study, E.B., P.H.Y.; study concepts/study design or data acquisition or data analysis/interpretation, all authors; manuscript drafting or manuscript revision for important intellectual content, all authors; approval of final version of submitted manuscript, all authors; agrees to ensure any questions related to the work are appropriately resolved, all authors; literature research, E.B., S.M.S., V.S.P., P.H.Y.; clinical studies, E.B.; experimental studies, all authors; statistical analysis, all authors; and manuscript editing, all authors. Disclosures of conflicts of interest: E.B. No relevant relationships. K.P. No relevant relationships. S.M.S. No relevant relationships. V.S.P. No relevant relationships. P.H.Y. RSNA Resident Research Grant, Johns Hopkins Discovery Award, and Johns Hopkins Malone Center Seed Grant; consulting fees from Bunkerhill Health and FH Ortho; associate editor and former Trainee Editorial Board editor for Radiology: Artificial Intelligence.
Publisher Copyright:
© RSNA, 2022.
PY - 2023/3
Y1 - 2023/3
N2 - Background: Although deep learning (DL) models have demonstrated expert-level ability for pediatric bone age prediction, they have shown poor generalizability and bias in other use cases. Purpose: To quantify generalizability and bias in a bone age DL model, measured by performance on external versus internal test sets and by performance differences between demographic groups, respectively. Materials and Methods: The winning DL model of the 2017 RSNA Pediatric Bone Age Challenge, trained on 12 611 pediatric hand radiographs from two U.S. hospitals, was retrospectively evaluated. The DL model was tested from September 2021 to December 2021 on an internal validation set and an external test set of pediatric hand radiographs with diverse demographic representation. Images with a reported ground-truth bone age were included. Mean absolute difference (MAD) between the ground-truth bone age and the model-predicted bone age was calculated for each set. Generalizability was evaluated by comparing MAD between the internal and external evaluation sets with use of t tests. Bias was evaluated by comparing MAD and the clinically significant error rate (rate of errors that change the clinical diagnosis) between demographic groups with use of t tests or analysis of variance and χ2 tests, respectively (statistically significant difference defined as P <.05). Results: The internal validation set had images from 1425 individuals (773 boys), and the external test set had images from 1202 individuals (mean age, 133 months ± 60 [SD]; 614 boys). The bone age model generalized well to the external test set, with no difference in MAD (6.8 months in the validation set vs 6.9 months in the external set; P =.64). Model predictions would have led to clinically significant errors in 194 of 1202 images (16%) in the external test set. The MAD was greater for girls than boys in the internal validation set (P =.01) and in the subcategories of age and Tanner stage in the external test set (P <.001 for both). Conclusion: A DL bone age model generalized well to an external test set, although clinically significant sex-, age-, and sexual maturity–based biases in DL bone age prediction were identified.
UR - http://www.scopus.com/inward/record.url?scp=85147047655&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85147047655&partnerID=8YFLogxK
U2 - 10.1148/radiol.220505
DO - 10.1148/radiol.220505
M3 - Article
C2 - 36165796
AN - SCOPUS:85147047655
SN - 0033-8419
VL - 306
JO - Radiology
JF - Radiology
IS - 2
M1 - e220505
ER -