TY - JOUR
T1 - A study of crowdsourced segment-level surgical skill assessment using pairwise rankings
AU - Malpani, Anand
AU - Vedula, S. Swaroop
AU - Chen, Chi Chiung Grace
AU - Hager, Gregory D.
N1 - Funding Information:
We acknowledge all participants in our crowdsourcing user study, and Intuitive Surgical, Inc., for facilitating capture of data from the dVSS. A combined effort from the Language of Surgery project team led to the development of the manual task segmentation. The Johns Hopkins Science of Learning Institute and internal funding from the Johns Hopkins University supported this work.
Publisher Copyright:
© 2015, CARS.
PY - 2015/9/13
Y1 - 2015/9/13
AB - Purpose: Currently available methods for surgical skills assessment are either subjective or only provide global evaluations for the overall task. Such global evaluations do not inform trainees about where in the task they need to perform better. In this study, we investigated the reliability and validity of a framework to generate objective skill assessments for segments within a task, and compared assessments from our framework, using crowdsourced segment ratings from surgically untrained individuals and expert surgeons, against manually assigned global rating scores. Methods: Our framework includes (1) a binary classifier trained to generate preferences for pairs of task segments (i.e., given a pair of segments, specifying which one was performed better), (2) computing segment-level percentile scores based on the preferences, and (3) predicting task-level scores using the segment-level scores. We conducted a crowdsourcing user study to obtain manual preferences for segments within a suturing and knot-tying task from a crowd of surgically untrained individuals and a group of experts. We analyzed the inter-rater reliability of preferences obtained from the crowd and experts, and investigated the validity of task-level scores obtained using our framework. In addition, we compared the accuracy of the crowd and expert preference classifiers, as well as the segment- and task-level scores obtained from the classifiers. Results: We observed moderate inter-rater reliability within the crowd (Fleiss’ kappa, κ = 0.41) and experts (κ = 0.55). For both the crowd and experts, the accuracy of an automated classifier trained using all task segments was above par compared with the inter-rater agreement [crowd classifier 85% (SE 2%), expert classifier 89% (SE 3%)]. We predicted the overall global rating scores (GRS) for the task with a root-mean-squared error lower than one standard deviation of the ground-truth GRS. We observed a high correlation between segment-level scores (ρ ≥ 0.86) obtained using the crowd and expert preference classifiers. The task-level scores obtained using the crowd and expert preference classifiers were also highly correlated with each other (ρ ≥ 0.84), and statistically equivalent within a margin of two points (for a score ranging from 6 to 30). Our analyses, however, did not demonstrate statistically significant equivalence of accuracy between the crowd and expert classifiers within a 10% margin. Conclusions: Our framework, implemented using crowdsourced pairwise comparisons, leads to valid objective surgical skill assessment for segments within a task and for the task overall. Crowdsourcing yields reliable pairwise comparisons of skill for segments within a task with high efficiency. Our framework may be deployed within surgical training programs for objective, automated, and standardized evaluation of technical skills.
KW - Activity segments
KW - Crowdsourcing
KW - Feedback
KW - Pairwise comparisons
KW - Robotic surgery
KW - Skill assessment
KW - Task decomposition
KW - Task flow
KW - Training
UR - http://www.scopus.com/inward/record.url?scp=84941418149&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84941418149&partnerID=8YFLogxK
U2 - 10.1007/s11548-015-1238-6
DO - 10.1007/s11548-015-1238-6
M3 - Article
C2 - 26133652
AN - SCOPUS:84941418149
SN - 1861-6410
VL - 10
SP - 1435
EP - 1447
JO - International Journal of Computer Assisted Radiology and Surgery
JF - International Journal of Computer Assisted Radiology and Surgery
IS - 9
ER -