2092 - Longitudinal Performance Assessment of an AI-Based Auto-Segmentation Tool Using Statistical Process Control
Presenter(s)
D. Hobbis, Z. Chen, M. R. Waters, J. Hansen, N. Knutson, and G. D. Hugo; WashU Medicine, Department of Radiation Oncology, St. Louis, MO
Purpose/Objective(s): Artificial intelligence (AI) models, including auto-segmentation models, are trained on historical data and then deployed as static models in a clinical setting. Changes in the input data, such as planning imaging, may cause performance drift, as can operating the AI model on data outside the training distribution. In addition, modifying and approving auto-segmented volumes may introduce user bias. Automation bias and performance drift are therefore crucial factors to consider when implementing AI systems. We aim to determine the impact of multiple clinical factors on the longitudinal performance of a commercial auto-segmentation tool within our institution.
Materials/Methods: Retrospective datasets (n=215) for thorax and pelvis sites were collected from three discrete months spaced six months apart over one year. Month 1 was acquired immediately after commissioning and deployment of the AI system into clinical operation. Each dataset consists of auto-segmented and physician-approved volumes. Congruence between the volumes was quantified by the Dice similarity coefficient (DSC) and Hausdorff distance (HD). Statistical analysis of various clinical factors (month since deployment, scanner model, slice thickness) was performed using the Mann-Whitney U test. Statistical Process Control (SPC) was implemented for each site by setting 99th percentile control limits from the first month’s data and then identifying DSC values outside the control limits in months 2 and 3.
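As an illustration of the DSC and SPC steps described above, the following is a minimal Python sketch and not the authors' implementation. The function names (`dice`, `control_limits`, `out_of_control`) are hypothetical, and the two-sided central 99% interval is an assumption, since the abstract does not state whether the control limits were one-sided or two-sided. HD and the Mann-Whitney U test are omitted for brevity.

```python
import numpy as np

def dice(a, b):
    """Dice similarity coefficient between two binary masks of equal shape."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * np.logical_and(a, b).sum() / total

def control_limits(baseline, coverage=0.99):
    """Percentile control limits from baseline (month 1) DSC values.

    Assumption: a central interval covering `coverage` of the baseline,
    i.e. the 0.5th and 99.5th percentiles when coverage is 0.99.
    """
    tail = (1.0 - coverage) / 2.0 * 100.0
    lo, hi = np.percentile(baseline, [tail, 100.0 - tail])
    return lo, hi

def out_of_control(values, limits):
    """Boolean array marking later-month values outside the control limits."""
    lo, hi = limits
    values = np.asarray(values, dtype=float)
    return (values < lo) | (values > hi)
```

In this sketch, limits would be derived per organ from month 1 DSC values and then applied to months 2 and 3; the fraction of flagged cases can be compared with the roughly 1% expected under a 99% interval.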
Results: A total of 2475 inferences were made, spanning 24 organs-at-risk (OARs) across the two body sites. The mean DSC (HD) values for all OARs were 0.93 (11.4 mm), 0.94 (9.2 mm), and 0.94 (9.4 mm) for months one, two, and three, respectively. Significant differences in individual OARs by month were observed for four and two structures for DSC and HD, respectively. The thoracic data comparing 2 mm and 3 mm slice thickness indicate differences for six and five OARs for DSC and HD, respectively, whereas for the pelvic region (1.5 mm vs 3 mm), only a single OAR differed for DSC. For SPC, 4 pelvic and 2 thoracic organs had more cases outside the control limits than expected, including femurs, bladder, penile bulb, heart, and ribs. Individual review of out-of-distribution cases showed that these deviations were due to various sources, including the presence of metal artifact, pathology, and abnormal anatomy.
Conclusion: A commercial, AI-based auto-segmentation tool showed robust overall performance with minimal measurable temporal drift. Sensitivity of performance to image factors depended on body site and organ. Statistical process control identified out-of-distribution performance, which may be useful for monitoring this clinical process.