2003 - Decoding Auto-Segmentation: A Deep Qualitative Analysis of Commercial Tools for Organs-at-Risk Delineation

02:30pm - 04:00pm PT

Hall F

Screen: 12

Presenter(s)

Ranjini Tolakanahalli, PhD - Miami Cancer Institute: Baptist Health South Florida, Miami, FL

E. Y. Y. Akdemir¹, R. P. Tolakanahalli¹,², C. M. Rivera¹, D. Wang¹, S. Yarlagadda¹, A. Gutierrez¹, R. Herrera¹, D. J. Wieczorek¹,², Y. C. Lee¹,², V. Chaswal¹, R. Kotecha¹,³, M. P. Mehta¹,², and M. Leyva⁴; ¹Department of Radiation Oncology, Miami Cancer Institute, Baptist Health South Florida, Miami, FL, ²Herbert Wertheim College of Medicine, Florida International University, Miami, FL, ³Miami Cancer Institute, Baptist Health South Florida, Miami, FL, ⁴Department of Radiation Oncology, Miami Cancer Institute, Baptist Health South Florida, Miami, FL

Purpose/Objective(s):

Despite significant advances in artificial intelligence (AI)–driven auto-segmentation (AS) tools, the performance of commercially available options for organs-at-risk (OARs) delineation remains variable. These tools are predominantly evaluated using quantitative (geometric) metrics, which leaves their clinical relevance uncertain. This study aims to evaluate the performance and usability of widely used commercial AS algorithms using real-world patient data.

Materials/Methods:

Anonymized computed tomography (CT) datasets from patients treated for various diseases were uploaded to four contouring platforms (AC [auto-contour]) to generate contours in seven anatomical regions: brain, head and neck, thorax, breast, abdomen, male pelvis, and female pelvis. The AS platforms were labeled AC1 through AC4, and their identities were blinded to the two evaluating physicians, who had >10 and >5 years of experience. Contours were graded on a four-point scale according to the extent of revision needed and were categorized as usable [0: no revision; 1: minimal revision (0–10%); 2: moderate revision (10–20%)] or unusable [3: major revision (>20%)]. For each revision level, the counts from both physicians were summed and divided by the total number of assessments to calculate the corresponding pooled percentage. The Chi-square test was used to compare the cohorts, with a significance threshold of p < 0.05.
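The pooling and comparison steps described above can be illustrated with a minimal Python sketch. It is not the authors' analysis code. The function names (`pooled_percentages`, `chi_square_statistic`) and all grades and counts are hypothetical, chosen only to show the arithmetic. It assumes grades are coded 0 to 3 as in the grading system and that the comparison uses a standard Pearson chi-square on a usable/unusable by platform table.

```python
from collections import Counter

# Grade codes from the four-point scale described in the Methods.
GRADE_LABELS = {0: "none", 1: "minimal", 2: "moderate", 3: "major"}


def pooled_percentages(grades_a, grades_b):
    """Sum both physicians' counts per grade and divide by all assessments."""
    pooled = Counter(grades_a) + Counter(grades_b)
    total = len(grades_a) + len(grades_b)
    return {g: 100.0 * pooled.get(g, 0) / total for g in GRADE_LABELS}


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical grades for eight contours, one list per physician.
physician_a = [0, 0, 1, 2, 3, 0, 1, 0]
physician_b = [0, 1, 1, 2, 3, 0, 0, 1]
for grade, pct in pooled_percentages(physician_a, physician_b).items():
    print(f"{GRADE_LABELS[grade]:>8} revision: {pct:.1f}%")

# Hypothetical usable (row 0) vs. unusable (row 1) counts for AC1..AC4.
counts = [[95, 93, 95, 89],
          [5, 7, 5, 11]]
stat, dof = chi_square_statistic(counts)
# 7.815 is the chi-square critical value for 3 degrees of freedom at alpha = 0.05.
print(f"chi-square = {stat:.2f}, dof = {dof}, significant = {stat > 7.815}")
```

In practice a statistics package would normally report the exact p-value. The hand-computed statistic is shown here only to make the calculation explicit.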

Results:

A total of 2,327 auto-segmented contours were evaluated independently by two physicians, yielding 4,654 individual assessments across the seven tested disease sites. Overall, 46.3% of contours required no revision, while 34.9%, 11.6%, and 7.3% required minimal, moderate, and major revisions, respectively. By disease site, the highest rates of unusable contours were observed in the thorax (10.9%), followed by the brain (10.0%), while the lowest rates were observed in the male pelvis (5.3%) and the head and neck region (5.3%). At the individual OAR level, the optic chiasm required the most revisions (no cases were deemed acceptable without revision), whereas the spinal cord/canal had the highest acceptance rate (74.1% required no revision). When stratified by algorithm, the proportions deemed unusable were 5.5%, 7.2%, 5.1%, and 11.3% for AC1, AC2, AC3, and AC4, respectively (p < 0.001).

Conclusion:

This qualitative evaluation highlights substantial variability in the performance of commercial AS platforms across different anatomical sites. Although nearly half of the contours were acceptable without any revision, 7.3% were deemed unusable, underscoring the need for algorithmic refinements and for testing the “final product” in a real-world clinical environment. The optic chiasm emerged as a particularly challenging OAR, indicating a potential role for MRI-based contouring. These findings emphasize the importance of continued optimization in AS technologies to enhance their clinical utility and acceptance.