3660 - Seeing the Dose: Vision-Language AI for Automated Prostate Radiotherapy Dose Plan Quality Assessment
Presenter(s)
X. Li1, H. Wang1,2, Q. Man1, and J. Wang1; 1Department of Radiation Oncology, Peking University Third Hospital, Beijing, China, 2Cancer Center, Peking University Third Hospital, Beijing, China
Purpose/Objective(s): Visual assessment of radiation therapy (RT) dose distributions offers critical insights complementary to standard numerical DVH metrics by exposing spatially localized issues. This study investigates the feasibility of employing Vision-Language Large Models (VLLMs) to automatically detect clinically significant dose errors in prostate cancer radiotherapy plans. We hypothesize that, by leveraging their general visual understanding, VLLMs can outperform traditional fixed quality-detection tools, especially when equipped with appropriate prompt strategies and fine-tuning.
Materials/Methods: Retrospective data from 34 prostate cancer patients (20 for fine-tuning, 14 for evaluation) were used, including CT images, RT Dose, and RT Structure Sets. Clinically accepted plans served as negative samples. Four error scenarios—rectum hot spot, bladder hot spot, PTV cold spot, and PTV underdosage—were programmatically generated from the accepted plans using TiGRT-TPS to create positive samples. For visualization, dose distributions and isodose lines, along with structure contours, were overlaid on the CT images. The VLLMs were prompted to assess each error by responding with true/false and providing explanations. The responses were evaluated by a senior physicist in terms of sensitivity and specificity. Four VLLMs were tested: three out-of-the-box models (GPT-4o, Gemini 2.0 Pro, and Qwen2.5-VL-7B) and one fine-tuned model (Qwen2.5-VL-7B, fine-tuned on the 20 training patients).
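The true/false prompting and verdict-parsing step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the prompt wording, the scenario list ordering, and the `build_prompt`/`parse_verdict` helper names are assumptions, and the sketch omits the actual image upload and model call.

```python
# Hypothetical sketch of per-error prompting and response parsing
# for a VLLM-based dose QA check (helper names are illustrative).
import re

# The four programmatically injected error scenarios from the study.
ERROR_SCENARIOS = [
    "Rectum hot spot",
    "Bladder hot spot",
    "PTV cold spot",
    "PTV underdosage",
]

def build_prompt(scenario: str) -> str:
    """Compose a true/false question about one error scenario,
    to be sent alongside the overlaid CT + dose + contour image."""
    return (
        "The attached axial CT slice shows the planned dose distribution, "
        "isodose lines, and structure contours for a prostate plan. "
        f"Does this plan contain a '{scenario}' error? "
        "Answer 'true' or 'false', then briefly explain."
    )

def parse_verdict(response: str) -> bool:
    """Extract the first true/false verdict from a free-text reply."""
    match = re.search(r"\b(true|false)\b", response.lower())
    if match is None:
        raise ValueError("No true/false verdict found in response")
    return match.group(1) == "true"
```

In practice each evaluation image would be scored once per scenario, giving four binary verdicts (plus explanations) per plan for the physicist to review.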
Results: Among the out-of-the-box models, Qwen2.5-VL-7B achieved an average sensitivity of 75% and specificity of 73%. Gemini 2.0 Pro followed with 70% sensitivity and 68% specificity, while GPT-4o had the lowest performance at 65% sensitivity and 64% specificity. The fine-tuned Qwen2.5-VL-7B improved markedly, reaching 93% sensitivity and 94% specificity (95%/96% for rectum hot spot, 92%/93% for bladder hot spot, 94%/95% for PTV cold spot, and 91%/92% for PTV underdosage). Out-of-the-box VLLMs performed poorly for automated RT dose assessment, likely due to the lack of domain-specific training data. Interestingly, despite their larger parameter counts, GPT-4o and Gemini 2.0 Pro underperformed relative to the smaller Qwen2.5-VL-7B, suggesting that data quality and domain relevance outweigh model size. The substantial performance gain after fine-tuning Qwen2.5-VL-7B on data from only 20 patients highlights the strong general visual understanding of VLLMs, enabling effective few-shot adaptation.
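For reference, the sensitivity and specificity figures reported above are computed from the model verdicts over the error-injected (positive) and clinically accepted (negative) plans. A generic sketch of this calculation (not the authors' evaluation code) is:

```python
def sensitivity_specificity(y_true, y_pred):
    """Compute sensitivity (true-positive rate on error-injected plans)
    and specificity (true-negative rate on clinically accepted plans).

    y_true, y_pred: sequences of booleans, True = error present/flagged.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity
```

In this study the metrics would be computed per error scenario and then averaged to give the overall figures quoted for each model.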
Conclusion: These findings highlight the capability of fine-tuned VLLMs to strengthen quality assurance in radiotherapy by automating the detection of clinically significant dose errors, which reduces clinician workload and enhances patient safety. Furthermore, the fine-tuned VLLM holds potential as a reward model for auto-planning, enabling it to provide feedback that refines treatment plans automatically and supports a more efficient radiotherapy process.