Main Session
Sep
29
PQA 06 - Radiation and Cancer Biology, Health Care Access and Engagement
3026 - Identifying Age, Race, and Sex Biases in Public Imaging and Genomic Datasets for Radiotherapy Patients
Presenter(s)
Joseph Bae, PhD, MS, BS - Stony Brook University, Stony Brook, NY
J. Bae¹ and B. Cha²; ¹Department of Biomedical Informatics, Stony Brook University Hospital, Stony Brook, NY; ²Department of Radiation Oncology, Kaiser Permanente LA Medical Center, Los Angeles, CA
Purpose/Objective(s):
Medical research relies on diverse study samples to adequately represent heterogeneous patient populations. For instance, a substantial body of research suggests that artificial intelligence (AI) models may exhibit biased performance when trained on non-diverse datasets. Public imaging and genomic data repositories, including The Cancer Imaging Archive (TCIA) and The Cancer Genome Atlas (TCGA), are valuable resources for radiotherapy (RT) research, but in some cases they provide limited patient demographic data. Here we identify all available RT datasets on TCIA and TCGA and report their demographic information, comparing it with the cancer patient population in the United States.
Materials/Methods:
TCGA and TCIA were queried for all datasets with RT-treated patients. For each dataset, patient age, race, and sex were collected where available. A t-test was used to compare age, and a chi-squared test was used to compare sex and race data, between TCIA/TCGA data and US government-reported statistics for the 10 most represented cancer types across the datasets: prostate, brain, breast, uterine, renal, lung, ovarian, cervical, thyroid, and head/neck (HNSCC). Statistical significance was assessed at p<0.05.
Results:
A total of 23,359 RT-treated patients (TCIA: 10,729; TCGA: 12,630) were identified from 81 datasets (TCIA: 39; TCGA: 42). For TCIA, 14 (36%) datasets and 6,911 (64%) patients had corresponding age and sex data available, with mean age 60 (std: 12) and 2,319 (34%) female patients. Four datasets (10%) and 1,413 (14%) patients had corresponding race data. Of these, there were 939 (66%) White, 237 (17%) Black, 45 (3%) Asian, 8 (1%) Native American, and 184 (13%) Other Race patients. Patients were significantly younger than population statistics, by 3-9 years, for all cancers studied. Female patients were significantly under-represented in all cancer types. Two of the 4 datasets with race data (breast, lung) differed significantly in race composition from their respective cancer populations. For TCGA, all datasets contained age, sex, and race data, with mean age 58 (std: 16) and 6,499 (51%) female patients. There were 9,164 (73%) White, 1,407 (11%) Black, 785 (6%) Asian, 45 (<1%) Native American, and 1,229 (10%) Unknown Race patients. Patients were significantly younger than population statistics, by 2-9 years, for all cancers except uterine cancer. No significant difference in sex distribution was observed, except for a significant overrepresentation of men in lung cancer. There was a significant difference in race composition for all studied cancers other than HNSCC.
Conclusion:
Public RT datasets are generally not representative of cancer patient populations. This must be accounted for to avoid implicit bias in research, especially in AI research, where future studies might investigate techniques for minimizing the effect of non-representative datasets.
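
As an illustration of the comparisons described under Materials/Methods, the following is a minimal sketch, not the authors' analysis code. It assumes a chi-squared goodness-of-fit test for sex and a one-sample t-test computed from summary statistics for age (the abstract does not specify the exact test variants). The observed counts, mean, and standard deviation are the TCIA figures reported above; the reference values `REF_FEMALE_PROP` and `REF_MEAN_AGE` are hypothetical placeholders, not actual US population statistics, and all function names are illustrative.

```python
import math


def chi2_sf(x, df):
    """P(X >= x) for a chi-squared variable (df = 1 or even df only)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df % 2 == 0:
        half = x / 2.0
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    raise ValueError("this sketch handles only df = 1 or even df")


def chi2_goodness_of_fit(observed, expected_props):
    """Chi-squared statistic and p-value for counts vs. reference proportions."""
    total = sum(observed)
    expected = [total * p for p in expected_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, chi2_sf(stat, len(observed) - 1)


def one_sample_t(mean, std, n, ref_mean):
    """One-sample t statistic from summary statistics.

    With thousands of patients the t distribution is very close to normal,
    so the two-sided p-value uses the normal approximation.
    """
    t = (mean - ref_mean) / (std / math.sqrt(n))
    return t, math.erfc(abs(t) / math.sqrt(2.0))


# Hypothetical reference values (placeholders, NOT real population data).
REF_FEMALE_PROP = 0.5
REF_MEAN_AGE = 66.0

# TCIA figures reported in the abstract: 2,319 female of 6,911 patients.
female, total = 2319, 6911
sex_stat, sex_p = chi2_goodness_of_fit(
    [female, total - female], [REF_FEMALE_PROP, 1.0 - REF_FEMALE_PROP]
)

# TCIA age summary reported in the abstract: mean 60, std 12.
age_t, age_p = one_sample_t(60.0, 12.0, total, REF_MEAN_AGE)

print(f"sex: chi2={sex_stat:.1f}, significant={sex_p < 0.05}")
print(f"age: t={age_t:.1f}, significant={age_p < 0.05}")
```

In practice the comparison would be repeated per cancer type with the corresponding reference statistics, and a race comparison would pass one count and one reference proportion per category to `chi2_goodness_of_fit`.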