American Association for Cancer Research (AACR) 2025
April 21, 2025
Authors: Valentin Bernu; Guillaume Appe; Abdelkader Behdenna; Antoine Gaston; Julien Haziza; Akpeli Nordor
Abstract
Secondary analysis of omic datasets holds great potential for accelerating precision oncology research and drug development. A major obstacle, however, is harmonizing clinical information across diverse data sources. Natural language processing techniques harmonize many clinical variables well, but their performance drops on more intricate ones (e.g., histological type). Here, we introduce a method to infer clinical information directly from molecular data.
For each clinical variable, we identify the most informative genes in the gene expression data using an analysis of variance. We then train machine learning classifiers (XGBoost) on the selected features. The Cancer Genome Atlas (TCGA) pancancer database is used for training, and the Gene Expression Omnibus (GEO) database is used for independent testing. Expert scientists labeled the clinical variables from both sources. Because some classes of clinical variables are underrepresented in TCGA, they are excluded from the training and test sets. TCGA encompasses 11,000 transcriptomic profiles spanning 33 cancer types, while the 210 aggregated GEO datasets include 7,077 transcriptomic profiles across 13 cancer types. A total of 15,430 genes are profiled in both TCGA and these GEO datasets.
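The feature selection step described above can be illustrated with a minimal sketch, assuming a one-way ANOVA F statistic is computed per gene and the highest-scoring genes are kept. The function names, the toy gene names, and the `top_k` cutoff are hypothetical and are not taken from the abstract; classifier training (XGBoost) is not shown.

```python
from collections import defaultdict


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one gene across the classes in `labels`."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)

    n_samples, n_groups = len(values), len(groups)
    if n_groups < 2 or n_samples <= n_groups:
        return 0.0  # F is undefined; treat the gene as uninformative

    grand_mean = sum(values) / n_samples
    means = {label: sum(g) / len(g) for label, g in groups.items()}

    # Variation of class means around the grand mean
    ss_between = sum(len(g) * (means[label] - grand_mean) ** 2
                     for label, g in groups.items())
    # Variation of samples around their own class mean
    ss_within = sum((v - means[label]) ** 2
                    for label, g in groups.items() for v in g)

    if ss_within == 0.0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (n_groups - 1)) / (ss_within / (n_samples - n_groups))


def select_top_genes(expression, labels, top_k):
    """Rank genes by F score; `expression` maps gene -> per-sample values."""
    scores = {gene: anova_f_score(vals, labels)
              for gene, vals in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy data (hypothetical): six samples, two classes of one clinical variable
labels = ["lung"] * 3 + ["breast"] * 3
expression = {
    "GENE_A": [1.0, 1.2, 0.9, 5.0, 5.1, 4.8],  # separates the two classes
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # same mean in both classes
    "GENE_C": [0.5, 3.0, 1.5, 0.7, 2.9, 1.4],  # noisy, no class signal
}
print(select_top_genes(expression, labels, top_k=1))
```

In this sketch the ranking is recomputed for each clinical variable, since a gene that separates primary sites need not separate sample types.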
We assess our method on five clinical variables. For biopsy site, sample type, and histological subtype, the accuracy of the current models is encouraging and suggests that they capture the underlying patterns. For primary site and histological type, the models achieve high accuracy and macro-averaged F1 scores: 93% accuracy with a 0.89 F1 score across 11 classes for primary site, and 96% accuracy with a 0.91 F1 score across 6 classes for histological type.
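For readers less familiar with the two reported metrics, the following sketch shows how accuracy and a macro-averaged F1 score can be computed from true and predicted labels. The labels are toy values, not the study's data, and averaging over the union of true and predicted classes is an assumption of this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores, so rare classes count equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels (hypothetical): one lung sample is misclassified as breast
y_true = ["lung", "lung", "lung", "breast", "breast", "colon"]
y_pred = ["lung", "lung", "breast", "breast", "breast", "colon"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

Because the macro average weights every class equally, it can sit below accuracy when smaller classes are predicted less reliably, which is consistent with the gap between the two metrics reported above.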
This method offers a significant opportunity to complement expert data labeling, addressing the challenge of scaling clinical data harmonization for the secondary analysis of omic data across multiple applications in cancer drug discovery and development. Our findings are particularly encouraging for inferring primary site and histological type, and training and testing on separate databases further supports the method's generalizability. Nonetheless, performance has been evaluated on a limited subset of classes for each clinical variable, so validation on more diverse datasets is needed to establish how broadly the method applies.