American Association for Cancer Research (AACR) 2025
April 21, 2025
Authors: Valentin Bernu, Helia Brull Corretger, Charles Lescure, Abdelkader Behdenna, Julien Haziza, Lea Meunier, Clemence Petit, Akpeli Nordor
Abstract
Secondary analysis of omic datasets holds great potential for accelerating precision oncology research and drug development. Yet scientists lack efficient tools to assess the scope and volume of those datasets. Data heterogeneity and the limited browsing features of public databases make it difficult to search for relevant datasets by specific criteria, such as experimental design. We introduce the latest iteration of our solution to map and navigate public databases.
Our solution extracts and preprocesses clinical data that serve as input for our classification models, which range from rule-based algorithms developed in collaboration with domain experts to large language models. We categorize datasets by cancer type, treatment modality, sequencing resolution, cell line identity, mutational status, and drug information. This iteration introduces models for predicting mutation- and drug-related data and continuously updates the mappings of the existing models (cancer type, technology, and treatment) by incorporating new entries. Additionally, this version integrates two databases, Gene Expression Omnibus (GEO) and ArrayExpress, and includes a dashboard for querying and visualizing the automatically tagged datasets.
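To make the rule-based end of that model range concrete, the following is a minimal sketch of keyword-driven tagging of dataset descriptions. It is an illustration only: the abstract does not describe the actual rules, so the `RULES` table, its category and label names, and the `tag_dataset` helper are all hypothetical.

```python
import re

# Hypothetical rules (not from the abstract): category -> label -> regex patterns.
RULES = {
    "cancer_type": {
        "breast cancer": [r"\bbreast (cancer|carcinoma|tumou?r)\b"],
        "melanoma": [r"\bmelanoma\b"],
    },
    "treatment": {
        "chemotherapy": [r"\bchemotherap\w*\b", r"\bcisplatin\b"],
        "immunotherapy": [r"\bimmunotherap\w*\b", r"\banti-PD-?1\b"],
    },
    "sequencing_resolution": {
        "single-cell": [r"\bsingle[- ]cell\b", r"\bscRNA-?seq\b"],
    },
}


def tag_dataset(text: str) -> dict[str, list[str]]:
    """Return, for each category, the sorted labels whose patterns match the text."""
    tags = {}
    for category, labels in RULES.items():
        hits = [
            label
            for label, patterns in labels.items()
            if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns)
        ]
        tags[category] = sorted(hits)
    return tags


if __name__ == "__main__":
    description = (
        "Single-cell RNA-seq of melanoma patients treated with "
        "anti-PD-1 immunotherapy"
    )
    print(tag_dataset(description))
```

A rule table of this kind is easy for domain experts to review and extend, which is one plausible reason to pair it with language models for the fields that keywords capture poorly.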
We identified 24,131 cancer-related, patient-derived omic datasets from GEO. Our models estimate the following:
The analysis of ArrayExpress found 16,255 cancer-related datasets, 77% of which also exist on GEO.
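An overlap figure of this kind can be computed by matching accessions across the two databases. The sketch below assumes that ArrayExpress entries imported from GEO carry accessions of the form `E-GEOD-<n>`, corresponding to GEO series `GSE<n>`. The abstract does not state how the overlap was actually determined, and the helper names `geo_accession` and `overlap_fraction` are hypothetical.

```python
import re


def geo_accession(arrayexpress_id: str):
    """Map an 'E-GEOD-<n>' accession to 'GSE<n>'; return None for other accessions.

    Assumption: GEO-imported ArrayExpress entries use the 'E-GEOD-' prefix.
    """
    match = re.fullmatch(r"E-GEOD-(\d+)", arrayexpress_id.strip())
    return f"GSE{match.group(1)}" if match else None


def overlap_fraction(arrayexpress_ids, geo_ids) -> float:
    """Fraction of distinct ArrayExpress accessions whose GEO counterpart is in geo_ids."""
    ae = set(arrayexpress_ids)
    geo = set(geo_ids)
    if not ae:
        return 0.0
    shared = sum(1 for accession in ae if geo_accession(accession) in geo)
    return shared / len(ae)


if __name__ == "__main__":
    # Toy accessions for illustration; not real study identifiers from the analysis.
    ae_ids = ["E-GEOD-1", "E-GEOD-2", "E-MTAB-3", "E-GEOD-4"]
    geo_ids = ["GSE1", "GSE2", "GSE4", "GSE9"]
    print(f"{overlap_fraction(ae_ids, geo_ids):.0%} of ArrayExpress entries also exist on GEO")
```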
Our unified data mapping approach streamlines the identification and querying of datasets across multiple public databases. Additionally, it uncovers otherwise hard-to-detect information and reveals meaningful trends, thereby providing researchers with valuable insights for conducting informed secondary data analyses.