The evolving landscape of cancer transcriptomics data

April 21, 2025

Authors: Valentin Bernu, Helia Brull Corretger, Charles Lescure, Abdelkader Behdenna, Julien Haziza, Lea Meunier, Clemence Petit, Akpeli Nordor

Abstract

Secondary analysis of omic datasets holds great potential for accelerating precision oncology research and drug development. Yet scientists lack efficient tools to assess the scope and volume of these datasets. Data heterogeneity and the limited browsing features of public databases make it difficult to search for relevant datasets by specific criteria, such as experimental design. We introduce the latest iteration of our solution to map and navigate public databases.


Our solution extracts and preprocesses clinical data that serves as input for our classification models, which range from rule-based algorithms developed in collaboration with domain experts to large language models. We categorize datasets by cancer type, treatment modality, sequencing resolution, cell line identity, mutational status, and drug information. This iteration introduces models for predicting mutation and drug-related data, and continuously updates the mappings of the existing models (cancer type, technology, and treatment) by incorporating new entries. Additionally, this version integrates two databases, Gene Expression Omnibus (GEO) and ArrayExpress, and includes a dashboard for querying and visualizing the automatically tagged datasets.
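
To make the rule-based end of that model range concrete, the following is a minimal sketch of keyword-based cancer type tagging applied to a dataset's free-text description. The abstract does not describe the actual rules, so the rule table `CANCER_TYPE_RULES`, the function `tag_cancer_types`, and every pattern below are hypothetical and serve only as an illustration.

```python
import re

# Hypothetical keyword rules: the real rule set, built with domain experts,
# is not described in the abstract.
CANCER_TYPE_RULES = {
    "breast cancer": [r"\bbreast (cancer|carcinoma|tumou?rs?)\b", r"\bTNBC\b"],
    "non-small cell lung cancer": [r"\bnon[- ]small[- ]cell lung\b", r"\bNSCLC\b"],
    "acute myeloid leukemia": [r"\bacute myeloid leuk(a)?emia\b", r"\bAML\b"],
    "colorectal cancer": [r"\bcolorectal\b", r"\bcolon (cancer|carcinoma)\b", r"\bCRC\b"],
}


def tag_cancer_types(text):
    """Return the sorted list of cancer types whose patterns match the text."""
    tags = []
    for label, patterns in CANCER_TYPE_RULES.items():
        # A label is assigned as soon as any one of its patterns matches.
        if any(re.search(p, text, flags=re.IGNORECASE) for p in patterns):
            tags.append(label)
    return sorted(tags)


if __name__ == "__main__":
    summary = "KRAS-mutant colorectal and breast carcinoma cell lines"
    print(tag_cancer_types(summary))
```

Rules of this kind are transparent and easy to audit with domain experts, but they miss paraphrases and ambiguous abbreviations, which is presumably where the language-model-based classifiers mentioned above become useful.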

We identified 24,131 cancer-related, patient-derived omic datasets from GEO. Our models estimate the following:

  • Cancer types: the five most frequent indications (breast cancer, brain cancer, non-small cell lung cancer, acute myeloid leukemia, colorectal cancer) account for about 60% of datasets.
  • Technology: RNA-Seq accounts for 29% of datasets, of which an estimated 24% are single-cell and 76% bulk RNA-Seq. RNA-Seq has become the dominant technology among newly published studies: in 2012, microarray studies published that year outnumbered RNA-Seq studies 10:1; by 2023, RNA-Seq studies published annually had overtaken microarray studies by a comparable margin. Today, the total volume of accumulated studies for each approach is nearly equal.
  • Treatment: approximately 25% of datasets include treatment information; post-treatment samples are estimated to represent 15%.
  • Mutations: information is available in 26% of datasets. Notably, 26% of colorectal cancer datasets contain mutation information, of which 35% report KRAS mutations.
  • Drugs: information related to antineoplastic agents is found in 11% of GEO cancer datasets.
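
For a sense of scale, the percentages above can be converted into approximate dataset counts. The short sketch below multiplies the GEO total stated in this abstract by the reported fractions. Because the published percentages are rounded, the resulting counts are rough estimates rather than reported figures, and the helper `approx_count` is a name introduced here for illustration.

```python
TOTAL_GEO_DATASETS = 24_131  # cancer-related GEO datasets, as stated in the abstract


def approx_count(total, *fractions):
    """Multiply a total by successive fractions and round to the nearest integer."""
    value = float(total)
    for fraction in fractions:
        value *= fraction
    return round(value)


# Fractions are the rounded percentages from the list above, so counts are approximate.
rna_seq = approx_count(TOTAL_GEO_DATASETS, 0.29)            # roughly 6,998 datasets
single_cell = approx_count(TOTAL_GEO_DATASETS, 0.29, 0.24)  # roughly 1,680 datasets
with_treatment = approx_count(TOTAL_GEO_DATASETS, 0.25)     # roughly 6,033 datasets
with_drug_info = approx_count(TOTAL_GEO_DATASETS, 0.11)     # roughly 2,654 datasets

if __name__ == "__main__":
    print(rna_seq, single_cell, with_treatment, with_drug_info)
```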

Our analysis of ArrayExpress identified 16,255 cancer-related datasets, 77% of which are also available on GEO.
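
The abstract does not state how the overlap between the two databases was determined. One plausible approach, sketched below as an assumption, relies on the accession convention under which ArrayExpress entries imported from GEO are named `E-GEOD-<n>` for the GEO series `GSE<n>`. The functions `to_geo_accession` and `overlap_fraction` are hypothetical names introduced for this illustration.

```python
import re

# ArrayExpress accessions of the form E-GEOD-<n> denote series imported from GEO (GSE<n>).
_GEOD_PATTERN = re.compile(r"^E-GEOD-(\d+)$")


def to_geo_accession(arrayexpress_accession):
    """Map E-GEOD-<n> to GSE<n>; return None for accessions not imported from GEO."""
    match = _GEOD_PATTERN.match(arrayexpress_accession.strip().upper())
    return f"GSE{match.group(1)}" if match else None


def overlap_fraction(arrayexpress_accessions, geo_accessions):
    """Fraction of ArrayExpress accessions whose GEO counterpart is in the GEO set."""
    geo = {accession.strip().upper() for accession in geo_accessions}
    arrayexpress = list(arrayexpress_accessions)
    if not arrayexpress:
        return 0.0
    shared = sum(1 for accession in arrayexpress if to_geo_accession(accession) in geo)
    return shared / len(arrayexpress)


if __name__ == "__main__":
    # Toy accessions for illustration only, not real query results.
    ae = ["E-GEOD-1", "E-GEOD-2", "E-MTAB-3", "E-GEOD-4"]
    geo = ["GSE1", "GSE2", "GSE4", "GSE9"]
    print(overlap_fraction(ae, geo))
```

Accession matching only catches datasets that were imported with a linked identifier. Studies deposited independently in both databases would need matching on titles or other metadata, which this sketch does not attempt.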


Our unified data mapping approach streamlines the identification and querying of datasets across multiple public databases. Additionally, it uncovers otherwise hard-to-detect information and reveals meaningful trends, thereby providing researchers with valuable insights for conducting informed secondary data analyses.