Empirical Bayesian batch effect correction for single-cell RNA-seq data

April 21, 2025

Authors: Maximilien Colange; Guillaume Appe; Lea Meunier; Solene Weill; Akpeli Nordor; Abdelkader Behdenna

Abstract

We introduce pycombat_sc, a Bayesian method to assess and correct batch effects in single-cell RNA-Seq (scRNA-Seq) data. scRNA-Seq data is similar to bulk RNA-Seq data, except for a higher dropout rate: each cell contains far less genetic material, so the signal of a gene may fall below the detection threshold. Bayesian approaches for RNA-Seq data analysis (ComBat-Seq, DESeq2, edgeR) model RNA-Seq expression data with a negative binomial distribution. As suggested earlier (ZINB-WaVE), we propose to model scRNA-Seq data with a zero-inflated negative binomial distribution, where the inflation of zeroes accounts for the higher dropout rate.
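To make the distribution concrete, here is a minimal sketch of a zero-inflated negative binomial probability mass function. It assumes the usual mean/dispersion parametrization (variance = mu + phi * mu^2); the function names and the symbols mu, phi and pi are illustrative and are not taken from the pycombat_sc code.

```python
import math

def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi > 0.

    Assumed parametrization: variance = mu + phi * mu**2.
    """
    r = 1.0 / phi  # "size" parameter of the negative binomial
    log_p = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(r / (r + mu))
        + k * math.log(mu / (r + mu))
    )
    return math.exp(log_p)

def zinb_pmf(k, mu, phi, pi):
    """Zero-inflated NB pmf: with probability pi the count is a structural zero
    (dropout); otherwise it is drawn from the negative binomial."""
    base = (1.0 - pi) * nb_pmf(k, mu, phi)
    return pi + base if k == 0 else base
```

With pi = 0 the model reduces to the plain negative binomial; any pi > 0 moves probability mass from the positive counts to zero.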

 

Our approach is the continuation of ComBat and ComBat-Seq, which model microarray and RNA-Seq data with normal and negative binomial distributions, respectively. Our approach extends the ComBat-Seq model with a zero-inflation parameter. We propose two variants of the algorithm:
• one where the zero-inflation component is batch-dependent, i.e. the dropout rate differs from one batch to another;
• one where the zero-inflation component is batch-independent, i.e. the dropout rate is the same for all batches.
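The difference between the two variants can be illustrated by the fraction of zero counts each one predicts for a single gene. This is a sketch under stated assumptions: the batch names, parameter values and helper functions are hypothetical, and the mean/dispersion parametrization (variance = mu + phi * mu^2) is assumed rather than taken from the source.

```python
def nb_zero_prob(mu, phi):
    """P(Y = 0) under a negative binomial with mean mu and dispersion phi."""
    r = 1.0 / phi
    return (r / (r + mu)) ** r

def zinb_zero_prob(mu, phi, pi):
    """P(Y = 0) once a zero-inflation (dropout) probability pi is added."""
    return pi + (1.0 - pi) * nb_zero_prob(mu, phi)

# Hypothetical negative binomial parameters of one gene in two batches.
batches = {
    "batch_1": {"mu": 2.0, "phi": 0.5},
    "batch_2": {"mu": 3.0, "phi": 0.5},
}

# Variant 1 (batch-dependent): each batch has its own dropout probability.
pi_by_batch = {"batch_1": 0.1, "batch_2": 0.4}
zeros_dependent = {
    name: zinb_zero_prob(p["mu"], p["phi"], pi_by_batch[name])
    for name, p in batches.items()
}

# Variant 2 (batch-independent): a single dropout probability for all batches.
pi_shared = 0.25
zeros_independent = {
    name: zinb_zero_prob(p["mu"], p["phi"], pi_shared)
    for name, p in batches.items()
}
```

In the batch-dependent variant, the excess of zeros is itself part of the batch effect to be estimated; in the batch-independent variant, any difference in zero fractions between batches must be explained by the negative binomial parameters alone.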

 

Parametric Bayesian approaches, such as ComBat or ComBat-Seq, are recognized for their versatility (through their ability to account for user-provided covariates in the model) and for their robustness to small cohorts. pycombat_sc retains the versatility of its predecessors, with the ability to account for arbitrary covariates. This is in contrast to recent approaches based on deep learning, which typically require training data and therefore face two limitations:
• These methods are sensitive to any bias present in the training data. Combined with the lack of interpretability of the deep-learning models used, this shortcoming calls for thorough validation to detect such bias.
• Training deep-learning models typically requires large amounts of data. While scRNA-Seq cohorts typically feature hundreds or thousands of cells, the challenge here is to properly capture inter-cohort variability; otherwise the model risks over-fitting to intra-cohort variability.

 

To properly assess the potential of our approach, we compare it to state-of-the-art tools, based either on clustering (Seurat, harmony) or on deep-learning models (scvi, scanvi, scanorama). For the sake of completeness, we also include ComBat and ComBat-Seq in our benchmark. To our knowledge, our study is the first to evaluate ComBat-Seq for batch effect correction in scRNA-Seq data, although its negative binomial model is a priori better suited to scRNA-Seq counts than the normal model of ComBat.