Generative modeling of single-cell RNA-seq data is crucial for tasks like trajectory
inference, batch effect removal, and simulation of realistic cellular data. However,
recent deep generative models simulating synthetic single cells from noise operate
on pre-processed continuous gene expression approximations, overlooking the
discrete nature of single-cell data, which limits their effectiveness and hinders the
incorporation of robust noise models. Additionally, aspects like controllable multi-
modal and multi-label generation of cellular data remain underexplored. This work
introduces CellFlow for Generation (CFGen), a flow-based conditional generative
model that preserves the inherent discreteness of single-cell data. CFGen generates
whole-genome multi-modal single-cell data reliably, improving the recovery of
crucial biological data characteristics while tackling relevant generative tasks such
as rare cell type augmentation and batch correction. We also introduce a novel
framework for compositional data generation using Flow Matching. By showcasing
CFGen on a diverse set of biological datasets and settings, we provide evidence
of its value to the fields of computational biology and deep generative models.
Copyright 2025 Society of Photo‑Optical Instrumentation Engineers (SPIE). One print or electronic copy may be made for personal use only. Systematic reproduction and distribution, duplication of any material in this publication for a fee or for commercial purposes, and modification of the contents of the publication are prohibited.