scDataset: Scalable Data Loading for Deep Learning on Large-Scale Single-Cell Omics
Davide D'Ascenzo, Sebastiano Cultrera di Montesano
https://arxiv.org/abs/2506.01883
Modern single-cell datasets now comprise hundreds of millions of cells, presenting significant challenges for training deep learning models that require shuffled, memory-efficient data loading. While the AnnData format is the community standard for storing single-cell datasets, existing data loading solutions for AnnData are often inadequate: some require loading all data into memory, others convert to dense formats that increase storage demands, and many are hampered by slow random disk access…