PuSH - Publication Server of Helmholtz Zentrum München

Fernández, J.G.* ; Pfundner, A.* ; Garcia Perez, C.

Bridging taxonomic gaps in microbial community profiling with LSTM-generated synthetic full-length 16S rRNA sequences.

ISME Commun., DOI: 10.1093/ismeco/ycag112 (2026)
Publisher's version DOI
Open Access Gold
Creative Commons license
Microbial community profiling relies on comprehensive reference databases, yet full-length 16S rRNA amplicons remain sparse for many bacterial taxa. We present SGenerator, a neural network-based data augmentation method that generates biologically informative, full-length (1500 bp) 16S rRNA sequences for underrepresented genera. Combining time series forecasting and natural language processing, SGenerator uses an LSTM architecture with a sliding-window approach and n-gram segmentation to generate full-length amplicons. Trained on a subset of 2,289 sequences from 50 different unbalanced genera of a total of 184,732 high-quality sequences from the RiboGrove database, it produced 500 synthetic sequences per genus across 50 genera. BLASTn validation showed that an average of 300 sequences per genus closely matched native entries, and R2DT analysis confirmed that an average of 244 per genus folded into canonical 16S rRNA secondary structures, indicating strong biological fidelity. Classifiers trained on the augmented datasets achieved F1 and MCC scores of 0.90 on ITGDB and 0.75 on the more specialized MiDAS dataset, with k-mer embeddings slightly outperforming transformer-based representations. These results demonstrate that LSTM-driven sequence generation can effectively fill taxonomic gaps in full-length amplicon databases and overcome hypervariable region biases in short-read data, and that it has the potential to enhance microbial profiling accuracy in ecological studies.
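The abstract names n-gram segmentation and a sliding-window approach but gives no parameters. As a rough illustration only, the sketch below shows one common way such preprocessing turns a nucleotide sequence into (context, next-token) training pairs for a sequence model such as an LSTM. The function names, the overlapping 3-gram tokenization, the window length of 10, and the toy fragment are all assumptions made for this example and are not taken from the paper.

```python
from typing import List, Tuple


def ngram_tokens(seq: str, n: int = 3) -> List[str]:
    """Split a nucleotide sequence into overlapping n-grams (stride 1).

    Assumption: overlapping tokens with n = 3; the paper's actual
    segmentation scheme is not stated in the abstract.
    """
    seq = seq.upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def sliding_windows(tokens: List[str], window: int = 10) -> List[Tuple[List[str], str]]:
    """Pair each run of `window` consecutive tokens with the token that follows it.

    Assumption: a window of 10 tokens, chosen only for illustration.
    """
    return [
        (tokens[i:i + window], tokens[i + window])
        for i in range(len(tokens) - window)
    ]


if __name__ == "__main__":
    # Toy 20-nt fragment used purely as example input (hypothetical data).
    fragment = "AGAGTTTGATCCTGGCTCAG"
    tokens = ngram_tokens(fragment, n=3)
    pairs = sliding_windows(tokens, window=10)
    context, target = pairs[0]
    print(len(tokens), len(pairs))
    print(context, "->", target)
```

Each pair supplies a fixed-length context and the token to be predicted next, which is the same framing used in time series forecasting and next-word language modeling. A generator trained this way can extend a seed window one token at a time until the desired sequence length is reached.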
Publication type Article: Journal article
Document type Scientific article
Keywords Hypervariable Region ; Amplicon ; 16S Ribosomal RNA ; Amplicon Sequencing ; Profiling (Computer Programming) ; Taxonomic Rank ; Biological Classification
ISSN (print) / ISBN 2730-6151
e-ISSN 2730-6151
Journal ISME Communications
Publisher Springer
Review status Peer reviewed
Institute(s) Strategy and Digitalization (DIG)