Aberrant splicing is a major cause of genetic disorders, but its direct detection in transcriptomes is limited to clinically accessible tissues such as skin or body fluids. While DNA-based machine learning models can prioritize rare variants for their effect on splicing, their performance in predicting tissue-specific aberrant splicing remains unassessed. Here we generated an aberrant splicing benchmark dataset spanning over 8.8 million rare variants in 49 human tissues from the Genotype-Tissue Expression (GTEx) dataset. At 20% recall, state-of-the-art DNA-based models achieve at most 12% precision. By mapping and quantifying tissue-specific splice site usage transcriptome-wide and modeling isoform competition, we increased precision threefold at the same recall. Integrating RNA-sequencing data from clinically accessible tissues into our model, AbSplice, brought precision to 60%. These results, replicated in two independent cohorts, substantially contribute to the identification of noncoding loss-of-function variants and to the design and analysis of genetic diagnostics.
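The headline numbers above are precisions reported at a fixed recall of 20%. As an illustration of how such a figure can be read off a ranked list of predictions, the following is a minimal sketch and not the authors' evaluation code. The function name `precision_at_recall`, the toy labels and the toy scores are all hypothetical; the sketch assumes binary labels (1 = aberrant splicing observed) and scores where higher means more likely aberrant.

```python
def precision_at_recall(labels, scores, target_recall=0.2):
    """Precision at the loosest score cutoff needed to first reach target_recall.

    labels: iterable of 0/1 (1 = positive); scores: iterable of floats.
    Hypothetical helper for illustration only.
    """
    labels = list(labels)
    scores = list(scores)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank predictions from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])

    true_pos = 0
    for i, (score, label) in enumerate(ranked):
        true_pos += label
        # Tied scores share one cutoff, so evaluate only at the end of a tie group.
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        recall = true_pos / total_pos
        if recall >= target_recall:
            return true_pos / (i + 1)  # precision among the top (i + 1) calls

    return true_pos / len(ranked)


if __name__ == "__main__":
    # Toy data (made up): 10 variants, 3 of which are true positives.
    toy_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 1]
    toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
    print(precision_at_recall(toy_labels, toy_scores, target_recall=0.2))
```

In a benchmark with very few positives among millions of rare variants, the same calculation would be applied per model to the full ranked variant list, which is why precision at a fixed recall is a more informative summary there than accuracy.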
Funding: NINDS; NIMH; Deutsche Forschungsgemeinschaft (DFG, German Research Foundation); NIDA; NHLBI; NHGRI; NCI; Common Fund of the Office of the Director of the National Institutes of Health; Helmholtz Association; EJP RD project GENOMIT; ERA PerMed project PerMiM; German Network for Mitochondrial Disorders; German Bundesministerium für Bildung und Forschung (BMBF)