TY - JOUR AB - MOTIVATION: Recent pandemics have revealed significant gaps in our understanding of viral pathogenesis, exposing an urgent need for methods to identify and prioritize key host proteins (host factors) as potential targets for antiviral treatments. De novo generation of experimental datasets is limited by their heterogeneity, and for looming future pandemics, may not be feasible due to limitations of experimental approaches. RESULTS: Here we present TransFactor, a computational framework for predicting and prioritizing candidate host factors using only protein sequence data. It leverages the pre-trained ESM-2 protein language model, fine-tuned on a limited set of experimentally determined host factors aggregated from 33 independent SARS-CoV-2 studies. TransFactor outperforms machine and deep learning baselines and its predictions align with Gene Ontology enrichments of known host factors, but also provide interpretability through a computational alanine scan, enabling the identification of pro-viral protein domains such as COMM, PX, and RRM, that may be used to direct experimental investigations of virus biology and guide rational design of antiviral therapies. Our findings demonstrate the potential of transformer-based models to advance host factor prediction, providing a framework extendable to orthogonal input modalities and other infectious diseases, enhancing our preparedness for current and future viral threats. AVAILABILITY: Source code is available at https://github.com/marsico-lab/TransFactor. A full reproducibility package, including code, trained models, and data, is archived on Zenodo (https://doi.org/10.5281/zenodo.16793684). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - An, Y. AU - Bergant, V.* AU - Grünke, C.* AU - Bonnal, B.* AU - Henrici, A.* AU - Pichlmair, A. AU - Schubert, B. AU - Marsico, A. 
C1 - 75536 C2 - 58234 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - TransFactor-Prediction of pro-viral SARS-CoV-2 host factors using a protein language model. JO - Bioinformatics VL - 41 IS - 9 PB - Oxford Univ Press PY - 2025 ER - TY - JOUR AB - MOTIVATION: Spatially resolved chromatin accessibility profiling offers the potential to investigate gene regulatory processes within the spatial context of tissues. However, current methods typically work at spot resolution, aggregating measurements from multiple cells, thereby obscuring cell-type-specific spatial patterns of accessibility. Spot deconvolution methods have been developed and extensively benchmarked for spatial transcriptomics, yet no dedicated methods exist for spatial chromatin accessibility, and it is unclear if RNA-based approaches are applicable to that modality. RESULTS: Here, we demonstrate that these RNA-based approaches can be applied to spot-based chromatin accessibility data by a systematic evaluation of five top-performing spatial transcriptomics deconvolution methods. To assess performance, we developed a simulation framework that generates both transcriptomic and accessibility spot data from dissociated single-cell and targeted multiomic datasets, enabling direct comparisons across both data modalities. Our results show that Cell2location and RCTD, in contrast to other methods, exhibit robust performance on spatial chromatin accessibility data, achieving accuracy comparable to RNA-based deconvolution. Generally, we observed that RNA-based deconvolution exhibited slightly better performance compared to chromatin accessibility-based deconvolution, especially for resolving rare cell types, indicating room for future development of specialized methods. In conclusion, our findings demonstrate that existing deconvolution methods can be readily applied to chromatin accessibility-based spatial data. 
Our work provides a simulation framework and establishes a performance baseline to guide the development and evaluation of methods optimized for spatial epigenomics. AVAILABILITY AND IMPLEMENTATION: All methods, simulation frameworks, peak selection strategies, analysis notebooks and scripts are available at https://github.com/theislab/deconvATAC. AU - Ouologuem, S. AU - Martens, L.D. AU - Schaar, A. AU - Shulman, M. AU - Gagneur, J. AU - Theis, F.J. C1 - 75143 C2 - 57836 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - i314-i322 TI - Spatial transcriptomics deconvolution methods generalize well to spatial chromatin accessibility data. JO - Bioinformatics VL - 41 IS - Supplement_1 PB - Oxford Univ Press PY - 2025 ER - TY - JOUR AB - SUMMARY: Spatial omics technologies are increasingly leveraged to characterize how disease disrupts tissue organization and cellular niches. While multiple methods to analyze spatial variation within a sample have been published, statistical and computational approaches to compare cell spatial organization across samples or conditions are mostly lacking. We present GraphCompass, a comprehensive set of omics-adapted graph analysis methods to quantitatively evaluate and compare the spatial arrangement of cells in samples representing diverse biological conditions. GraphCompass builds upon the Squidpy spatial omics toolbox and encompasses various statistical approaches to perform cross-condition analyses at the level of individual cell types, niches, and samples. Additionally, GraphCompass provides custom visualization functions that enable effective communication of results. We demonstrate how GraphCompass can be used to address key biological questions, such as how cellular organization and tissue architecture differ across various disease states and which spatial patterns correlate with a given pathological condition. 
GraphCompass can be applied to various popular omics techniques, including, but not limited to, spatial proteomics (e.g. MIBI-TOF), spot-based transcriptomics (e.g. 10× Genomics Visium), and single-cell resolved transcriptomics (e.g. Stereo-seq). In this work, we showcase the capabilities of GraphCompass through its application to three different studies that may also serve as benchmark datasets for further method development. With its easy-to-use implementation, extensive documentation, and comprehensive tutorials, GraphCompass is accessible to biologists with varying levels of computational expertise. By facilitating comparative analyses of cell spatial organization, GraphCompass promises to be a valuable asset in advancing our understanding of tissue function in health and disease. AU - Ali, M. AU - Kuijs, M. AU - Hediyeh-Zadeh, S. AU - Treis, T. AU - Hrovatin, K. AU - Palla, G. AU - Schaar, A. AU - Theis, F.J. C1 - 70944 C2 - 55982 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - i548-i557 TI - GraphCompass: Spatial metrics for differential analyses of cell organization across conditions. JO - Bioinformatics VL - 40 IS - Supplement_1 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: High dimensional single-cell mass cytometry data are confounded by unwanted covariance due to variations in cell size and staining efficiency, making analysis and interpretation challenging. RESULTS: We present RUCova, a novel method designed to address confounding factors in mass cytometry data. RUCova removes unwanted covariance from measured markers applying multivariate linear regression based on Surrogates of sources Unwanted Covariance (SUCs) and principal component analysis (PCA). We exemplify the use of RUCova and show that it effectively removes unwanted covariance while preserving genuine biological signals. 
Our results demonstrate the efficacy of RUCova in elucidating complex data patterns, facilitating the identification of activated signalling pathways, and improving the classification of important cell populations such as apoptotic cells. By providing a robust framework for data normalization and interpretation, RUCova enhances the accuracy and reliability of mass cytometry analyses, contributing to advances in our understanding of cellular biology and disease mechanisms. AVAILABILITY AND IMPLEMENTATION: The R package is available on https://github.com/molsysbio/RUCova. Detailed documentation, data, and the code required to reproduce the results are available on https://doi.org/10.5281/zenodo.10913464. SUPPLEMENTARY INFORMATION: Available at Bioinformatics online (PDF). AU - Astaburuaga-García, R.* AU - Sell, T.* AU - Mutlu, S.* AU - Sieber, A.* AU - Lauber, K. AU - Blüthgen, N.* C1 - 72500 C2 - 56635 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - RUCova: Removal of unwanted covariance in mass cytometry data. JO - Bioinformatics VL - 40 IS - 11 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: Quantitative dynamical models facilitate the understanding of biological processes and the prediction of their dynamics. The parameters of these models are commonly estimated from experimental data. Yet, experimental data generated from different techniques do not provide direct information about the state of the system but a nonlinear (monotonic) transformation of it. For such semi-quantitative data, when this transformation is unknown, it is not apparent how the model simulations and the experimental data can be compared. RESULTS: We propose a versatile spline-based approach for the integration of a broad spectrum of semi-quantitative data into parameter estimation. We derive analytical formulas for the gradients of the hierarchical objective function and show that this substantially increases the estimation efficiency. 
Subsequently, we demonstrate that the method allows for the reliable discovery of unknown measurement transformations. Furthermore, we show that this approach can significantly improve the parameter inference based on semi-quantitative data in comparison to available methods. AVAILABILITY AND IMPLEMENTATION: Modelers can easily apply our method by using our implementation in the open-source Python Parameter EStimation TOolbox (pyPESTO) available at https://github.com/ICB-DCM/pyPESTO. AU - Dorešić, D. AU - Grein, S.* AU - Hasenauer, J. C1 - 70943 C2 - 55981 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - i558-i566 TI - Efficient parameter estimation for ODE models of cellular processes using semi-quantitative data. JO - Bioinformatics VL - 40 IS - Supplement_1 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - SUMMARY: Accurate clustering of mixed data, encompassing binary, categorical, and continuous variables, is vital for effective patient stratification in clinical questionnaire analysis. To address this need, we present longmixr, a comprehensive R package providing a robust framework for clustering mixed longitudinal data using finite mixture modeling techniques. By incorporating consensus clustering, longmixr ensures reliable and stable clustering results. Moreover, the package includes a detailed vignette that facilitates cluster exploration and visualization. AVAILABILITY AND IMPLEMENTATION: The R package is freely available at https://cran.r-project.org/package=longmixr with detailed documentation, including a case vignette, at https://cellmapslab.github.io/longmixr/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Hagenberg, J. AU - Budde, M.* AU - Pandeva, T. AU - Kondofersky, I. AU - Schaupp, S.K.* AU - Theis, F.J. AU - Schulze, T.G.* AU - Müller, N.S. AU - Heilbronner, U.* AU - Batra, R. AU - Knauer-Arloth, J. 
C1 - 70247 C2 - 55464 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - longmixr: A tool for robust clustering of high-dimensional cross-sectional and longitudinal variables of mixed data types. JO - Bioinformatics VL - 40 IS - 4 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: Pangenome graphs offer a comprehensive way of capturing genomic variability across multiple genomes. However, current construction methods often introduce biases, excluding complex sequences or relying on references. The PanGenome Graph Builder (PGGB) addresses these issues. To date, though, there is no state-of-the-art pipeline allowing for easy deployment, efficient and dynamic use of available resources, and scalable usage at the same time. RESULTS: To overcome these limitations, we present nf-core/pangenome, a reference-unbiased approach implemented in Nextflow following nf-core's best practices. Leveraging biocontainers ensures portability and seamless deployment in HPC environments. Unlike PGGB, nf-core/pangenome distributes alignments across cluster nodes, enabling scalability. Demonstrating its efficiency, we constructed pangenome graphs for 1000 human chromosome 19 haplotypes and 2146 E. coli sequences, achieving a two to threefold speedup compared to PGGB without increasing greenhouse gas emissions. AVAILABILITY: Nf-core/pangenome is released under the MIT open-source license, available on GitHub and Zenodo, with documentation accessible at https://nf-co.re/pangenome/1.1.2/docs/usage. SUPPLEMENTARY: Supplementary data are available at Bioinformatics online. AU - Heumos, S.* AU - Heuer, M.L.* AU - Hanssen, F.* AU - Heumos, L. AU - Guarracino, A.* AU - Heringer, P.* AU - Ehmele, P. AU - Prins, P.* AU - Garrison, E.* AU - Nahnsen, S.* C1 - 72002 C2 - 56542 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - Cluster-efficient pangenome graph construction with nf-core/pangenome. 
JO - Bioinformatics VL - 40 IS - 11 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: In recent years, many algorithms for inferring gene regulatory networks from single-cell transcriptomic data have been published. Several studies have evaluated their accuracy in estimating the presence of an interaction between pairs of genes. However, these benchmarking analyses do not quantify the algorithms' ability to capture structural properties of networks, which are fundamental, for example, for studying the robustness of a gene network to external perturbations. Here, we devise a three-step benchmarking pipeline called STREAMLINE that quantifies the ability of algorithms to capture topological properties of networks and identify hubs. RESULTS: To this aim, we use data simulated from different types of networks as well as experimental data from three different organisms. We apply our benchmarking pipeline to four inference algorithms and provide guidance on which algorithm should be used depending on the global network property of interest. AVAILABILITY AND IMPLEMENTATION: STREAMLINE is available at https://github.com/ScialdoneLab/STREAMLINE. The data generated in this study are available at https://doi.org/10.5281/zenodo.10710444. CONTACT: Direct inquiries should be addressed to the corresponding authors. SUPPLEMENTARY INFORMATION: Supplementary Information is available online. AU - Stock, M. AU - Popp, N. AU - Fiorentino, J. AU - Scialdone, A. C1 - 70543 C2 - 55662 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - Topological benchmarking of algorithms to infer Gene Regulatory Networks from Single-Cell RNA-seq Data. JO - Bioinformatics VL - 40 IS - 5 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: Accurate prediction of RNA subcellular localisation plays an important role in understanding cellular processes and functions. 
Although post-transcriptional processes are governed by trans-acting RNA binding proteins (RBPs) through interaction with cis-regulatory RNA motifs, current methods do not incorporate RBP-binding information. RESULTS: In this paper, we propose DeepLocRNA, an interpretable deep-learning model that leverages a pre-trained multi-task RBP-binding prediction model to predict the subcellular localisation of RNA molecules via fine-tuning. We constructed DeepLocRNA using a comprehensive dataset with variant RNA types and evaluated it on the held-out dataset. Our model achieved state-of-the-art performance in predicting RNA subcellular localisation in mRNA and miRNA. It has also demonstrated great generalization capabilities, performing well on both human and mouse RNA. Additionally, a motif analysis was performed to enhance the interpretability of the model, highlighting signal factors that contributed to the predictions. The proposed model provides general and powerful prediction abilities for different RNA types and species, offering valuable insights into the localisation patterns of RNA molecules and contributing to our understanding of cellular processes at the molecular level. A user-friendly web server is available at: https://biolib.com/KU/DeepLocRNA/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Wang, J.* AU - Horlacher, M. AU - Cheng, L.* AU - Winther, O.* C1 - 69880 C2 - 55301 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - DeepLocRNA: An interpretable deep learning model for predicting RNA subcellular localisation with domain-specific transfer-learning. JO - Bioinformatics VL - 40 IS - 2 PB - Oxford Univ Press PY - 2024 ER - TY - JOUR AB - MOTIVATION: Biological tissues are dynamic and highly organized. Multi-scale models are helpful tools to analyse and understand the processes determining tissue dynamics. 
These models usually depend on parameters that need to be inferred from experimental data to achieve a quantitative understanding, to predict the response to perturbations, and to evaluate competing hypotheses. However, even advanced inference approaches such as approximate Bayesian computation (ABC) are difficult to apply due to the computational complexity of the simulation of multi-scale models. Thus, there is a need for a scalable pipeline for modeling, simulating, and parameterizing multi-scale models of multi-cellular processes. RESULTS: Here, we present FitMultiCell, a computationally efficient and user-friendly open-source pipeline that can handle the full workflow of modeling, simulating, and parameterizing for multi-scale models of multi-cellular processes. The pipeline is modular and integrates the modeling and simulation tool Morpheus and the statistical inference tool pyABC. The easy integration of high-performance infrastructure allows scaling to computationally expensive problems. The introduction of a novel standard for the formulation of parameter inference problems for multi-scale models additionally ensures reproducibility and reusability. By applying the pipeline to multiple biological problems, we demonstrate its broad applicability, which will benefit in particular image-based systems biology. AVAILABILITY AND IMPLEMENTATION: FitMultiCell is available open-source at https://gitlab.com/fitmulticell/fit. AU - Alamoudi, E.* AU - Schälte, Y. AU - Müller, R.* AU - Starruß, J.* AU - Bundgaard, N.* AU - Graw, F.* AU - Brusch, L.* AU - Hasenauer, J. C1 - 68760 C2 - 54970 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - FitMultiCell: simulating and parameterizing computational models of multi-scale and multi-cellular processes. JO - Bioinformatics VL - 39 IS - 11 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - MOTIVATION: Identifying regulatory regions in the genome is of great interest for understanding the epigenomic landscape in cells. 
One fundamental challenge in this context is to find the target genes whose expression is affected by the regulatory regions. A recent successful method is the Activity-By-Contact (ABC) model (Fulco et al., 2019) which scores enhancer-gene interactions based on enhancer activity and the contact frequency of an enhancer to its target gene. However, it describes regulatory interactions entirely from a gene's perspective, and does not account for all the candidate target genes of an enhancer. In addition, the ABC-model requires two types of assays to measure enhancer activity, which limits the applicability. Moreover, there is no implementation available that could allow for an integration with transcription factor (TF) binding information nor an efficient analysis of single-cell data. RESULTS: We demonstrate that the ABC-score can yield a higher accuracy by adapting the enhancer activity according to the number of contacts the enhancer has to its candidate target genes and also by considering all annotated transcription start sites of a gene. Further, we show that the model is comparably accurate with only one assay to measure enhancer activity. We combined our generalised ABC-model (gABC) with TF binding information and illustrate an analysis of a single-cell ATAC-seq data set of the human heart, where we were able to characterise cell type-specific regulatory interactions and predict gene expression based on transcription factor affinities. All executed processing steps are incorporated into our new computational pipeline STARE. AVAILABILITY: The software is available at https://github.com/schulzlab/STARE. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Hecker, D.* AU - Behjati Ardakani, F.* AU - Karollus, A.* AU - Gagneur, J. AU - Schulz, M.H.* C1 - 67338 C2 - 54178 TI - The adapted activity-by-contact model for enhancer-gene assignment and its application to single-cell data. 
JO - Bioinformatics VL - 39 IS - 2 PY - 2023 ER - TY - JOUR AB - MOTIVATION: Machine learning has shown extensive growth in recent years and is now routinely applied to sensitive areas. To allow appropriate verification of predictive models before deployment, models must be deterministic. Solely fixing all random seeds is not sufficient for deterministic machine learning, as major machine learning libraries default to the usage of nondeterministic algorithms based on atomic operations. RESULTS: Various machine learning libraries released deterministic counterparts to the nondeterministic algorithms. We evaluated the effect of these algorithms on determinism and runtime. Based on these results, we formulated a set of requirements for deterministic machine learning and developed a new software solution, the mlf-core ecosystem, which aids machine learning projects to meet and keep these requirements. We applied mlf-core to develop deterministic models in various biomedical fields including a single-cell autoencoder with TensorFlow, a PyTorch-based U-Net model for liver-tumor segmentation in computed tomography scans, and a liver cancer classifier based on gene expression profiles with XGBoost. AVAILABILITY AND IMPLEMENTATION: The complete data together with the implementations of the mlf-core ecosystem and use case models are available at https://github.com/mlf-core. AU - Heumos, L. AU - Ehmele, P.* AU - Kuhn Cuellar, L.* AU - Menden, K.* AU - Miller, E.* AU - Lemke, S.* AU - Gabernet, G.* AU - Nahnsen, S.* C1 - 67638 C2 - 53945 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - mlf-core: A framework for deterministic machine learning. JO - Bioinformatics VL - 39 IS - 4 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - MOTIVATION: Federated Learning (FL) is gaining traction in various fields as it enables integrative data analysis without sharing sensitive data, such as in healthcare. 
However, the risk of data leakage caused by malicious attacks must be considered. In this study, we introduce a novel attack algorithm that relies on being able to compute sample means, sample covariances, and construct known linearly independent vectors on the data owner side. RESULTS: We show that these basic functionalities, which are available in several established FL frameworks, are sufficient to reconstruct privacy-protected data. Additionally, the attack algorithm is robust to defense strategies that involve adding random noise. We demonstrate the limitations of existing frameworks and propose potential defense strategies analyzing the implications of using differential privacy. The novel insights presented in this study will aid in the improvement of FL frameworks. AVAILABILITY AND IMPLEMENTATION: The code examples are provided at GitHub (https://github.com/manuhuth/Data-Leakage-From-Covariances.git). The CNSIM1 dataset, which we used in the manuscript, is available within the DSData R package (https://github.com/datashield/DSData/tree/main/data). AU - Huth, M. AU - Arruda, J.* AU - Gusinow, R. AU - Contento, L.* AU - Tacconelli, E.* AU - Hasenauer, J.* C1 - 68617 C2 - 54761 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - Accessibility of covariance information creates vulnerability in Federated Learning frameworks. JO - Bioinformatics VL - 39 IS - 9 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - SUMMARY: mpwR is an R package for a standardized comparison of mass spectrometry (MS)-based proteomic label-free workflows recorded by data-dependent or data-independent spectral acquisition. The user-friendly design allows easy access to compare the influence of sample preparation procedures, combinations of liquid chromatography (LC)-MS setups, as well as intra- and inter-software differences on critical performance measures across an unlimited number of analyses. 
mpwR supports outputs of commonly used software for bottom-up proteomics, such as ProteomeDiscoverer, Spectronaut, MaxQuant, and DIA-NN. AVAILABILITY AND IMPLEMENTATION: mpwR is available as an open-source R package. Release versions can be accessed on CRAN (https://CRAN.R-project.org/package=mpwR) for all major operating systems. The development version is maintained on GitHub (https://github.com/okdll/mpwR) and full documentation with examples and workflow templates is provided via the package website (https://okdll.github.io/mpwR/). AU - Kardell, O. AU - Breimann, S.* AU - Hauck, S.M. C1 - 67858 C2 - 54336 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - mpwR: An R package for comparing performance of mass spectrometry-based proteomic workflows. JO - Bioinformatics VL - 39 IS - 6 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - SUMMARY: Mechanistic models are important tools to describe and understand biological processes. However, they typically rely on unknown parameters, the estimation of which can be challenging for large and complex systems. pyPESTO is a modular framework for systematic parameter estimation, with scalable algorithms for optimization and uncertainty quantification. While tailored to ordinary differential equation problems, pyPESTO is broadly applicable to black-box parameter estimation problems. Besides own implementations, it provides a unified interface to various popular simulation and inference methods. AVAILABILITY AND IMPLEMENTATION: pyPESTO is implemented in Python, open-source under a 3-Clause BSD license. Code and documentation are available on GitHub (https://github.com/icb-dcm/pypesto). AU - Schälte, Y. AU - Fröhlich, F.* AU - Jost, P.J.* AU - Vanhoefer, J.* AU - Pathirana, D.* AU - Stapor, P. AU - Lakrisenko, P. AU - Wang, D. AU - Raimundez-Alvarez, E. AU - Merkt, S.* AU - Schmiester, L. AU - Städter, P. AU - Grein, S.* AU - Dudkin, E.* AU - Doresic, D.* AU - Weindl, D. AU - Hasenauer, J. 
C1 - 68911 C2 - 53763 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - pyPESTO: A modular and scalable tool for parameter estimation for dynamic models. JO - Bioinformatics VL - 39 IS - 11 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - MOTIVATION: Somatic mutations are usually called by analysing the DNA sequence of a tumor sample in conjunction with a matched normal. However, a matched normal is not always available, for instance, in retrospective analysis or diagnostic settings. For such cases, tumor-only somatic variant calling tools need to be designed. Previously proposed approaches demonstrate inferior performance on whole genome sequencing (WGS) samples. RESULTS: We present the convolutional neural network-based approach called DeepSom for detecting somatic single nucleotide polymorphism (SNP) and short insertion and deletion (INDEL) variants in tumor WGS samples without a matched normal. We validate DeepSom by reporting its performance on 5 different cancer datasets. We also demonstrate that on WGS samples DeepSom outperforms previously proposed methods for tumor-only somatic variant calling. AVAILABILITY: DeepSom is available as a GitHub repository at https://github.com/heiniglab/DeepSom. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Vilov, S. AU - Heinig, M. C1 - 67203 C2 - 54214 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - DeepSom: A CNN-based approach to somatic variant calling in WGS samples without a matched normal. JO - Bioinformatics VL - 39 IS - 1 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - MOTIVATION: Plasma ionization is rapidly gaining popularity for mass spectrometry (MS)-based studies of volatiles and aerosols. However, data from plasma ionization are delicate to interpret as competing ionization pathways in the plasma create numerous ion species. 
There is no tool for detection of adducts and in-source fragments from plasma ionization data yet, which makes data evaluation ambiguous. SUMMARY: We developed DBDIpy, a Python library for processing and formal analysis of untargeted, time-sensitive plasma ionization MS datasets. Its core functionality lies in the identification of in-source fragments and identification of rivaling ionization pathways of the same analytes in time-sensitive datasets. It further contains elementary functions for processing of untargeted metabolomics data and interfaces to an established ecosystem for analysis of MS data in Python. AVAILABILITY AND IMPLEMENTATION: DBDIpy is implemented in Python (Version ≥ 3.7) and can be downloaded from PyPI, the Python package repository (https://pypi.org/project/DBDIpy), or from GitHub (https://github.com/leopold-weidner/DBDIpy). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Weidner, L. AU - Hemmler, D. AU - Rychlik, M.* AU - Schmitt-Kopplin, P. C1 - 67477 C2 - 54131 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - DBDIpy: A Python library for processing of untargeted datasets from real-time plasma ionization mass spectrometry. JO - Bioinformatics VL - 39 IS - 2 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - SUMMARY: To allow the comprehensive histological analysis of the whole intestine, it is often rolled to a spiral before imaging. This Swiss-rolling technique facilitates robust experimental procedures, but it limits the possibilities to comprehend changes along the intestine. Here, we present IntestLine, a Shiny-based open-source application for processing imaging data of (rolled) intestinal tissues and subsequent mapping onto a line. The visualization of the mapped data facilitates the assessment of the whole intestine in both proximal-distal and serosa-luminal axis, and enables the observation of location-specific cell types and markers. 
Accordingly, IntestLine can serve as a tool to characterize the intestine in multi-modal imaging studies. AVAILABILITY AND IMPLEMENTATION: Source code can be found at Zenodo (https://doi.org/10.5281/zenodo.7081864) and GitHub (https://github.com/SchlitzerLab/IntestLine). AU - Yuzeir, A.* AU - Bejarano, D.A.* AU - Grein, S.* AU - Hasenauer, J. AU - Schlitzer, A.* AU - Yu, J.* C1 - 67705 C2 - 54012 CY - Great Clarendon St, Oxford Ox2 6dp, England TI - IntestLine: A shiny-based application to map the rolled intestinal tissue onto a line. JO - Bioinformatics VL - 39 IS - 4 PB - Oxford Univ Press PY - 2023 ER - TY - JOUR AB - This paper presents maplet, an open-source R package for the creation of highly customizable, fully reproducible statistical pipelines for metabolomics data analysis. It builds on the SummarizedExperiment data structure to create a centralized pipeline framework for storing data, analysis steps, results, and visualizations. maplet's key design feature is its modularity, which offers several advantages, such as ensuring code quality through the maintenance of individual functions and promoting collaborative development by removing technical barriers to code contribution. With over 90 functions, the package includes a wide range of functionalities, covering many widely used statistical approaches and data visualization techniques. AVAILABILITY: The maplet package is implemented in R and freely available at https://github.com/krumsieklab/maplet. AU - Chetnik, K.* AU - Benedetti, E.* AU - Gomari, D.P. AU - Schweickart, A.* AU - Batra, R.* AU - Buyukozkan, M.* AU - Wang, Z.* AU - Arnold, M. AU - Zierer, J.* AU - Suhre, K.* AU - Krumsiek, J.* C1 - 63448 C2 - 51537 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 1168-1170 TI - maplet: An extensible R toolbox for modular and reproducible metabolomics pipelines. 
JO - Bioinformatics VL - 38 IS - 4 PB - Oxford Univ Press PY - 2022 ER - TY - JOUR AB - MOTIVATION: A key process in anti-viral adaptive immunity is that the Human Leukocyte Antigen system (HLA) presents epitopes as Major Histocompatibility Complex I (MHC I) protein-peptide complexes on cell surfaces and in this way alerts CD8+ cytotoxic T-Lymphocytes (CTLs). This pathway exerts strong selection pressure on viruses, favoring viral mutants that escape recognition by the HLA/CTL system. Naturally, such immune escape mutations often emerge in highly variable viruses, e.g. HIV or HBV, as HLA-associated mutations (HAMs), specific to the host's MHC I proteins. The reliable identification of HAMs is not only important for understanding viral genomes and their evolution, but it also impacts the development of broadly effective anti-viral treatments and vaccines against variable viruses. By their very nature, HAMs are amenable to detection by statistical methods in paired sequence/HLA data. However, HLA alleles are very polymorphic in the human host population, which makes the available data relatively sparse and noisy. Under these circumstances, one way to optimize HAM detection is to integrate all relevant information in a coherent model. Bayesian inference offers a principled approach to achieve this. RESULTS: We present a new Bayesian regression model for the detection of HAMs that integrates a sparsity-inducing prior, epitope predictions, and phylogenetic bias assessment, and that yields easily interpretable quantitative information on HAM candidates. The model predicts experimentally confirmed HAMs as having high posterior probabilities, and it performs well in comparison to state-of-the-art models for several data sets from individuals infected with HBV, HDV, and HIV. AVAILABILITY: The source code of this software is available at https://github.com/HAMdetector/Escape.jl under a permissive MIT license. 
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Habermann, D.* AU - Kharimzadeh, H.* AU - Walker, A.* AU - Li, Y.* AU - Yang, R.* AU - Kaiser, R.* AU - Brumme, Z.L.* AU - Timm, J.* AU - Roggendorf, M. AU - Hoffmann, D.* C1 - 64525 C2 - 52250 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 2428-2436 TI - HAMdetector: A Bayesian regression model that integrates information to detect HLA-associated mutations. JO - Bioinformatics VL - 38 IS - 9 PB - Oxford Univ Press PY - 2022 ER - TY - JOUR AB - MOTIVATION: Pathway annotation tools are indispensable for the interpretation of a wide range of experiments in life sciences. Network-based algorithms have recently been developed which are more sensitive than traditional overlap-based algorithms, but there is still a lack of good online tools for network-based pathway analysis. RESULTS: We present PathwAX II-a pathway analysis web tool based on network crosstalk analysis using the BinoX algorithm. It offers several new features compared to the first version, including interactive graphical network visualization of the crosstalk between a query gene set and an enriched pathway, and the addition of Reactome pathways. AVAILABILITY: PathwAX II is available at http://pathwax.sbc.su.se. SUPPLEMENTARY INFORMATION: Supplementary materials are available at Bioinformatics online. AU - Ogris, C. AU - Castresana-Aguirre, M.* AU - Sonnhammer, E.L.L.* C1 - 64653 C2 - 52370 SP - 2659-2660 TI - PathwAX II: Network-based pathway analysis with interactive visualization of network crosstalk. JO - Bioinformatics VL - 38 IS - 9 PY - 2022 ER - TY - JOUR AB - SUMMARY: We present MobilityTransformR, an R/Bioconductor package for the effective mobility scaling of capillary zone electrophoresis-mass spectrometry (CE-MS) data. 
It uses functionality from different R packages that are frequently used for data processing and analysis in MS-based metabolomics workflows, allowing the subsequent use of reproducible transformed CE-MS data in existing workflows. AVAILABILITY AND IMPLEMENTATION: MobilityTransformR is implemented in R (Version ≥ 4.2) and can be downloaded directly from the Bioconductor database (https://bioconductor.org/packages/MobilityTransformR) or GitHub (https://github.com/LiesaSalzer/MobilityTransformR). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Salzer, L. AU - Witting, M. AU - Schmitt-Kopplin, P. C1 - 65669 C2 - 52880 SP - 4044-4045 TI - MobilityTransformR: An R package for effective mobility transformation of CE-MS data. JO - Bioinformatics VL - 38 IS - 16 PY - 2022 ER - TY - JOUR AB - SUMMARY: Ordinary differential equation models facilitate the understanding of cellular signal transduction and other biological processes. However, for large and comprehensive models, the computational cost of simulating or calibrating can be limiting. AMICI is a modular toolbox implemented in C++/Python/MATLAB that provides efficient simulation and sensitivity analysis routines tailored for scalable, gradient-based parameter estimation and uncertainty quantification. AVAILABILITY: AMICI is published under the permissive BSD-3-Clause license with source code publicly available on https://github.com/AMICI-dev/AMICI. Citeable releases are archived on Zenodo. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Fröhlich, F.* AU - Weindl, D. AU - Schälte, Y. AU - Pathirana, D.* AU - Paszkowski, L.* AU - Lines, G.T.* AU - Stapor, P. AU - Hasenauer, J. C1 - 61728 C2 - 50413 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 3676-3677 TI - AMICI: High-performance sensitivity analysis for large ordinary differential equation models. 
JO - Bioinformatics VL - 37 IS - 20 PB - Oxford Univ Press PY - 2021 ER - TY - JOUR AB - MOTIVATION: Unknown parameters of dynamical models are commonly estimated from experimental data. However, while various efficient optimization and uncertainty analysis methods have been proposed for quantitative data, methods for qualitative data are rare and suffer from bad scaling and convergence. RESULTS: Here, we propose an efficient and reliable framework for estimating the parameters of ordinary differential equation models from qualitative data. In this framework, we derive a semi-analytical algorithm for gradient calculation of the optimal scaling method developed for qualitative data. This enables the use of efficient gradient-based optimization algorithms. We demonstrate that the use of gradient information improves performance of optimization and uncertainty quantification on several application examples. On average, we achieve a speedup of more than one order of magnitude compared to gradient-free optimization. Additionally, in some examples, the gradient-based approach yields substantially improved objective function values and quality of the fits. Accordingly, the proposed framework substantially improves the parameterization of models from qualitative data. AVAILABILITY: The proposed approach is implemented in the open-source Python Parameter EStimation TOolbox (pyPESTO). pyPESTO is available at https://github.com/ICB-DCM/pyPESTO. All application examples and code to reproduce this study are available at https://doi.org/10.5281/zenodo.4507613. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Schmiester, L. AU - Weindl, D. AU - Hasenauer, J. C1 - 62577 C2 - 50954 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 4493-4500 TI - Efficient gradient-based parameter estimation for dynamic models using qualitative data. 
JO - Bioinformatics VL - 37 IS - 23 PB - Oxford Univ Press PY - 2021 ER - TY - JOUR AB - MOTIVATION: Dimensionality reduction is a key step in the analysis of single-cell RNA-sequencing data. It produces a low-dimensional embedding for visualization and as a calculation base for downstream analysis. Nonlinear techniques are most suitable to handle the intrinsic complexity of large, heterogeneous single-cell data. However, with no linear relation between gene and embedding coordinate, there is no way to extract the identity of genes driving any cell's position in the low-dimensional embedding, making it difficult to characterize the underlying biological processes. RESULTS: In this article, we introduce the concepts of local and global gene relevance to compute an equivalent of principal component analysis loadings for non-linear low-dimensional embeddings. Global gene relevance identifies drivers of the overall embedding, while local gene relevance identifies those of a defined sub-region. We apply our method to single-cell RNA-seq datasets from different experimental protocols and to different low-dimensional embedding techniques. This shows our method's versatility to identify key genes for a variety of biological processes. AVAILABILITY AND IMPLEMENTATION: To ensure reproducibility and ease of use, our method is released as part of destiny 3.0, a popular R package for building diffusion maps from single-cell transcriptomic data. It is readily available through Bioconductor. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Angerer, P. AU - Fischer, D.S. AU - Theis, F.J. AU - Scialdone, A. AU - Marr, C. C1 - 58678 C2 - 48293 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 4291-4295 TI - Automatic identification of relevant genes from low-dimensional embeddings of single-cell RNA-seq data. 
JO - Bioinformatics VL - 36 IS - 15 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - MOTIVATION: Conceptually, epitope-based vaccine design poses two distinct problems: (i) selecting the best epitopes to elicit the strongest possible immune response and (ii) arranging and linking them through short spacer sequences to string-of-beads vaccines, so that their recovery likelihood during antigen processing is maximized. Current state-of-the-art approaches solve this design problem sequentially. Consequently, such approaches are unable to capture the inter-dependencies between the two design steps, usually emphasizing theoretical immunogenicity over correct vaccine processing, thus resulting in vaccines with less effective immunogenicity in vivo. RESULTS: In this work, we present a computational approach based on linear programming, called JessEV, that solves both design steps simultaneously, allowing to weigh the selection of a set of epitopes that have great immunogenic potential against their assembly into a string-of-beads construct that provides a high chance of recovery. We conducted Monte Carlo cleavage simulations to show that a fixed set of epitopes often cannot be assembled adequately, whereas selecting epitopes to accommodate proper cleavage requirements substantially improves their recovery probability and thus the effective immunogenicity, pathogen and population coverage of the resulting vaccines by at least 2-fold. AVAILABILITY AND IMPLEMENTATION: The software and the data analyzed are available at https://github.com/SchubertLab/JessEV. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Dorigatti, E. AU - Schubert, B. C1 - 60958 C2 - 49749 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - i643-i650 TI - Joint epitope selection and spacer design for string-of-beads vaccines. 
JO - Bioinformatics VL - 36 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - Motivation: High-throughput phenomic projects generate complex data from small treatment and large control groups that increase the power of the analyses but introduce variation over time. A method is needed to utilize a set of temporally local controls that maximizes analytic power while minimizing noise from unspecified environmental factors. Results: Here we introduce 'soft windowing', a methodological approach that selects a window of time that includes the most appropriate controls for analysis. Using phenotype data from the International Mouse Phenotyping Consortium (IMPC), adaptive windows were applied such that control data collected proximally to mutants were assigned the maximal weight, while data collected earlier or later had less weight. We applied this method to IMPC data and compared the results with those obtained from a standard non-windowed approach. Validation was performed using a resampling approach in which we demonstrate a 10% reduction of false positives from 2.5 million analyses. We applied the method to our production analysis pipeline that establishes genotype-phenotype associations by comparing mutant versus control data. We report an increase of 30% in significant P-values, as well as linkage to 106 versus 99 disease models via phenotype overlap with the soft-windowed and non-windowed approaches, respectively, from a set of 2082 mutant mouse lines. Our method is generalizable and can benefit large-scale human phenomic projects such as the UK Biobank and the All of Us resources. 
AU - Haselimashhadi, H.* AU - Mason, J.C.* AU - Muñoz-Fuentes, V.* AU - López-Gómez, F.* AU - Babalola, K.* AU - Acar, E.F.* AU - Kumar, V.* AU - White, J.* AU - Flenniken, A.M.* AU - King, R.* AU - Straiton, E.* AU - Seavitt, J.R.* AU - Gaspero, A.* AU - Garza, A.* AU - Christianson, A.E.* AU - Hsu, C.-W.* AU - Reynolds, C.L.* AU - Lanza, D.G.* AU - Lorenzo, I.* AU - Green, J.R.* AU - Gallegos, J.J.* AU - Bohat, R.* AU - Samaco, R.C.* AU - Veeraragavan, S.* AU - Kim, J.K.* AU - Miller, G. AU - Fuchs, H. AU - Garrett, L. AU - Becker, L. AU - Kang, Y.K.* AU - Clary, D.* AU - Cho, S.Y.* AU - Tamura, M.* AU - Tanaka, N.* AU - Soo, K.D.* AU - Bezginov, A.* AU - About, G.B.* AU - Champy, M.-F.* AU - Vasseur, L.* AU - Leblanc, S.* AU - Meziane, H.* AU - Selloum, M.* AU - Reilly, P.T.* AU - Spielmann, N. AU - Maier, H. AU - Gailus-Durner, V. AU - Sorg, T.* AU - Hiroshi, M.* AU - Yuichi, O.* AU - Heaney, J.D.* AU - Dickinson, M.E.* AU - Wurst, W. AU - Tocchini-Valentini, G.P.* AU - Lloyd, K.C.K.* AU - McKerlie, C.* AU - Seong, J.K.* AU - Herault, Y.* AU - Hrabě de Angelis, M. AU - Brown, S.D.M.* AU - Smedley, D.* AU - Flicek, P.* AU - Mallon, A.-M.* AU - Parkinson, H.* AU - Meehan, T.F.* C1 - 57037 C2 - 47490 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 1492-1500 TI - Soft windowing application to improve analysis of high-throughput phenotyping data. JO - Bioinformatics VL - 36 IS - 5 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - MOTIVATION: While generative models have shown great success in sampling high-dimensional samples conditional on low-dimensional descriptors (stroke thickness in MNIST, hair color in CelebA, speaker identity in WaveNet), their generation out-of-distribution poses fundamental problems due to the difficulty of learning compact joint distribution across conditions. 
The canonical example of the conditional variational autoencoder (CVAE), for instance, does not explicitly relate conditions during training and, hence, has no explicit incentive of learning such a compact representation. RESULTS: We overcome the limitation of the CVAE by matching distributions across conditions using maximum mean discrepancy in the decoder layer that follows the bottleneck. This introduces a strong regularization both for reconstructing samples within the same condition and for transforming samples across conditions, resulting in much improved generalization. As this amounts to solving a style-transfer problem, we refer to the model as transfer VAE (trVAE). Benchmarking trVAE on high-dimensional image and single-cell RNA-seq, we demonstrate higher robustness and higher accuracy than existing approaches. We also show qualitatively improved predictions by tackling previously problematic minority classes and multiple conditions in the context of cellular perturbation response to treatment and disease based on high-dimensional single-cell gene expression data. For generic tasks, we improve Pearson correlations of high-dimensional estimated means and variances with their ground truths from 0.89 to 0.97 and 0.75 to 0.87, respectively. We further demonstrate that trVAE learns cell-type-specific responses after perturbation and improves the prediction of most cell-type-specific genes by 65%. AVAILABILITY AND IMPLEMENTATION: The trVAE implementation is available via github.com/theislab/trvae. The results of this article can be reproduced via github.com/theislab/trvae_reproducibility. AU - Lotfollahi, M. AU - Naghipourfar, M. AU - Theis, F.J. AU - Wolf, F.A. C1 - 60957 C2 - 49754 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - i610-i617 TI - Conditional out-of-distribution generation for unpaired data using transfer VAE. 
JO - Bioinformatics VL - 36 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - MOTIVATION: Strand-seq is a specialized single-cell DNA sequencing technique centered around the directionality of single-stranded DNA. Computational tools for Strand-seq analyses must capture the strand-specific information embedded in these data. RESULTS: Here we introduce breakpointR, an R/Bioconductor package specifically tailored to process and interpret single-cell strand-specific sequencing data obtained from Strand-seq. We developed breakpointR to detect local changes in strand directionality of aligned Strand-seq data, to enable fine-mapping of sister chromatid exchanges, germline inversion and to support global haplotype assembly. Given the broad spectrum of Strand-seq applications we expect breakpointR to be an important addition to currently available tools and extend the accessibility of this novel sequencing technique. AVAILABILITY: R/Bioconductor package https://bioconductor.org/packages/breakpointR. AU - Porubsky, D.* AU - Sanders, A.D.* AU - Taudt, A. AU - Colomé-Tatché, M. AU - Lansdorp, P.M.* AU - Guryev, V.* C1 - 57811 C2 - 47908 SP - 1260-1261 TI - breakpointR: An R/Bioconductor package to localize strand state changes in Strand-seq data. JO - Bioinformatics VL - 36 IS - 4 PY - 2020 ER - TY - JOUR AB - Motivation: Approximate Bayesian computation (ABC) is an increasingly popular method for likelihood-free parameter inference in systems biology and other fields of research, as it allows analyzing complex stochastic models. However, the introduced approximation error is often not clear. It has been shown that ABC actually gives exact inference under the implicit assumption of a measurement noise model. Noise being common in biological systems, it is intriguing to exploit this insight. But this is difficult in practice, as ABC is in general highly computationally demanding. 
Thus, the question we want to answer here is how to efficiently account for measurement noise in ABC. Results: We illustrate exemplarily how ABC yields erroneous parameter estimates when neglecting measurement noise. Then, we discuss practical ways of correctly including the measurement noise in the analysis. We present an efficient adaptive sequential importance sampling-based algorithm applicable to various model types and noise models. We test and compare it on several models, including ordinary and stochastic differential equations, Markov jump processes and stochastically interacting agents, and noise models including normal, Laplace and Poisson noise. We conclude that the proposed algorithm could improve the accuracy of parameter estimates for a broad spectrum of applications. AU - Schälte, Y. AU - Hasenauer, J. C1 - 59641 C2 - 48905 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 551-559 TI - Efficient exact inference for dynamical systems with noisy measurements using sequential approximate Bayesian computation. JO - Bioinformatics VL - 36 IS - 1 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - Motivation: Mechanistic models of biochemical reaction networks facilitate the quantitative understanding of biological processes and the integration of heterogeneous datasets. However, some biological processes require the consideration of comprehensive reaction networks and therefore large-scale models. Parameter estimation for such models poses great challenges, in particular when the data are on a relative scale. Results: Here, we propose a novel hierarchical approach combining (i) the efficient analytic evaluation of optimal scaling, offset and error model parameters with (ii) the scalable evaluation of objective function gradients using adjoint sensitivity analysis. 
We evaluate the properties of the methods by parameterizing a pan-cancer ordinary differential equation model (>1000 state variables, >4000 parameters) using relative protein, phosphoprotein and viability measurements. The hierarchical formulation improves optimizer performance considerably. Furthermore, we show that this approach allows estimating error model parameters with negligible computational overhead when no experimental estimates are available, providing an unbiased way to weight heterogeneous data. Overall, our hierarchical formulation is applicable to a wide range of models, and allows for the efficient parameterization of large-scale models based on heterogeneous relative measurements. AU - Schmiester, L. AU - Schälte, Y. AU - Fröhlich, F. AU - Hasenauer, J. AU - Weindl, D. C1 - 56651 C2 - 47220 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 594-602 TI - Efficient parameterization of large-scale dynamic models based on relative measurements. JO - Bioinformatics VL - 36 IS - 2 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - MOTIVATION: Intercellular communication plays an essential role in multicellular organisms and several algorithms to analyze it from single-cell transcriptional data have been recently published, but the results are often hard to visualize and interpret. RESULTS: We developed Cell cOmmunication exploration with MUltiplex NETworks (COMUNET), a tool that streamlines the interpretation of the results from cell-cell communication analyses. COMUNET uses multiplex networks to represent and cluster all potential communication patterns between cell types. The algorithm also enables the search for specific patterns of communication and can perform comparative analysis between two biological conditions. To exemplify its use, here we apply COMUNET to investigate cell communication patterns in single-cell transcriptomic datasets from mouse embryos and from an acute myeloid leukemia patient at diagnosis and after treatment. 
AVAILABILITY AND IMPLEMENTATION: Our algorithm is implemented in an R package available from https://github.com/ScialdoneLab/COMUNET, along with all the code to perform the analyses reported here. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Solovey, M. AU - Scialdone, A. C1 - 59100 C2 - 48570 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 4296-4300 TI - COMUNET: A tool to explore and visualize intercellular communication. JO - Bioinformatics VL - 36 IS - 15 PB - Oxford Univ Press PY - 2020 ER - TY - JOUR AB - Associations of metabolomics data with phenotypic outcomes are expected to span functional modules, which are defined as sets of correlating metabolites that are coordinately regulated. Moreover, these associations occur at different scales, from entire pathways to only a few metabolites; an aspect that has not been addressed by previous methods. Here, we present MoDentify, a free R package to identify regulated modules in metabolomics networks at different layers of resolution. Importantly, MoDentify shows higher statistical power than classical association analysis. Moreover, the package offers direct interactive visualization of the results in Cytoscape. We present an application example using complex, multifluid metabolomics data. Due to its generic character, the method is widely applicable to other types of data. AU - Do, K.T. AU - Rasp, D.J.N.P. AU - Kastenmüller, G. AU - Suhre, K. AU - Krumsiek, J. C1 - 53974 C2 - 45162 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 532-534 TI - MoDentify: Phenotype-driven module identification in metabolomics networks at different resolutions. JO - Bioinformatics VL - 35 IS - 3 PB - Oxford Univ Press PY - 2019 ER - TY - JOUR AB - Motivation: Very low-depth sequencing has been proposed as a cost-effective approach to capture low-frequency and rare variation in complex trait association studies. 
However, a full characterization of the genotype quality and association power for very low-depth sequencing designs is still lacking. Results: We perform cohort-wide whole-genome sequencing (WGS) at low depth in 1239 individuals (990 at 1x depth and 249 at 4x depth) from an isolated population, and establish a robust pipeline for calling and imputing very low-depth WGS genotypes from standard bioinformatics tools. Using genotyping chip, whole-exome sequencing (75x depth) and high-depth (22x) WGS data in the same samples, we examine in detail the sensitivity of this approach, and show that imputed 1x WGS recapitulates 95.2% of variants found by imputed GWAS with an average minor allele concordance of 97% for common and low-frequency variants. In our study, 1x further allowed the discovery of 140844 true low-frequency variants with 73% genotype concordance when compared to high-depth WGS data. Finally, using association results for 57 quantitative traits, we show that very low-depth WGS is an efficient alternative to imputed GWAS chip designs, allowing the discovery of up to twice as many true association signals as the classical imputed GWAS design. Availability and implementation: The HELIC genotype and WGS datasets have been deposited to the European Genome-phenome Archive (https://www.ebi.ac.uk/ega/home): EGAD00010000518; EGAD00010000522; EGAD00010000610; EGAD00001001636, EGAD00001001637. The peakplotter software is available at https://github.com/wtsi-team144/peakplotter, the transformPhenotype app can be downloaded at https://github.com/wtsi-team144/transformPhenotype. Supplementary information: Supplementary data are available at Bioinformatics online. AU - Gilly, A. AU - Southam, L.* AU - Suveges, D.* AU - Kuchenbaecker, K.* AU - Moore, R.* AU - Melloni, G.E.M.* AU - Hatzikotoulas, K. 
AU - Farmaki, A.E.* AU - Ritchie, G.* AU - Schwartzentruber, J.* AU - Danecek, P.* AU - Kilian, B.* AU - Pollard, M.O.* AU - Ge, X.* AU - Tsafantakis, E.* AU - Dedoussis, G.* AU - Zeggini, E. C1 - 56392 C2 - 47001 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 2555-2561 TI - Very low-depth whole-genome sequencing in complex trait association studies. JO - Bioinformatics VL - 35 IS - 15 PB - Oxford Univ Press PY - 2019 ER - TY - JOUR AB - Motivation: The identification of protein targets of novel compounds is essential to understand compounds' mechanisms of action leading to biological effects. Experimental methods to determine these protein targets are usually slow, costly and time consuming. Computational tools have recently emerged as cheaper and faster alternatives that allow the prediction of targets for a large number of compounds. Results: Here, we present HitPickV2, a novel ligand-based approach for the prediction of human druggable protein targets of multiple compounds. For each query compound, HitPickV2 predicts up to 10 targets out of 2739 human druggable proteins. To that aim, HitPickV2 identifies the closest, structurally similar compounds in a restricted space within a vast chemical-protein interaction area, until 10 distinct protein targets are found. Then, HitPickV2 scores these 10 targets based on three parameters of the targets in such space: the Tanimoto coefficient (Tc) between the query and the most similar compound interacting with the target, a target rank that considers Tc and Laplacian-modified naive Bayesian target models scores and a novel parameter introduced in HitPickV2, the number of compounds interacting with each target (occur). We present the performance results of HitPickV2 in cross-validation as well as in an external dataset. Availability and implementation: HitPickV2 is available at www.hitpickv2.com. Supplementary information: Supplementary data are available at Bioinformatics online. AU - Hamad, S. AU - Adornetto, G. 
AU - Naveja, J.J. AU - Ravindranath, A.C. AU - Raffler, J. AU - Campillos, M. C1 - 54213 C2 - 45443 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 1239-1240 TI - HitPickV2: A web server to predict targets of chemical compounds. JO - Bioinformatics VL - 35 IS - 7 PB - Oxford Univ Press PY - 2019 ER - TY - JOUR AB - Motivation: Dynamic models are used in systems biology to study and understand cellular processes like gene regulation or signal transduction. Frequently, ordinary differential equation (ODE) models are used to model the time and dose dependency of the abundances of molecular compounds as well as interactions and translocations. A multitude of computational approaches, e.g. for parameter estimation or uncertainty analysis, have been developed within recent years. However, many of these approaches lack proper testing in application settings because a comprehensive set of benchmark problems is still missing. Results: We present a collection of 20 benchmark problems in order to evaluate new and existing methodologies, where an ODE model with corresponding experimental data is referred to as a problem. In addition to the equations of the dynamical system, the benchmark collection provides observation functions as well as assumptions about measurement noise distributions and parameters. The presented benchmark models comprise problems of different size, complexity and numerical demands. Important characteristics of the models and methodological requirements are summarized, estimated parameters are provided, and some example studies were performed for illustrating the capabilities of the presented benchmark collection. AU - Hass, H.* AU - Loos, C. AU - Raimundez-Alvarez, E. AU - Timmer, J.* AU - Hasenauer, J. AU - Kreutz, C.* C1 - 55661 C2 - 46378 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 3073-3082 TI - Benchmark problems for dynamic modeling of intracellular processes. 
JO - Bioinformatics VL - 35 IS - 17 PB - Oxford Univ Press PY - 2019 ER - TY - JOUR AB - Summary: Despite their fundamental role in various biological processes, the analysis of small RNA sequencing data remains a challenging task. Major obstacles arise when short RNA sequences map to multiple locations in the genome, align to regions that are not annotated or underwent post-transcriptional changes which hamper accurate mapping. In order to tackle these issues, we present a novel profiling strategy that circumvents the need for read mapping to a reference genome by utilizing the actual read sequences to determine expression intensities. After differential expression analysis of individual sequence counts, significant sequences are annotated against user defined feature databases and clustered by sequence similarity. This strategy enables a more comprehensive and concise representation of small RNA populations without any data loss or data distortion. AU - Jeske, T. AU - Huypens, P. AU - Stirm, L. AU - Höckele, S. AU - Wurmser, C.M.* AU - Böhm, A. AU - Weigert, C. AU - Staiger, H. AU - Klein, C.* AU - Beckers, J. AU - Hastreiter, M. C1 - 56385 C2 - 46995 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 4834-4836 TI - DEUS: An R package for accurate small RNA profiling based on differential expression of unique sequences. JO - Bioinformatics VL - 35 IS - 22 PB - Oxford Univ Press PY - 2019 ER - TY - JOUR AB - MOTIVATION: Kinetic models contain unknown parameters that are estimated by optimizing the fit to experimental data. This task can be computationally challenging due to the presence of local optima and ill-conditioning. While a variety of optimization methods have been suggested to surmount these issues, it is difficult to choose the best one for a given problem a priori. 
A systematic comparison of parameter estimation methods for problems with tens to hundreds of optimization variables is currently missing, and smaller studies provided contradictory findings. RESULTS: We use a collection of benchmarks to evaluate the performance of two families of optimization methods: (i) multi-starts of deterministic local searches and (ii) stochastic global optimization metaheuristics; the latter may be combined with deterministic local searches, leading to hybrid methods. A fair comparison is ensured through a collaborative evaluation and a consideration of multiple performance metrics. We discuss possible evaluation criteria to assess the trade-off between computational efficiency and robustness. Our results show that, thanks to recent advances in the calculation of parametric sensitivities, a multi-start of gradient-based local methods is often a successful strategy, but a better performance can be obtained with a hybrid metaheuristic. The best performer combines a global scatter search metaheuristic with an interior point local method, provided with gradients estimated with adjoint-based sensitivities. We provide an implementation of this method to render it available to the scientific community. AVAILABILITY AND IMPLEMENTATION: The code to reproduce the results is provided as Supplementary Material and is available at Zenodo https://doi.org/10.5281/zenodo.1304034. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Villaverde, A.F.* AU - Fröhlich, F. AU - Weindl, D. AU - Hasenauer, J. AU - Banga, J.R.* C1 - 55598 C2 - 46417 SP - 830-838 TI - Benchmarking optimization methods for parameter estimation in large kinetic models. JO - Bioinformatics VL - 35 IS - 5 PY - 2019 ER - TY - JOUR AB - Motivation: Mathematical models have become standard tools for the investigation of cellular processes and the unraveling of signal processing mechanisms. 
The parameters of these models are usually derived from the available data using optimization and sampling methods. However, the efficiency of these methods is limited by the properties of the mathematical model, e.g. nonidentifiabilities, and the resulting posterior distribution. In particular, multi-modal distributions with long valleys or pronounced tails are difficult to optimize and sample. Thus, the development or improvement of optimization and sampling methods is subject to ongoing research. Results: We suggest a region-based adaptive parallel tempering algorithm which adapts to the problem-specific posterior distributions, i.e. modes and valleys. The algorithm combines several established algorithms to overcome their individual shortcomings and to improve sampling efficiency. We assessed its properties for established benchmark problems and two ordinary differential equation models of biochemical reaction networks. The proposed algorithm outperformed state-of-the-art methods in terms of calculation efficiency and mixing. Since the algorithm does not rely on a specific problem structure, but adapts to the posterior distribution, it is suitable for a variety of model classes. AU - Ballnus, B. AU - Schaper, S.* AU - Theis, F.J. AU - Hasenauer, J. C1 - 54057 C2 - 45261 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 494-501 TI - Bayesian parameter estimation for biochemical reaction networks using region-based adaptive parallel tempering. JO - Bioinformatics VL - 34 IS - 13 PB - Oxford Univ Press PY - 2018 ER - TY - JOUR AB - Likelihood-free methods are often required for inference in systems biology. While approximate Bayesian computation (ABC) provides a theoretical solution, its practical application has often been challenging due to its high computational demands. To scale likelihood-free inference to computationally demanding stochastic models, we developed pyABC: a distributed and scalable ABC-Sequential Monte Carlo (ABC-SMC) framework. 
It implements a scalable, runtime-minimizing parallelization strategy for multi-core and distributed environments scaling to thousands of cores. The framework is accessible to non-expert users and also enables advanced users to experiment with and to custom implement many options of ABC-SMC schemes, such as acceptance threshold schedules, transition kernels and distance functions without alteration of pyABC's source code. pyABC includes a web interface to visualize ongoing and finished ABC-SMC runs and exposes an API for data querying and post-processing. AU - Klinger, E. AU - Rickert, D. AU - Hasenauer, J. C1 - 53523 C2 - 44900 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 3591-3593 TI - pyABC: Distributed, likelihood-free inference. JO - Bioinformatics VL - 34 IS - 20 PB - Oxford Univ Press PY - 2018 ER - TY - JOUR AB - Motivation: Mathematical models are nowadays important tools for analyzing dynamics of cellular processes. The unknown model parameters are usually estimated from experimental data. These data often only provide information about the relative changes between conditions, hence, the observables contain scaling parameters. The unknown scaling parameters and corresponding noise parameters have to be inferred along with the dynamic parameters. The nuisance parameters often increase the dimensionality of the estimation problem substantially and cause convergence problems. Results: In this manuscript, we propose a hierarchical optimization approach for estimating the parameters for ordinary differential equation (ODE) models from relative data. Our approach restructures the optimization problem into an inner and outer subproblem. These subproblems possess lower dimensions than the original optimization problem, and the inner problem can be solved analytically. We evaluated accuracy, robustness and computational efficiency of the hierarchical approach by studying three signaling pathways. 
The proposed approach achieved better convergence than the standard approach and required a lower computation time. As the hierarchical optimization approach is widely applicable, it provides a powerful alternative to established approaches. AU - Loos, C. AU - Krause, S. AU - Hasenauer, J. C1 - 53953 C2 - 45127 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 4266-4273 TI - Hierarchical optimization for the efficient parametrization of ODE models. JO - Bioinformatics VL - 34 IS - 24 PB - Oxford Univ Press PY - 2018 ER - TY - JOUR AB - Summary: PESTO is a widely applicable and highly customizable toolbox for parameter estimation in MathWorks MATLAB. It offers scalable algorithms for optimization, uncertainty and identifiability analysis, which work in a very generic manner, treating the objective function as a black box. Hence, PESTO can be used for any parameter estimation problem, for which the user can provide a deterministic objective function in MATLAB. Availability and implementation: PESTO is a MATLAB toolbox, freely available under the BSD license. The source code, along with extensive documentation and example code, can be downloaded from https://github.com/ICB-DCM/PESTO/. Contact: jan.hasenauer@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online. AU - Stapor, P. AU - Weindl, D. AU - Ballnus, B. AU - Hug, S. AU - Loos, C. AU - Fiedler, A. AU - Krause, S. AU - Hross, S. AU - Fröhlich, F. AU - Hasenauer, J. C1 - 52221 C2 - 43849 CY - Oxford SP - 705-707 TI - PESTO: Parameter EStimation TOolbox. JO - Bioinformatics VL - 34 IS - 4 PB - Oxford Univ Press PY - 2018 ER - TY - JOUR AB - Motivation: Parameter estimation methods for ordinary differential equation (ODE) models of biological processes can exploit gradients and Hessians of objective functions to achieve convergence and computational efficiency. 
However, the computational complexity of established methods to evaluate the Hessian scales linearly with the number of state variables and quadratically with the number of parameters. This limits their application to low-dimensional problems. Results: We introduce second order adjoint sensitivity analysis for the computation of Hessians and a hybrid optimization-integration-based approach for profile likelihood computation. Second order adjoint sensitivity analysis scales linearly with the number of parameters and state variables. The Hessians are effectively exploited by the proposed profile likelihood computation approach. We evaluate our approaches on published biological models with real measurement data. Our study reveals an improved computational efficiency and robustness of optimization compared to established approaches, when using Hessians computed with adjoint sensitivity analysis. The hybrid computation method was more than 2-fold faster than the best competitor. Thus, the proposed methods and implemented algorithms allow for the improvement of parameter estimation for medium and large scale ODE models. AU - Stapor, P. AU - Fröhlich, F. AU - Hasenauer, J. C1 - 54056 C2 - 45262 CY - Great Clarendon St, Oxford Ox2 6dp, England SP - 151-159 TI - Optimization and profile calculation of ODE models using second order adjoint sensitivity analysis. JO - Bioinformatics VL - 34 IS - 13 PB - Oxford Univ Press PY - 2018 ER - TY - JOUR AB - Motivation: The identification of heterogeneities in cell populations by utilizing single-cell technologies such as single-cell RNA-Seq, enables inference of cellular development and lineage trees. Several methods have been proposed for such inference from high-dimensional single-cell data. They typically assign each cell to a branch in a differentiation trajectory. 
However, they commonly assume specific geometries such as tree-like developmental hierarchies and lack statistically sound methods to decide on the number of branching events. Results: We present K-Branches, a solution to the above problem by locally fitting half-lines to single-cell data, introducing a clustering algorithm similar to K-Means. These half-lines are proxies for branches in the differentiation trajectory of cells. We propose a modified version of the GAP statistic for model selection, in order to decide on the number of lines that best describe the data locally. In this manner, we identify the location and number of subgroups of cells that are associated with branching events and full differentiation, respectively. We evaluate the performance of our method on single-cell RNA-Seq data describing the differentiation of myeloid progenitors during hematopoiesis, single-cell qPCR data of mouse blastocyst development, single-cell qPCR data of human myeloid monocytic leukemia and artificial data. Availability: An R implementation of K-Branches is freely available at https://github.com/theislab/kbranches. AU - Chlis, N.-K. AU - Wolf, F.A. AU - Theis, F.J. C1 - 51246 C2 - 42970 CY - Oxford SP - 3211-3219 TI - Model-based branching point detection in single-cell data by K-Branches clustering. JO - Bioinformatics VL - 33 IS - 20 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Modelling biological associations or dependencies using linear regression is often complicated when the analyzed data-sets are high-dimensional and fewer observations than variables are available. Recently proposed regression models utilize prior knowledge on dependencies, e.g. in the form of graphs, arguing that this information will lead to more reliable estimates for regression coefficients. However, none of the proposed models for multivariate genomic response variables have been implemented as a computationally efficient, freely available library. 
In this paper we propose netReg, a package for graph-penalized regression models that use large networks and thousands of variables. netReg incorporates a priori generated biological graph information into linear models yielding sparse or smooth solutions for regression coefficients. netReg is implemented as both an R package and a C++ command-line tool. The main computations are done in C++, where we use Armadillo for fast matrix calculations and Dlib for optimization. The R package is freely available on https://bioconductor.org/packages/netReg. The command-line tool can be installed using the conda channel Bioconda. Installation details, issue reports, development versions, documentation and tutorials for the R and C++ versions and the R package vignette can be found on GitHub at https://dirmeier.github.io/netReg/. The GitHub page also contains code for benchmarking and example datasets used in this paper. AU - Dirmeier, S.* AU - Fuchs, C. AU - Müller, N.S. AU - Theis, F.J. C1 - 52229 C2 - 43848 CY - Oxford SP - 896-898 TI - netReg: Network-regularized linear models for biological association studies. JO - Bioinformatics VL - 34 IS - 5 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Motivation: Ordinary differential equation (ODE) models are frequently used to describe the dynamic behaviour of biochemical processes. Such ODE models are often extended by events to describe the effect of fast latent processes on the process dynamics. To exploit the predictive power of ODE models, their parameters have to be inferred from experimental data. For models without events, gradient based optimization schemes perform well for parameter estimation, when sensitivity equations are used for gradient computation. Yet, sensitivity equations for models with parameter- and state-dependent events and event-triggered observations are not supported by existing toolboxes. 
Results: In this manuscript, we describe the sensitivity equations for differential equation models with events and demonstrate how to estimate parameters from event-resolved data using event-triggered observations in parameter estimation. We consider a model for GFP expression after transfection and a model for spiking neurons and demonstrate that we can improve computational efficiency and robustness of parameter estimation by using sensitivity equations for systems with events. Moreover, we demonstrate that, by using event-outputs, it is possible to consider event-resolved data, such as time-to-event data, for parameter estimation with ODE models. By providing a user-friendly, modular implementation in the toolbox AMICI, the developed methods are made publicly available and can be integrated in other systems biology toolboxes. Availability and Implementation: We implement the methods in the open-source toolbox Advanced MATLAB Interface for CVODES and IDAS (AMICI, https://github.com/ICB-DCM/AMICI). AU - Fröhlich, F. AU - Theis, F.J. AU - Rädler, J.O.* AU - Hasenauer, J. C1 - 50367 C2 - 42173 CY - Oxford SP - 1049-1056 TI - Parameter estimation for dynamical systems with discrete events and logical operations. JO - Bioinformatics VL - 33 IS - 7 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Analysis of Next Generation Sequencing (NGS) data requires the processing of large datasets by chaining various tools with complex input and output formats. In order to automate data analysis, we propose to standardize NGS tasks into modular workflows. This simplifies reliable handling and processing of NGS data, and corresponding solutions become substantially more reproducible and easier to maintain. Here, we present a documented, Linux-based toolbox of 42 processing modules that are combined to construct workflows facilitating a variety of tasks such as DNAseq and RNAseq analysis. We also describe important technical extensions. 
The high throughput executor (HTE) helps to increase the reliability and to reduce manual interventions when processing complex datasets. We also provide a dedicated binary manager that assists users in obtaining the modules' executables and keeping them up to date. As basis for this actively developed toolbox we use the workflow management software KNIME. AU - Hastreiter, M. AU - Jeske, T. AU - Hoser, J.D.S. AU - Kluge, M. AU - Ahomaa, K. AU - Friedl, M.-S. AU - Kopetzky, S.J. AU - Quell, J. AU - Mewes, H.-W. AU - Küffner, R. C1 - 50655 C2 - 42770 CY - Oxford SP - 1565-1567 TI - KNIME4NGS: A comprehensive toolbox for next generation sequencing analysis. JO - Bioinformatics VL - 33 IS - 10 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Motivation: Quantitative large-scale cell microscopy is widely used in biological and medical research. Such experiments produce huge amounts of image data and thus require automated analysis. However, automated detection of cell outlines (cell segmentation) is typically challenging due to, e.g., high cell densities, cell-to-cell variability and low signal-to-noise ratios. Results: Here, we evaluate accuracy and speed of various state-of-the-art approaches for cell segmentation in light microscopy images using challenging real and synthetic image data. The results vary between datasets and show that the tested tools are either not robust enough or computationally expensive, thus limiting their application to large-scale experiments. We therefore developed fastER, a trainable tool that is orders of magnitude faster while producing state-of-the-art segmentation quality. It supports various cell types and image acquisition modalities, but is easy-to-use even for non-experts: it has no parameters and can be adapted to specific image sets by interactively labelling cells for training. 
As a proof of concept, we segment and count cells in over 200,000 brightfield images (1388 × 1040 pixels each) from a six day time-lapse microscopy experiment; identification of over 46,000,000 single cells requires only about two and a half hours on a desktop computer. AU - Hilsenbeck, O.* AU - Schwarzfischer, M. AU - Loeffler, D.* AU - Dimopoulos, S.* AU - Hastreiter, S.* AU - Marr, C. AU - Theis, F.J. AU - Schroeder, T.* C1 - 50733 C2 - 42479 CY - Oxford SP - 2020-2028 TI - fastER: A user-friendly tool for ultrafast and robust cell segmentation in large-scale microscopy. JO - Bioinformatics VL - 33 IS - 13 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - MOTIVATION: Cross-reactivity or invocation of autoimmune side effects in various tissues has important safety implications in adoptive immunotherapy directed against selected antigens. The ability to predict cross-reactivity (on-target and off-target toxicities) may help in the early selection of safer therapeutically relevant target antigens. RESULTS: We developed a methodology for the calculation of quantitative cross-reactivity for any defined peptide epitope. Using this approach, we performed assessment of four groups of 283 currently known human MHC-class-I epitopes including differentiation antigens, overexpressed proteins, cancer-testis (CT) antigens, and mutations displayed by tumor cells. In addition, 89 epitopes originating from viral sources were investigated. The natural occurrence of these epitopes in human tissues was assessed based on proteomics abundance data, while the probability of their presentation by MHC-class-I molecules was modeled by the method of Kesmir et al. (2002), which combines proteasomal cleavage, TAP affinity and MHC-binding predictions. The results of these analyses for many previously defined peptides are presented as cross-reactivity indices and tissue profiles. 
The methodology thus allows for quantitative comparisons of epitopes, and is suggested to be suited for the assessment of epitopes of candidate antigens in an early stage of development of adoptive immunotherapy. AVAILABILITY: Our method is implemented as a Java program, with curated datasets stored in a MySQL database. It predicts all naturally possible self-antigens for a given sequence of a therapeutic antigen (or epitope), and after filtering for predicted immunogenicity outputs results as an index and profile of cross-reactivity to the self-antigens in 22 human tissues. AU - Jaravine, V.* AU - Raffegerst, S.* AU - Schendel, D.J.* AU - Frishman, D. C1 - 49455 C2 - 32404 CY - Oxford SP - 104-111 TI - Assessment of cancer and virus antigens for cross-reactivity in human tissues. JO - Bioinformatics VL - 33 IS - 1 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Motivation: Stochastic molecular processes are a leading cause of cell-to-cell variability. Their dynamics are often described by continuous-time discrete-state Markov chains and simulated using stochastic simulation algorithms. As these stochastic simulations are computationally demanding, ordinary differential equation models for the dynamics of the statistical moments have been developed. The number of state variables of these approximating models, however, grows at least quadratically with the number of biochemical species. This limits their application to small- and medium-sized processes. Results: In this article, we present a scalable moment-closure approximation (sMA) for the simulation of statistical moments of large-scale stochastic processes. The sMA exploits the structure of the biochemical reaction network to reduce the covariance matrix. We prove that sMA yields approximating models whose number of state variables depends predominantly on local properties, i.e. the average node degree of the reaction network, instead of the overall network size. 
The resulting complexity reduction is assessed by studying a range of medium- and large-scale biochemical reaction networks. To evaluate the approximation accuracy and the improvement in computational efficiency, we study models for JAK2/STAT5 signalling and NF-κB signalling. Our method is applicable to generic biochemical reaction networks and we provide an implementation, including an SBML interface, which renders the sMA easily accessible. AU - Kazeroonian, A. AU - Theis, F.J. AU - Hasenauer, J. C1 - 51595 C2 - 43271 CY - Oxford SP - i293-i300 TI - A scalable moment-closure approximation for large-scale biochemical reaction networks. JO - Bioinformatics VL - 33 IS - 14 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - Motivation Mathematical modeling using ordinary differential equations is used in systems biology to improve the understanding of dynamic biological processes. The parameters of ordinary differential equation models are usually estimated from experimental data. To analyze a priori the uniqueness of the solution of the estimation problem, structural identifiability analysis methods have been developed. Results We introduce GenSSI 2.0, an advancement of the software toolbox GenSSI (Generating Series for testing Structural Identifiability). GenSSI 2.0 is the first toolbox for structural identifiability analysis to implement Systems Biology Markup Language import, state/parameter transformations and multi-experiment structural identifiability analysis. In addition, GenSSI 2.0 supports a range of MATLAB versions and is computationally more efficient than its previous version, enabling the analysis of more complex models. AU - Ligon, T.S.* AU - Fröhlich, F. AU - Chi, O.T.* AU - Banga, J.R.* AU - Balsa-Canto, E.* AU - Hasenauer, J. C1 - 52836 C2 - 44196 SP - 1421-1423 TI - GenSSI 2.0: Multi-experiment structural identifiability analysis of SBML models. 
JO - Bioinformatics VL - 34 IS - 8 PY - 2017 ER - TY - JOUR AB - Motivation: Dynamics of cellular processes are often studied using mechanistic mathematical models. These models possess unknown parameters which are generally estimated from experimental data assuming normally distributed measurement noise. Outlier corruption of datasets often cannot be avoided. These outliers may distort the parameter estimates, resulting in incorrect model predictions. Robust parameter estimation methods are required which provide reliable parameter estimates in the presence of outliers. Results: In this manuscript, we propose and evaluate methods for estimating the parameters of ordinary differential equation (ODE) models from outlier-corrupted data. As alternatives to the normal distribution as noise distribution, we consider the Laplace, the Huber, the Cauchy and the Student’s t distribution. We assess accuracy, robustness and computational efficiency of estimators using these different distribution assumptions. To this end, we consider artificial data of a conversion process, as well as published experimental data for Epo-induced JAK/STAT signaling. We study how well the methods can compensate and discover artificially introduced outliers. Our evaluation reveals that using alternative distributions improves the robustness of parameter estimates. Availability: The MATLAB implementation of the likelihood functions using the distribution assumptions is available at Bioinformatics online. AU - Maier, C. AU - Loos, C. AU - Hasenauer, J. C1 - 49864 C2 - 41991 CY - Oxford SP - 1-8 TI - Robust parameter estimation for dynamical systems from outlier-corrupted data. 
JO - Bioinformatics VL - 33 IS - 5 PB - Oxford Univ Press PY - 2017 ER - TY - JOUR AB - MOTIVATION: LD score regression is a reliable and efficient method of using genome-wide association study (GWAS) summary-level results data to estimate the SNP heritability of complex traits and diseases, partition this heritability into functional categories, and estimate the genetic correlation between different phenotypes. Because the method relies on summary level results data, LD score regression is computationally tractable even for very large sample sizes. However, publicly available GWAS summary-level data are typically stored in different databases and have different formats, making it difficult to apply LD score regression to estimate genetic correlations across many different traits simultaneously. RESULTS: In this manuscript, we describe LD Hub - a centralized database of summary-level GWAS results for 173 diseases/traits from different publicly available resources/consortia and a web interface that automates the LD score regression analysis pipeline. To demonstrate functionality and validate our software, we replicated previously reported LD score regression analyses of 49 traits/diseases using LD Hub; and estimated SNP heritability and the genetic correlation across the different phenotypes. We also present new results obtained by uploading a recent atopic dermatitis GWAS meta-analysis to examine the genetic correlation between the condition and other potentially related traits. In response to the growing availability of publicly accessible GWAS summary-level results data, our database and the accompanying web interface will ensure maximal uptake of the LD score regression methodology, provide a useful database for the public dissemination of GWAS results, and provide a method for easily screening hundreds of traits for overlapping genetic aetiologies. 
AVAILABILITY AND IMPLEMENTATION: The web interface and instructions for using LD Hub are available at http://ldsc.broadinstitute.org/ CONTACT: jie.zheng@bristol.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online. AU - Zheng, J.* AU - Erzurumluoglu, A.M.* AU - Elsworth, B.L.* AU - Kemp, J.P.* AU - Howe, L.* AU - Haycock, P.C.* AU - Hemani, G.* AU - Tansey, K.* AU - Laurin, C.* AU - St. Pourcain, B.* AU - Warrington, N.M.* AU - Finucane, H.K.* AU - Price, A.L.* AU - Bulik-Sullivan, B.* AU - Anttila, V.* AU - Paternoster, L.* AU - Gaunt, T.R.* AU - Evans, D.M.* AU - Neale, B.M.* AU - EUMODIC Consortium (Heinrich, J.) C1 - 50516 C2 - 42306 SP - 272-279 TI - LD Hub: A centralized database and web interface to perform LD score regression that maximizes the potential of summary level GWAS data for SNP heritability and genetic correlation analysis. JO - Bioinformatics VL - 33 IS - 2 PY - 2017 ER - TY - JOUR AB - Diffusion maps are a spectral method for non-linear dimension reduction and have recently been adapted for the visualization of single cell expression data. Here we present destiny, an efficient R implementation of the diffusion map algorithm. Our package includes a single-cell specific noise model allowing for missing and censored values. In contrast to previous implementations, we further present an efficient nearest-neighbour approximation that allows for the processing of hundreds of thousands of cells and a functionality for projecting new data on existing diffusion maps. We exemplarily apply destiny to a recent time-resolved mass cytometry dataset of cellular reprogramming. AVAILABILITY AND IMPLEMENTATION: destiny is an open-source R/Bioconductor package http://bioconductor.org/packages/destiny also available at https://www.helmholtz-muenchen.de/icb/destiny. A detailed vignette describing functions and workflows is provided with the package. AU - Angerer, P. AU - Haghverdi, L. AU - Büttner, M. AU - Theis, F.J. 
AU - Marr, C. AU - Buettner, F. C1 - 47552 C2 - 40663 CY - Oxford SP - 1241-1243 TI - Destiny: Diffusion maps for large-scale single-cell data in R. JO - Bioinformatics VL - 32 IS - 8 PB - Oxford Univ Press PY - 2016 ER - TY - JOUR AB - Motivation: The statistical analysis of single-cell data is a challenge in cell biological studies. Tailored statistical models and computational methods are required to resolve the subpopulation structure, i.e. to correctly identify and characterize subpopulations. These approaches also support the unraveling of sources of cell-to-cell variability. Finite mixture models have shown promise, but the available approaches are ill suited to the simultaneous consideration of data from multiple experimental conditions and to censored data. The prevalence and relevance of single-cell data and the lack of suitable computational analytics make automated methods, that are able to deal with the requirements posed by these data, necessary. Results: We present MEMO, a flexible mixture modeling framework that enables the simultaneous, automated analysis of censored and uncensored data acquired under multiple experimental conditions. MEMO is based on maximum-likelihood inference and allows for testing competing hypotheses. MEMO can be applied to a variety of different single-cell data types. We demonstrate the advantages of MEMO by analyzing right and interval censored single-cell microscopy data. Our results show that an examination of censoring and the simultaneous consideration of different experimental conditions are necessary to reveal biologically meaningful subpopulation structures. MEMO allows for a stringent analysis of single-cell data and enables researchers to avoid misinterpretation of censored data. Therefore, MEMO is a valuable asset for all fields that infer the characteristics of populations by looking at single individuals such as cell biology and medicine. 
Availability: MEMO is implemented in MATLAB and freely available via GitHub (https://github.com/MEMO-toolbox/MEMO). AU - Geissen, E.M.* AU - Hasenauer, J. AU - Heinrich, S.* AU - Hauf, S.* AU - Theis, F.J. AU - Radde, N.* C1 - 48516 C2 - 41109 CY - Oxford SP - 2464-2472 TI - MEMO - multi-experiment mixture model analysis of censored data. JO - Bioinformatics VL - 32 IS - 16 PB - Oxford Univ Press PY - 2016 ER - TY - JOUR AB - MOTIVATION: In vitro and in vivo cell proliferation is often studied using the dye carboxyfluorescein succinimidyl ester (CFSE). The CFSE time-series data provide information about the proliferation history of populations of cells. While the experimental procedures are well established and widely used, the analysis of CFSE time-series data is still challenging. Many available analysis tools do not account for cell age and employ optimization methods that are inefficient (or even unreliable). RESULTS: We present a new model-based analysis method for CFSE time-series data. This method uses a flexible description of proliferating cell populations, namely, a division-, age- and label-structured population model. Efficient maximum likelihood and Bayesian estimation algorithms are introduced to infer the model parameters and their uncertainties. These methods exploit the forward sensitivity equations of the underlying partial differential equation model for efficient and accurate gradient calculation, thereby improving computational efficiency and reliability compared with alternative approaches and accelerating uncertainty analysis. The performance of the method is assessed by studying a dataset for immune cell proliferation. This revealed the importance of different factors on the proliferation rates of individual cells. Among others, the predominant effect of cell age on the division rate is found, which was not revealed by available computational methods. 
AVAILABILITY AND IMPLEMENTATION: The MATLAB source code implementing the models and algorithms is available from http://janhasenauer.github.io/ShAPE-DALSP/ Contact: jan.hasenauer@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online. AU - Hross, S. AU - Hasenauer, J. C1 - 48539 C2 - 41131 CY - Oxford SP - 2321-2329 TI - Analysis of CFSE time-series data using division-, age- and label-structured population models. JO - Bioinformatics VL - 32 IS - 15 PB - Oxford Univ Press PY - 2016 ER - TY - JOUR AB - MOTIVATION: Linking genes and functional information to genetic variants identified by association studies remains difficult. Resources containing extensive genomic annotations are available but often not fully utilized due to heterogeneous data formats. To enhance their accessibility, we integrated many annotation datasets into a user-friendly webserver. Availability and implementation: http://www.snipa.org/ CONTACT: g.kastenmueller@helmholtz-muenchen.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Arnold, M. AU - Raffler, J. AU - Pfeufer, A. AU - Suhre, K. AU - Kastenmüller, G. C1 - 42854 C2 - 35814 CY - Oxford SP - 1334-1336 TI - SNiPA: An interactive, genetic variant-centered annotation browser. JO - Bioinformatics VL - 31 IS - 8 PB - Oxford Univ Press PY - 2015 ER - TY - JOUR AB - MOTIVATION: Adoptive T cell therapy based on the introduction of new T cell receptors (TCRs) into patient recipient T cells is a promising new treatment for various kinds of cancers. A major challenge, however, is the choice of target antigens. If an engineered TCR can cross-react with self-antigens in healthy tissue, the side-effects can be devastating. We present the first webserver for assessing epitope sharing when designing new potential lead targets. We enable the users to find all known proteins containing their peptide of interest. 
The web server returns not only exact matches, but also approximate ones, allowing a number of mismatches of the user's choice. For the identified candidate proteins the expression values in various healthy tissues, representing all vital human organs, are extracted from RNA-Seq data as well as from some cancer tissues as control. All results are returned to the user sorted by a score, which is calculated using well-established methods and tools for immunological predictions. It depends on the probability that the epitope is created by proteasomal cleavage and its affinities to the TAP transporter and the MHC class I alleles. With this framework we hope to provide a helpful tool to exclude potential cross-reactivity in the early stage of TCR selection for use in design of adoptive T cell immunotherapy. Availability: The Expitope web server can be accessed via http://webclu.bio.wzw.tum.de/expitope. CONTACT: haase@wzw.tum.de. AU - Haase, K.* AU - Raffegerst, S.H. AU - Schendel, D.J. AU - Frishman, D. C1 - 43222 C2 - 37263 CY - Oxford SP - 1854-1856 TI - Expitope: A web server for epitope expression. JO - Bioinformatics VL - 31 IS - 11 PB - Oxford Univ Press PY - 2015 ER - TY - JOUR AB - MOTIVATION: Single-cell technologies have recently gained popularity in cellular differentiation studies regarding their ability to resolve potential heterogeneities in cell populations. Analysing such high-dimensional single-cell data has its own statistical and computational challenges. Popular multivariate approaches are based on data normalisation, followed by dimension reduction and clustering to identify subgroups. However, in the case of cellular differentiation, we would not expect clear clusters to be present but instead expect the cells to follow continuous branching lineages. RESULTS: Here we propose the use of diffusion maps to deal with the problem of defining differentiation trajectories. 
We adapt this method to single-cell data by adequate choice of kernel width and inclusion of uncertainties or missing measurement values, which enables the establishment of a pseudo-temporal ordering of single cells in a high-dimensional gene expression space. We expect this output to reflect cell differentiation trajectories, where the data originates from intrinsic diffusion-like dynamics. Starting from a pluripotent stage, cells move smoothly within the transcriptional landscape towards more differentiated states with some stochasticity along their path. We demonstrate the robustness of our method with respect to extrinsic noise (e.g. measurement noise) and sampling density heterogeneities on simulated toy data as well as two single-cell quantitative polymerase chain reaction (qPCR) data sets (i.e. mouse haematopoietic stem cells and mouse embryonic stem cells) and an RNA-Seq data set of human pre-implantation embryos. We show that diffusion maps perform considerably better than Principal Component Analysis (PCA) and are advantageous over other techniques for non-linear dimension reduction such as t-distributed Stochastic Neighbour Embedding (t-SNE) for preserving the global structures and pseudotemporal ordering of cells. AVAILABILITY: The Matlab implementation of diffusion maps for single-cell data is available at https://www.helmholtz-muenchen.de/icb/single-cell-diffusion-map. CONTACT: fbuettner.phys@gmail.com, fabian.theis@helmholtz-muenchen.de. AU - Haghverdi, L. AU - Buettner, F. AU - Theis, F.J. C1 - 44922 C2 - 37113 SP - 2989-2998 TI - Diffusion maps for high-dimensional single-cell analysis of differentiation data. JO - Bioinformatics VL - 31 IS - 18 PY - 2015 ER - TY - JOUR AB - MOTIVATION: Extracellular vesicles are spherical bilayered proteolipids, harboring various bioactive molecules. 
Due to the complexity of the vesicular nomenclatures and components, online searches for extracellular vesicle-related publications and vesicular components are currently challenging. RESULTS: We present an improved version of EVpedia, a public database for extracellular vesicles research. This community web portal contains a database of publications and vesicular components, identification of orthologous vesicular components, bioinformatic tools, and a personalized function. EVpedia includes 6,879 publications, 172,080 vesicular components from 263 high-throughput datasets, and has been accessed >65,000 times from >750 cities. In addition, about 350 members from 73 international research groups have participated in developing EVpedia. This free web-based database might serve as a useful resource to stimulate the emerging field of extracellular vesicle research. Availability and implementation: The web site was implemented in PHP, Java, MySQL and Apache, and is freely available at http://evpedia.info. CONTACT: ysgho@postech.ac.kr. 
AU - Kim, D.K.* AU - Lee, J.* AU - Kim, S.R.* AU - Choi, D.S.* AU - Yoon, Y.J.* AU - Kim, J.H.* AU - Go, G.* AU - Nhung, D.* AU - Hong, K.* AU - Jang, S.C.* AU - Kim, S.H.* AU - Park, K.S.* AU - Kim, O.Y.* AU - Park, H.T.* AU - Seo, J.H.* AU - Aikawa, E.* AU - Baj-Krzyworzeka, M.* AU - van Balkom, B.W.* AU - Belting, M.* AU - Blanc, L.* AU - Bond, V.* AU - Bongiovanni, A.* AU - Borràs, F.E.* AU - Buée, L.* AU - Buzás, E.I.* AU - Cheng, L.* AU - Clayton, A.* AU - Cocucci, E.* AU - Dela Cruz, C.S.* AU - Desiderio, D.M.* AU - di Vizio, D.* AU - Ekström, K.* AU - Falcón-Pérez, J.M.* AU - Gardiner, C.* AU - Giebel, B.* AU - Greening, D.W.* AU - Gross, J.C.* AU - Gupta, D.* AU - Hendrix, A.* AU - Hill, A.F.* AU - Hill, M.M.* AU - Nolte-'t Hoen, E.* AU - Hwang, D.W.* AU - Inal, J.* AU - Jagannadham, M.V.* AU - Jayachandran, M.* AU - Jee, Y.K.* AU - Jørgensen, M.* AU - Kim, K.P.* AU - Kim, Y.K.* AU - Kislinger, T.* AU - Lässer, C.* AU - Lee, D.S.* AU - Lee, H.* AU - van Leeuwen, J.* AU - Lener, T.* AU - Liu, M.L.* AU - Lötvall, J.* AU - Marcilla, A.* AU - Mathivanan, S. AU - Möller, A.* AU - Morhayim, J.* AU - Mullier, F.* AU - Nazarenko, I.* AU - Nieuwland, R.* AU - Nunes, D.N.* AU - Pang, K.* AU - Park, J.* AU - Patel, T.* AU - Pocsfalvi, G.* AU - del Portillo, H.* AU - Putz, U.* AU - Ramirez, M.I.* AU - Rodrigues, M.L.* AU - Roh, T.Y.* AU - Royo, F.* AU - Sahoo, S.* AU - Schiffelers, R.* AU - Sharma, S.* AU - Siljander, P.* AU - Simpson, R.J.* AU - Soekmadji, C.* AU - Stahl, P.* AU - Stensballe, A.* AU - Stępień, E.* AU - Tahara, H.* AU - Trummer, A.* AU - Valadi, H.* AU - Vella, L.J.* AU - Wai, S.N.* AU - Witwer, K.* AU - Yáñez-Mó, M.* AU - Youn, H.* AU - Zeidler, R. AU - Gho, Y.S.* C1 - 42751 C2 - 35331 CY - Oxford SP - 933-939 TI - EVpedia: A community web portal for extracellular vesicles research. 
JO - Bioinformatics VL - 31 IS - 6 PB - Oxford Univ Press PY - 2015 ER - TY - JOUR AB - With the widespread availability of high-throughput experimental technologies it has become possible to study hundreds to thousands of cellular factors simultaneously, such as coding- or non-coding mRNA or protein concentrations. Still, extracting information about the underlying regulatory or signaling interactions from these data remains a difficult challenge. We present a flexible approach towards network inference based on linear programming. Our method reconstructs the interactions of factors from a combination of perturbation/non-perturbation and steady-state/time-series data. We show both on simulated and real data that our methods are able to reconstruct the underlying networks fast and efficiently, thus shedding new light on biological processes and, in particular, on disease mechanisms of action. We have implemented the approach as an R package available through Bioconductor. AVAILABILITY AND IMPLEMENTATION: This R package is freely available under the GNU General Public License (GPL-3) from bioconductor.org (http://bioconductor.org/packages/release/bioc/html/lpNet.html) and is compatible with most operating systems (Windows, Linux, Mac OS) and hardware architectures. CONTACT: bettina.knapp@helmholtz-muenchen.de. AU - Matos, M.R.* AU - Knapp, B. AU - Kaderali, L.* C1 - 45057 C2 - 39105 SP - 3231-3233 TI - lpNet: A linear programming approach to reconstruct signal transduction networks. JO - Bioinformatics VL - 31 IS - 19 PY - 2015 ER - TY - JOUR AB - MOTIVATION: High-dimensional single-cell snapshot data are becoming widespread in the systems biology community, as a means to understand biological processes at the cellular level. However, as temporal information is lost with such data, mathematical models have been limited to capture only static features of the underlying cellular mechanisms. 
RESULTS: Here, we present a modular framework which allows recovering the temporal behaviour from single-cell snapshot data and reverse engineering the dynamics of gene expression. The framework combines a dimensionality reduction method with a cell time-ordering algorithm to generate pseudo time-series observations. These are in turn used to learn transcriptional ODE models and do model selection on structural network features. We apply it to synthetic data and then to real hematopoietic stem cell data, to reconstruct gene expression dynamics during differentiation pathways and infer the structure of a key gene regulatory network. AVAILABILITY AND IMPLEMENTATION: C++ and Matlab code available at https://www.helmholtz-muenchen.de/fileadmin/ICB/software/inferenceSnapshot.zip. CONTACT: fabian.theis@helmholtz-muenchen.de SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. AU - Ocone, A. AU - Haghverdi, L. AU - Müller, N.S. AU - Theis, F.J. C1 - 45238 C2 - 37277 CY - Oxford SP - i89-i96 TI - Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. JO - Bioinformatics VL - 31 IS - 12 PB - Oxford Univ Press PY - 2015 ER - TY - JOUR AB - Motivation: Experimentally determined gene regulatory networks can be enriched by computational inference from high-throughput expression profiles. However, the prediction of regulatory interactions is severely impaired by indirect and spurious effects, particularly for eukaryotes. Recently, published methods report improved predictions by exploiting the a priori known targets of a regulator (its local topology) in addition to expression profiles. Results: We find that methods exploiting known targets show an unexpectedly high rate of false discoveries. This leads to inflated performance estimates and the prediction of an excessive number of new interactions for regulators with many known targets. 
These issues are hidden from common evaluation and cross-validation setups, which is due to Simpson's paradox. We suggest a confidence score recalibration method (CoRe) that reduces the false discovery rate and enables a reliable performance estimation. Conclusions: CoRe considerably improves the results of network inference methods that exploit known targets. Predictions then display the biological process specificity of regulators more correctly and enable the inference of accurate genome-wide regulatory networks in eukaryotes. For yeast, we propose a network with more than 22 000 confident interactions. We point out that machine learning approaches outside of the area of network inference may be affected as well. AU - Petri, T.* AU - Altmann, S.* AU - Geistlinger, L.* AU - Zimmer, R.* AU - Küffner, R. C1 - 46790 C2 - 37816 SP - 2836-2843 TI - Addressing false discoveries in network inference. JO - Bioinformatics VL - 31 IS - 17 PY - 2015 ER - TY - JOUR AB - Modeling of dynamical systems using ordinary differential equations is a popular approach in the field of Systems Biology. Two of the most critical steps in this approach are to construct dynamical models of biochemical reaction networks for large data sets and complex experimental conditions and to perform efficient and reliable parameter estimation for model fitting. We present a modeling environment for MATLAB that pioneers these challenges. The numerically expensive parts of the calculations such as the solving of the differential equations and of the associated sensitivity system are parallelized and automatically compiled into efficient C code. A variety of parameter estimation algorithms as well as frequentist and Bayesian methods for uncertainty analysis have been implemented and used on a range of applications that lead to publications. AVAILABILITY AND IMPLEMENTATION: The Data2Dynamics modeling environment is MATLAB based, open source and freely available at http://www.data2dynamics.org. 
CONTACT: andreas.raue@fdm.uni-freiburg.de SUPPLEMENTARY INFORMATION: is provided online and contains detailed description of methodology, a user guide and documentation. AU - Raue, A.* AU - Steiert, B.* AU - Schelker, M.* AU - Kreutz, C.* AU - Maiwald, T.* AU - Hass, H.* AU - Vanlier, J.* AU - Tönsing, C.* AU - Adlung, L.* AU - Engesser, R.* AU - Mader, W.* AU - Heinemann, T.* AU - Hasenauer, J. AU - Schilling, M.* AU - Höfer, T.* AU - Klipp, E.* AU - Theis, F.J. AU - Klingmüller, U.* AU - Schöberl, B.* AU - Timmer, J.* C1 - 45681 C2 - 37421 SP - 3558-3560 TI - Data2Dynamics: A modeling environment tailored to parameter estimation in dynamical systems. JO - Bioinformatics VL - 31 IS - 21 PY - 2015 ER - TY - JOUR AB - SUMMARY: Decreasing costs of modern high-throughput experiments allow for the simultaneous analysis of altered gene activity on various molecular levels. However, these multi-omics approaches lead to a large amount of data which is hard to interpret for a non-bioinformatician. Here, we present the remotely accessible multilevel ontology analysis (RAMONA). It offers an easy-to-use interface for the simultaneous gene set analysis of combined omics datasets and is an extension of the previously introduced MONA approach. RAMONA is based on a Bayesian enrichment method for the inference of overrepresented biological processes among given gene sets. Overrepresentation is quantified by interpretable term probabilities. It is able to handle data from various molecular levels, while in parallel coping with redundancies arising from gene set overlaps and related multiple testing problems. The comprehensive output of RAMONA is easy to interpret and thus allows for functional insight into the affected biological processes. With RAMONA, we provide an efficient implementation of the Bayesian inference problem such that ontologies consisting of thousands of terms can be processed in the order of seconds. 
Availability and Implementation: RAMONA is implemented as an ASP.NET web application and is publicly available at http://icb.helmholtz-muenchen.de/ramona. CONTACT: steffen.sass@helmholtz-muenchen.de. AU - Sass, S. AU - Buettner, F. AU - Müller, N.S. AU - Theis, F.J. C1 - 32347 C2 - 35010 CY - Oxford SP - 128-130 TI - RAMONA: A web application for gene set analysis on multilevel omics data. JO - Bioinformatics VL - 31 IS - 1 PB - Oxford Univ Press PY - 2015 ER - TY - JOUR AB - Motivation: Several cancer types consist of multiple genetically and phenotypically distinct subpopulations. The underlying mechanism for this intra-tumoral heterogeneity can be explained by the clonal evolution model, whereby growth advantageous mutations cause the expansion of cancer cell subclones. The recurrent phenotype of many cancers may be a consequence of these coexisting subpopulations responding unequally to therapies. Methods to computationally infer tumor evolution and subpopulation diversity are emerging and they hold the promise to improve the understanding of genetic and molecular determinants of recurrence. Results: To address cellular subpopulation dynamics within human tumors, we developed a bioinformatic method, EXPANDS. It estimates the proportion of cells harboring specific mutations in a tumor. By modeling cellular frequencies as probability distributions, EXPANDS predicts mutations that accumulate in a cell before its clonal expansion. We assessed the performance of EXPANDS on one whole genome sequenced breast cancer and performed SP analyses on 118 glioblastoma multiforme samples obtained from TCGA. Our results inform about the extent of subclonal diversity in primary glioblastoma, subpopulation dynamics during recurrence and provide a set of candidate genes mutated in the most well-adapted subpopulations. In summary, EXPANDS predicts tumor purity and subclonal composition from sequencing data. AU - Andor, N. AU - Harness, J.V.* AU - Müller, S.* AU - Mewes, H.-W. 
AU - Petritsch, C.* C1 - 29113 C2 - 33683 CY - Oxford SP - 50-60 TI - EXPANDS: Expanding ploidy and allele frequency on nested subpopulations. JO - Bioinformatics VL - 30 IS - 1 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - MOTIVATION: High-throughput single-cell qPCR is a promising technique allowing for new insights in complex cellular processes. However, the PCR reaction can only be detected up to a certain detection limit, while failed reactions could be due to low or absent expression and the true expression level is unknown. As this censoring can occur for high proportions of the data, it is one of the main challenges when dealing with single-cell qPCR data. PCA is an important tool for visualising the structure of high-dimensional data as well as for identifying sub-populations of cells. However, to date it is not clear how to perform a PCA of censored data. We present a probabilistic approach which accounts for the censoring and evaluate it for two typical data-sets containing single-cell qPCR data. RESULTS: We use the Gaussian Process Latent Variable Model (GPLVM) framework to account for censoring by introducing an appropriate noise model and allowing a different kernel for each dimension. We evaluate this new approach for two typical qPCR data-sets (of mouse embryonic stem cells and blood stem/progenitor cells respectively) by performing linear and non-linear probabilistic PCA. Taking the censoring into account results in a 2D representation of the data which better reflects its known structure: in both data-sets our new approach results in a better separation of known cell types and is able to reveal subpopulations in one data-set which could not be resolved using standard PCA. AVAILABILITY: The implementation was based on the existing GPLVM toolbox(1); extensions for noise models and kernels accounting for censoring are available from http://icb.helmholtz-muenchen.de/censgplvm. AU - Buettner, F. 
AU - Moignard, V.* AU - Göttgens, B.* AU - Theis, F.J. C1 - 30847 C2 - 33951 CY - Oxford SP - 1867-1875 TI - Probabilistic PCA of censored data: Accounting for uncertainties in the visualisation of high-throughput single-cell qPCR data. JO - Bioinformatics VL - 30 IS - 13 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - MOTIVATION: Although the integration and analysis of the activity of small molecules across multiple chemical screens is a common approach to determine the specificity and toxicity of hits, the suitability of these approaches to reveal novel biological information is less explored. Here, we test the hypothesis that assays sharing selective hits are biologically related. RESULTS: We annotated the biological activities (i.e. biological processes or molecular activities) measured in assays and constructed chemical hit profiles with sets of compounds differing in their selectivity level for 1640 assays of the ChemBank repository. We compared the similarity of chemical hit profiles of pairs of assays with their biological relationships and observed that assay pairs sharing non-promiscuous chemical hits tend to be biologically related. A detailed analysis of a network containing assay pairs with the highest hit similarity confirmed biologically meaningful relationships. Furthermore, the biological roles of predicted molecular targets of the shared hits reinforced the biological associations between assay pairs. CONTACT: monica.campillos@helmholtz-muenchen.de Supplementary information: Supplementary data are available at Bioinformatics online. AU - Liu, X. AU - Campillos, M. C1 - 31973 C2 - 34937 CY - Oxford SP - i579-i586 TI - Unveiling new biological relationships using shared hits of chemical screening assay pairs. JO - Bioinformatics VL - 30 IS - 17 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - Computer-assisted studies of structure, function, and evolution of viruses remain a neglected area of research. 
The attention of bioinformaticians to this interesting and challenging field is far from commensurate with its medical and biotechnological importance. It is very telling that out of over 200 talks held at ISMB 2013, the largest international bioinformatics conference, only one presentation explicitly dealt with viruses. In contrast to many broad, established and well organized bioinformatics communities (e.g. structural genomics, ontologies, next-generation sequencing, expression analysis), research groups focusing on viruses can probably be counted on the fingers of two hands. The purpose of this review is to increase awareness among bioinformatics researchers about the pressing needs and unsolved problems of computational virology. We focus primarily on RNA viruses that pose problems to many standard bioinformatics analyses due to their compact genome organization, fast mutation rate, and low evolutionary conservation. We provide an overview of tools and algorithms for handling viral sequencing data, detecting functionally important RNA structures, classifying viral proteins into families, and investigating the origin and evolution of viruses. AU - März, M.* AU - Beerenwinkel, N.* AU - Drosten, C.* AU - Fricke, M.* AU - Frishman, D. AU - Hofacker, I.L.* AU - Hoffmann, D.* AU - Middendorf, M.* AU - Rattei, T.* AU - Stadler, P.F.* AU - Töpfer, A.* C1 - 30770 C2 - 33848 CY - Oxford SP - 1793-1799 TI - Challenges in RNA virus bioinformatics. JO - Bioinformatics VL - 30 IS - 13 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - MOTIVATION: RNA-seq techniques generate massive amounts of expression data. Several pipelines (e.g. Tophat and Cufflinks) are broadly applied to analyse these data sets. However, accessing and handling the analytical output remains challenging for non-experts. 
RESULTS: We present the RNASeqExpressionBrowser, an open-source web interface that can be used to access the output from RNA-seq expression analysis packages in different ways as it allows browsing for genes by identifiers, annotations or sequence similarity. Gene expression information can be loaded as long as it is represented in a matrix-like format. Additionally, data can be made available by setting up the tool on a public server. For demonstration purposes, we have set up a version providing expression information from the barley genome. AVAILABILITY: The source code and a show case are accessible at: http://mips.helmholtz-muenchen.de/plant/RNASeqExpressionBrowser/. AU - Nussbaumer, T. AU - Kugler, K.G. AU - Bader, K.C. AU - Sharma, S. AU - Seidel, M. AU - Mayer, K.F.X. C1 - 31291 C2 - 34316 CY - Oxford SP - 2519-2520 TI - RNASeqExpressionBrowser - a web interface to browse and visualize high-throughput expression data. JO - Bioinformatics VL - 30 IS - 17 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - MOTIVATION: Diseases and adverse drug reactions are frequently caused by disruptions in gene functionality. Gaining insight into the global system properties governing the relationships between genotype and phenotype is thus crucial to understand and interfere with perturbations in complex organisms such as disease states. RESULTS: We present a systematic analysis of phenotypic information of 5,047 perturbations of single genes in mice, 4,766 human diseases, and 1,666 drugs that examines the relationships between different gene properties and the phenotypic impact at the organ system level in mammalian organisms. We observe that while single gene perturbations and alterations of nonessential, tissue-specific genes or those with low betweenness centrality in protein-protein interaction networks often show organ-specific effects, multiple gene alterations resulting e.g. from complex disorders and drug treatments have a more widespread impact. 
Interestingly, certain cellular localizations are distinctly associated with systemic effects in monogenic disease genes and mouse gene perturbations, such as the lumen of intracellular organelles and transcription factor complexes, respectively. In summary, we show that the broadness of the phenotypic effect is clearly related to certain gene properties and is an indicator of the severity of perturbations. This work contributes to the understanding of gene properties influencing the systemic effects of diseases and drugs. AU - Vogt, I. AU - Prinz, J. AU - Worf, K.* AU - Campillos, M. C1 - 31814 C2 - 34779 CY - Oxford SP - 3093-3100 TI - Systematic analysis of gene properties influencing organ system phenotypes in mammalian perturbations. JO - Bioinformatics VL - 30 IS - 21 PB - Oxford Univ Press PY - 2014 ER - TY - JOUR AB - MOTIVATION: High-throughput phenotypic assays reveal information about the molecules that modulate biological processes such as a disease phenotype and a signaling pathway. In these assays, the identification of hits along with their molecular targets is critical to understand the chemical activities modulating the biological system. Here, we present HitPick, a web server for identification of hits in high-throughput chemical screenings and prediction of their molecular targets. HitPick applies the B-score method for hit identification and a newly developed approach combining 1-Nearest-Neighbour (1NN) similarity searching and Laplacian-modified naïve Bayesian target models to predict targets of identified hits. The performance of the HitPick web server is presented and discussed. AVAILABILITY: The server can be accessed at http://mips.helmholtz-muenchen.de/proj/hitpick CONTACT: monica.campillos@helmholtz-muenchen.de. AU - Liu, X. AU - Vogt, I. AU - Haque, T. AU - Campillos, M. C1 - 24714 C2 - 31648 SP - 1910-1912 TI - HitPick: A web server for hit identification and target prediction of chemical screenings. 
JO - Bioinformatics VL - 29 IS - 15 PB - Oxford Univ. Press PY - 2013 ER - TY - JOUR AB - MOTIVATION: In sequencing studies of common diseases and quantitative traits, power to test rare and low frequency variants individually is weak. To improve power, a common approach is to combine statistical evidence from several genetic variants in a region. Major challenges are how to do the combining and which statistical framework to use. General approaches for testing association between rare variants and quantitative traits include aggregating genotypes and trait values, referred to as 'collapsing', or using a score-based variance component test. However, little attention has been paid to alternative models tailored for protein truncating variants. Recent studies have highlighted the important role that protein truncating variants, commonly referred to as 'loss of function' variants, may have on disease susceptibility and quantitative levels of biomarkers. We propose a Bayesian modelling framework for the analysis of protein truncating variants and quantitative traits. RESULTS: Our simulation results show that our models have an advantage over the commonly used methods. We apply our models to sequence and exome-array data and discover strong evidence of association between low plasma triglyceride levels and protein truncating variants at APOC3 (Apolipoprotein C3). AVAILABILITY: Software is available from http://www.well.ox.ac.uk/~rivas/mamba AU - Rivas, M.A.* AU - Pirinen, M.* AU - Neville, M.J.* AU - Gaulton, K.J.* AU - Moutsianas, L.* AU - GoT2D Consortium (Gieger, C. AU - Grallert, H. AU - Hrabě de Angelis, M. AU - Huth, C. AU - Kriebel, J. AU - Meisinger, C. AU - Meitinger, T. AU - Müller-Nurasyid, M. AU - Peters, A. AU - Rathmann, W. AU - Ried, J.S. AU - Strauch, K. AU - Donnelly, P.) AU - Lindgren, C.M.* AU - Karpe, F.* AU - McCarthy, M.I.* C1 - 43142 C2 - 36017 SP - 2419-2426 TI - Assessing association between protein truncating variants and quantitative traits. 
JO - Bioinformatics VL - 29 IS - 19 PY - 2013 ER - TY - JOUR AB - Motivation: Single-cell experiments of cells from the early mouse embryo yield gene expression data for different developmental stages from zygote to blastocyst. To better understand cell fate decisions during differentiation, it is desirable to analyse the high-dimensional gene expression data and assess differences in gene expression patterns between different developmental stages as well as within developmental stages. Conventional methods include univariate analyses of distributions of genes at different stages or multivariate linear methods such as principal component analysis (PCA). However, these approaches often fail to resolve important differences as each lineage has a unique gene expression pattern which changes gradually over time yielding different gene expressions both between different developmental stages as well as heterogeneous distributions at a specific stage. Furthermore, to date, no approach taking the temporal structure of the data into account has been presented. Results: We present a novel framework based on Gaussian process latent variable models (GPLVMs) to analyse single-cell qPCR expression data of 48 genes from mouse zygote to blastocyst as presented by (Guo et al., 2010). We extend GPLVMs by introducing gene relevance maps and gradient plots to provide interpretability as in the linear case. Furthermore, we take the temporal group structure of the data into account and introduce a new factor in the GPLVM likelihood which ensures that small distances are preserved for cells from the same developmental stage. Using our novel framework, it is possible to resolve differences in gene expressions for all developmental stages. Furthermore, a new subpopulation of cells within the 16-cell stage is identified which is significantly more trophectoderm-like than the rest of the population. 
The trophectoderm-like subpopulation was characterized by considerable differences in the expression of Id2, Gata4 and, to a smaller extent, Klf4 and Hand1. The relevance of Id2 as an early marker for TE cells is consistent with previously published results. AU - Buettner, F. AU - Theis, F.J. C1 - 10442 C2 - 30247 SP - i626-i632 TI - A novel approach for resolving differences in single-cell gene expression patterns from zygote to blastocyst. JO - Bioinformatics VL - 28 IS - 18 PB - Oxford Univ. Press PY - 2012 ER - TY - JOUR AB - MOTIVATION: Pairing between the target sequence and the 6-8 nt long seed sequence of the miRNA presents the most important feature for miRNA target site prediction. Novel high-throughput technologies such as Argonaute HITS-CLIP now afford a detailed study of miRNA:mRNA duplexes. These interaction maps enable a first discrimination between functional and non-functional target sites in a bulky fashion. Prediction algorithms apply different seed paradigms to identify miRNA target sites. Therefore, a quantitative assessment of miRNA target site prediction is of major interest. RESULTS: We identified a set of canonical seed types based on a transcriptome wide analysis of experimentally verified functional target sites. We confirmed the specificity of long seeds but we found that the majority of functional target sites are formed by less specific seeds of only 6 nt indicating a crucial role of this type. A substantial fraction of genuine target sites are non-conserved. Moreover, the majority of functional sites remain uncovered by common prediction methods. AU - Ellwanger, D.C. AU - Büttner, F.A. AU - Mewes, H.-W. AU - Stuempflen, V. C1 - 6665 C2 - 29069 SP - 1346-1350 TI - The sufficient minimal set of miRNA seed types. JO - Bioinformatics VL - 27 IS - 10 PB - Oxford Univ. Press PY - 2011 ER - TY - JOUR AU - Smialowski, P. AU - Frishman, D. AU - Kramer, S.* C1 - 899 C2 - 26842 SP - 440-443 TI - Pitfalls of supervised feature selection. 
JO - Bioinformatics VL - 26 IS - 3 PB - Oxford Univ. Press PY - 2010 ER - TY - JOUR AB - The DICS database is a dynamic web repository of computationally predicted functional modules from the human protein-protein interaction network. It provides references to the CORUM, DrugBank, KEGG and Reactome pathway databases. DICS can be accessed for retrieving sets of overlapping modules and protein complexes that are significantly enriched in a gene list, thereby providing valuable information about the functional context. AU - Dietmann, S. AU - Georgii, E.* AU - Antonov, A.* AU - Tsuda, K.* AU - Mewes, H.-W. C1 - 2146 C2 - 27005 SP - 830-831 TI - The DICS repository: Module-assisted analysis of disease-related gene lists. JO - Bioinformatics VL - 25 IS - 6 PB - Oxford Univ. Press PY - 2009 ER - TY - JOUR AB - Motivation: Modern systems biology aims at understanding how the different molecular components of a biological cell interact. Often, cellular functions are performed by complexes consisting of many different proteins. The composition of these complexes may change according to the cellular environment, and one protein may be involved in several different processes. The automatic discovery of functional complexes from protein interaction data is challenging. While previous approaches use approximations to extract dense modules, our approach exactly solves the problem of dense module enumeration. Furthermore, constraints from additional information sources such as gene expression and phenotype data can be integrated, so we can systematically mine for dense modules with interesting profiles. Results: Given a weighted protein interaction network, our method discovers all protein sets that satisfy a user-defined minimum density threshold. We employ a reverse search strategy, which allows us to exploit the density criterion in an efficient way. Our experiments show that the novel approach is feasible and produces biologically meaningful results. 
In comparative validation studies using yeast data, the method achieved the best overall prediction performance with respect to confirmed complexes. Moreover, by enhancing the yeast network with phenotypic and phylogenetic profiles and the human network with tissue-specific expression data, we identified condition-dependent complex variants. AU - Georgii, E.* AU - Dietmann, S. AU - Uno, T.* AU - Pagel, P. AU - Tsuda, K.* C1 - 564 C2 - 27004 SP - 933-940 TI - Enumeration of condition-dependent dense modules in protein interaction networks. JO - Bioinformatics VL - 25 IS - 7 PB - Oxford Univ. Press PY - 2009 ER - TY - JOUR AB - Cross-mapping of gene and protein identifiers between different databases is a tedious and time-consuming task. To overcome this, we developed CRONOS, a cross-reference server that contains entries from five mammalian organisms presented by major gene and protein information resources. Sequence similarity analysis of the mapped entries shows that the cross-references are highly accurate. In total, up to 18 different identifier types can be used for identification of cross-references. The quality of the mapping could be improved substantially by exclusion of ambiguous gene and protein names which were manually validated. Organism-specific lists of ambiguous terms, which are valuable for a variety of bioinformatics applications like text mining, are available for download. AU - Wägele, B. AU - Dunger-Kaltenbach, I. AU - Fobo, G. AU - Montrone, C. AU - Mewes, H.-W. AU - Ruepp, A. C1 - 2386 C2 - 25961 SP - 141-143 TI - CRONOS: The cross-reference navigation server. JO - Bioinformatics VL - 25 IS - 1 PB - Oxford Univ. Press PY - 2009 ER - TY - JOUR AB - Motivation: In principle, an organism's ability to survive in a specific environment is an observable result of the organism's regulatory and metabolic capabilities. Nonetheless, current knowledge about the global relation of the metabolisms and the niches of organisms is still limited. 
Results: In order to further investigate this relation, we grouped species showing similar metabolic capabilities and systematically mapped their habitats onto these groups. For this purpose, we predicted the metabolic capabilities for 214 sequenced genomes. Based on these predictions, we grouped the genomes by hierarchical clustering. Finally, we mapped different environmental conditions and diseases related to the genomes onto the resulting clusters. This mapping uncovered several conditions and diseases that were unexpectedly enriched in clusters of metabolically similar species. As an example, Encephalitozoon cuniculi, a microsporidian causing a multisystemic disease accompanied by CNS problems in rabbits, occurred in the same metabolism-based cluster as bacteria causing similar symptoms in humans. AU - Kastenmüller, G. AU - Gasteiger, J.* AU - Mewes, H.-W. C1 - 1879 C2 - 25940 SP - i56-i62 TI - An environmental perspective on large-scale genome clustering based on metabolic capabilities. JO - Bioinformatics VL - 24 IS - 16 PB - Oxford Univ. Press PY - 2008 ER - TY - JOUR AB - Modern machine learning methods based on matrix decomposition techniques, like independent component analysis (ICA) or non-negative matrix factorization (NMF), provide new and efficient analysis tools which are currently explored to analyze gene expression profiles. These exploratory feature extraction techniques yield expression modes (ICA) or metagenes (NMF). These extracted features are considered indicative of underlying regulatory processes. They can as well be applied to the classification of gene expression datasets by grouping samples into different categories for diagnostic purposes or group genes into functional categories for further investigation of related metabolic pathways and regulatory networks. RESULTS: In this study we focus on unsupervised matrix factorization techniques and apply ICA and sparse NMF to microarray datasets. 
The latter monitor the gene expression levels of human peripheral blood cells during differentiation from monocytes to macrophages. We show that these tools are able to identify relevant signatures in the deduced component matrices and extract informative sets of marker genes from these gene expression profiles. The methods rely on the joint discriminative power of a set of marker genes rather than on single marker genes. With these sets of marker genes, corroborated by leave-one-out or random forest cross-validation, the datasets could easily be classified into related diagnostic categories. The latter correspond to either monocytes versus macrophages or healthy vs Niemann Pick C disease patients. AU - Schachtner, R.* AU - Lutter, D. AU - Knollmüller, P.* AU - Tomé, A.M.* AU - Theis, F.J. AU - Schmitz, G.* AU - Stetter, M.* AU - Vilda, P.G.* AU - Lang, E.W.* C1 - 2772 C2 - 25530 SP - 1688-1697 TI - Knowledge-based gene expression classification via matrix factorization. JO - Bioinformatics VL - 24 IS - 15 PB - Oxford Univ. Press PY - 2008 ER - TY - JOUR AB - Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. RESULTS: The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. 
The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes. AVAILABILITY: The FUNcat Annotation Tool (FUNAT) is available on-line as Web Services at http://mips.gsf.de/proj/funat. AU - Tetko, I.V. AU - Rodchenkov, I. AU - Walter, M.C. AU - Rattei, T.* AU - Mewes, H.-W. C1 - 794 C2 - 25514 SP - 621-628 TI - Beyond the 'best' match: Machine learning annotation of protein sequences by integration of different sources of information. JO - Bioinformatics VL - 24 IS - 5 PB - Oxford Univ. Press PY - 2008 ER - TY - JOUR AB - Gepard provides a user-friendly, interactive application for the quick creation of dotplots. It utilizes suffix arrays to reduce the time complexity of dotplot calculation to Θ(m*log n). A client–server mode, which is a novel feature for dotplot creation software, allows the user to calculate dotplots and color them by functional annotation without any prior downloading of sequence or annotation data. AU - Krumsiek, J. AU - Arnold, R. AU - Rattei, T.* C1 - 5816 C2 - 24573 SP - 1026-1028 TI - Gepard: A rapid and sensitive tool for creating dotplots on genome scale. JO - Bioinformatics VL - 23 IS - 8 PB - Oxford Univ. Press PY - 2007 ER - TY - JOUR AB - Conserved domains represent essential building blocks of most known proteins. Owing to their role as modular components carrying out specific functions they form a network based both on functional relations and direct physical interactions. 
We have previously shown that domain interaction networks provide substantially novel information with respect to networks built on full-length protein chains. In this work we present a comprehensive web resource for exploring the Domain Interaction MAp (DIMA), interactively. The tool aims at integration of multiple data sources and prediction techniques, two of which have been implemented so far: domain phylogenetic profiling and experimentally demonstrated domain contacts from known three-dimensional structures. A powerful yet simple user interface enables the user to compute, visualize, navigate and download domain networks based on specific search criteria. AU - Pagel, P. AU - Oesterheld, M. AU - Stuempflen, V. AU - Frishman, D. C1 - 5484 C2 - 24115 SP - 997-998 TI - The DIMA web resource-exploring the protein domain network. JO - Bioinformatics VL - 22 IS - 8 PB - Oxford Univ. Press PY - 2006 ER - TY - JOUR AB - Some plant microRNAs have been shown to be de novo generated by inverted duplication from their target genes. Subsequent duplication events potentially generate multigene microRNA families. Within this article we provide supportive evidence for the inverted duplication model of plant microRNA evolution. First, we report that the precursors of four Arabidopsis thaliana microRNA families, miR157, miR158, miR405 and miR447 share nearly identical nucleotide sequences throughout the whole miRNA precursor between the family members. The extent and degree of sequence conservation is suggestive of recent evolutionary duplication events. Furthermore we found that sequence similarities are not restricted to the transcribed part but extend into the promoter regions. Thus the duplication event most probably included the promoter regions as well. Conserved elements in upstream regions of miR163 and its targets were also detected. This implies that the inverted duplication of target genes, at least in certain cases, had included the promoters of the target genes. 
Sequence conservation within promoters of miRNA families as well as between miRNA and its potential progenitor gene can be exploited for understanding the regulation of microRNA genes. AU - Wang, Y. AU - Hindemitt, T. AU - Mayer, K.F.X. C1 - 4567 C2 - 23982 SP - 2585-2589 TI - Significant sequence similarities in promoters and precursors of Arabidopsis thaliana non-conserved microRNAs. JO - Bioinformatics VL - 22 IS - 21 PB - Oxford Univ. Press PY - 2006 ER - TY - JOUR AB - Motivation: Sequence similarity searches are of great importance in bioinformatics. Exhaustive searches for homologous proteins in databases are computationally expensive and can be replaced by a database of pre-calculated homologies in many cases. Retrieving similarities from an incrementally updated database instead of repeatedly recalculating them should provide homologs much faster and free computational resources for other purposes. Results: We have implemented SIMAP, a database containing the similarity space formed by almost all amino acid sequences from public databases and completely sequenced genomes. The database is capable of handling very large datasets and allows incremental updates. We have implemented a powerful backbone for similarity computation, which is based on FASTA heuristics. By providing WWW interfaces as well as web services, we make our data accessible to the worldwide community. We have also adapted procedures to detect putative orthologs as example applications. Availability: The SIMAP portal page providing links to SIMAP services is publicly available: http://mips.gsf.de/services/analysis/simap/. The web services can be accessed under http://mips.gsf.de/proj/hobitws/services/RPCSimapService?wsdl and http://mips.gsf.de/proj/hobitws/services/DocSimapService?wsdl. AU - Arnold, R. AU - Rattei, T.* AU - Tischler, P. AU - Truong, M.-D. AU - Stuempflen, V. AU - Mewes, H.-W. C1 - 3930 C2 - 23419 SP - 42-46 TI - SIMAP - The similarity matrix of proteins. 
JO - Bioinformatics VL - 21 IS - 2 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - MOTIVATION: Millions of protein sequences currently being deposited to sequence databanks will never be annotated manually. Similarity-based annotation generated by automatic software pipelines unavoidably contains spurious assignments due to the imperfection of bioinformatics methods. Examples of such annotation errors include over- and underpredictions caused by the use of fixed recognition thresholds and incorrect annotations caused by transitivity based information transfer to unrelated proteins or transfer of errors already accumulated in databases. One of the most difficult and timely challenges in bioinformatics is the development of intelligent systems aimed at improving the quality of automatically generated annotation. A possible approach to this problem is to detect anomalies in annotation items based on association rule mining. RESULTS: We present the first large-scale analysis of association rules derived from two large protein annotation databases, Swiss-Prot and PEDANT, and reveal novel, previously unknown tendencies of rule strength distributions. Most of the rules are either very strong or very weak, with rules in the medium strength range being relatively infrequent. Based on dynamics of error correction in subsequent Swiss-Prot releases and on our own manual analysis we demonstrate that exceptions from strong rules are, indeed, significantly enriched in annotation errors and can be used to automatically flag them. We identify different strength dependencies of rules derived from different fields in Swiss-Prot. A compositional breakdown of association rules generated from PEDANT in terms of their constituent items indicates that most of the errors that can be corrected are related to gene functional roles. 
Swiss-Prot errors are usually caused by under-annotation owing to its conservative approach, whereas automatically generated PEDANT annotation suffers from over-annotation. AVAILABILITY: All data generated in this study are available for download and browsing at http://pedant.gsf.de/ARIA/index.htm. AU - Artamonova, I.I. AU - Frishman, G. AU - Gelfand, M.S.* AU - Frishman, D. C1 - 2424 C2 - 23315 SP - 49-57 TI - Mining sequence annotation databanks for association patterns. JO - Bioinformatics VL - 21 IS - 3 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - MOTIVATION: Discovery of host and pathogen genes expressed at the plant-pathogen interface often requires the construction of mixed libraries that contain sequences from both genomes. Sequence identification requires high-throughput and reliable classification of genome origin. When using single-pass cDNA sequences difficulties arise from the short sequence length, the lack of sufficient taxonomically relevant sequence data in public databases and ambiguous sequence homology between plant and pathogen genes. RESULTS: A novel method is described, which is independent of the availability of homologous genes and relies on subtle differences in codon usage between plant and fungal genes. We used support vector machines (SVMs) to identify the probable origin of sequences. SVMs were compared to several other machine learning techniques and to a probabilistic algorithm (PF-IND) for expressed sequence tag (EST) classification also based on codon bias differences. Our software (Eclat) has achieved a classification accuracy of 93.1% on a test set of 3217 EST sequences from Hordeum vulgare and Blumeria graminis, which is a significant improvement compared to PF-IND (prediction accuracy of 81.2% on the same test set). EST sequences with at least 50 nt of coding sequence can be classified using Eclat with high confidence. 
Eclat allows training of classifiers for any host-pathogen combination for which there are sufficient classified training sequences. AVAILABILITY: Eclat is freely available on the Internet (http://mips.gsf.de/proj/est) or on request as a standalone version. AU - Friedel, C.C.* AU - Jahn, K.H.* AU - Sommer, S.* AU - Rudd, S.* AU - Mewes, H.-W. AU - Tetko, I.V. C1 - 2921 C2 - 23421 SP - 1383-1388 TI - Support vector machines for separation of mixed plant-pathogen EST collections based on codon usage. JO - Bioinformatics VL - 21 IS - 8 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - CREDO is a user-friendly, web-based tool that integrates the analysis and results of different algorithms widely used for the computational detection of conserved sequence motifs in noncoding sequences. It enables easy comparison of the individual results. CREDO offers intuitive interfaces for easy and rapid configuration of the applied algorithms and convenient views on the results in graphical and tabular formats. AU - Hindemitt, T. AU - Mayer, K.F.X. C1 - 5527 C2 - 23354 SP - 4304-4306 TI - CREDO: A web-based tool for computational detection of conserved sequence motifs in noncoding sequences. JO - Bioinformatics VL - 21 IS - 23 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - SUMMARY: The MIPS mammalian protein-protein interaction database (MPPI) is a new resource of high-quality experimental protein interaction data in mammals. The content is based on published experimental evidence that has been processed by human expert curators. We provide the full dataset for download and a flexible and powerful web interface for users with various requirements. AU - Pagel, P. AU - Kovac, S. AU - Oesterheld, M. AU - Brauner, B. AU - Dunger-Kaltenbach, I. AU - Frishman, G. AU - Montrone, C. AU - Mark, P.* AU - Stuempflen, V. AU - Mewes, H.-W. AU - Ruepp, A. C1 - 4622 C2 - 22434 SP - 832-834 TI - The MIPS mammalian protein-protein interaction database. 
JO - Bioinformatics VL - 21 IS - 6 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - Motivation: Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. Results: The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. AU - Tetko, I.V. AU - Brauner, B. AU - Dunger-Kaltenbach, I. AU - Frishman, G. AU - Montrone, C. AU - Fobo, G. AU - Ruepp, A. AU - Antonov, A.V. C1 - 2781 C2 - 26120 SP - 2520-2521 TI - MIPS bacterial genomes functional annotation benchmark dataset. JO - Bioinformatics VL - 21 IS - 10 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - Motivation: Owing to its increased tag length, LongSAGE tags are expected to be more reliable in direct assignment to genome sequences. Therefore, we evaluated the use of LongSAGE data in genome annotation by using our LongSAGE dataset of 202 015 tags (consisting of 41 718 unique tags), experimentally generated from mouse embryonic tail libraries. Results: A fraction of LongSAGE tags could not be unambiguously assigned to its gene, due to the presence of widely conserved sequences downstream of particular CATG anchor sites. 
The presence of alternative forms of transcripts was confirmed in 45% of all detected genes. Surprisingly, a large fraction of LongSAGE tags with hits to the genome (66%) could not be assigned to any gene annotated in EnsEMBL. Among such cases, 2098 LongSAGE tags fell into a region containing a putative gene predicted by GenScan, providing experimental evidence for the presence of real genes, while 9112 genes were found out to be left out or wrongly annotated by the EnsEMBL pipeline. Conclusions: LongSAGE transcriptome data can significantly improve the genome annotation by identifying novel genes and alternative transcripts, even in the case of thus far best-characterized organisms like the mouse. AU - Wahl, M.B. AU - Heinzmann, U. AU - Imai, K. C1 - 1009 C2 - 22698 SP - 1393-1400 TI - LongSAGE analysis significantly improves genome annotation: identifications of novel genes and alternative transcripts in the mouse. JO - Bioinformatics VL - 21 IS - 8 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - MOTIVATION: Despite the increasing notions of the functional importance of antisense transcripts in gene regulation, the genome-wide overview on the ontology of antisense genes has not been obtained. Therefore, we tried to find novel antisense genes genome-wide by using our LongSAGE dataset of 202 015 tags (consisting of 41 718 unique tags), experimentally generated from mouse embryonic tail libraries. RESULTS: We identified 1260 potential antisense genes, of which 1001 are not annotated in EnsEMBL, thereby being regarded as novel. Interestingly their sense counterparts were co-expressed in the majority of the cases. CONCLUSIONS: The use of LongSAGE transcriptome data is extremely powerful in the identification of thus-far unknown antisense transcripts, even in the case of well-characterized organisms like the mouse. AU - Wahl, M.B. AU - Heinzmann, U. AU - Imai, K. 
C1 - 3590 C2 - 22697 SP - 1389-1392 TI - LongSAGE analysis revealed the presence of a large number of novel antisense genes in the mouse genome. JO - Bioinformatics VL - 21 IS - 8 PB - Oxford Univ. Press PY - 2005 ER - TY - JOUR AB - MOTIVATION: Microarray data appear particularly useful to investigate mechanisms in cancer biology and represent one of the most powerful tools to uncover the genetic mechanisms causing loss of cell cycle control. Recently, several different methods to employ microarray data as a diagnostic tool in cancer classification have been proposed. These procedures take changes in the expression of particular genes into account but do not consider disruptions in certain gene interactions caused by the tumor. It is probable that some genes participating in tumor development do not change their expression level dramatically. Thus, they cannot be detected by simple classification approaches used previously. For these reasons, a classification procedure exploiting information related to changes in gene interactions is needed. RESULTS: We propose a MAximal MArgin Linear Programming (MAMA) method for the classification of tumor samples based on microarray data. This procedure detects groups of genes and constructs models (features) that strongly correlate with particular tumor types. The detected features include genes whose functional relations are changed for particular cancer types. The proposed method was tested on two publicly available datasets and demonstrated a prediction ability superior to previously employed classification schemes. AVAILABILITY: The MAMA system was developed using the linear programming system LINDO http://www.lindo.com. A Perl script that specifies the optimization problem for this software is available upon request from the authors. AU - Antonov, A.V. AU - Tetko, I.V. AU - Mader, M.T. AU - Budczies, J. AU - Mewes, H.-W. 
C1 - 2756 C2 - 22391 SP - 644-652 TI - Optimization models for cancer classification: Extracting gene interaction information from microarray expression data. JO - Bioinformatics VL - 20 IS - 5 PY - 2004 ER - TY - JOUR AB - The Maximal Margin (MAMA) linear programming classification algorithm has recently been proposed and tested for cancer classification based on expression data. It demonstrated sound performance on publicly available expression datasets. We developed a web interface to allow potential users easy access to the MAMA classification tool. Basic and advanced options provide flexibility in exploitation. The input data format is the same as that used in most publicly available datasets. This makes the web resource particularly convenient for non-expert machine learning users working in the field of expression data analysis. AU - Antonov, A.V. AU - Tetko, I.V. AU - Prokopenko, V.V.* AU - Kosykh, D. AU - Mewes, H.-W. C1 - 2757 C2 - 22392 SP - 3284-3285 TI - Web portal for classification of expression data using maximal margin linear programming. JO - Bioinformatics VL - 20 IS - 17 PB - Oxford Univ. Press PY - 2004 ER - TY - JOUR AB - SUMMARY: The Helmholtz Network for Bioinformatics (HNB) is a joint venture of eleven German bioinformatics research groups that offers convenient access to numerous bioinformatics resources through a single web portal. The 'Guided Solution Finder' which is available through the HNB portal helps users to locate the appropriate resources to answer their queries by employing a detailed, tree-like questionnaire. Furthermore, automated complex tool cascades ('tasks'), involving resources located on different servers, have been implemented, allowing users to perform comprehensive data analyses without the requirement of further manual intervention for data transfer and re-formatting. Currently, automated cascades for the analysis of regulatory DNA segments as well as for the prediction of protein functional properties are provided. 
AVAILABILITY: The HNB portal is available at http://www.hnbioinfo.de AU - Crass, T.* AU - Gailus-Durner, V. AU - Grote, K. AU - O'Keeffe, S. AU - Mewes, H.-W. AU - Mokrejs, M. AU - Schneider, R. AU - Thoppae, G. AU - Warfsmann, J. AU - Werner, T. C1 - 2993 C2 - 22407 SP - 268-270 TI - The Helmholtz network for bioinformatics: An integrative web portal for bioinformatics resources. JO - Bioinformatics VL - 20 IS - 2 PB - Oxford Univ. Press PY - 2004 ER - TY - JOUR AB - Summary: Association studies may request more details of a specific haplotype. Haplotype-specific decay of linkage disequilibrium is such a crucial and versatile characteristic. It may be used, e.g. to search for signals of natural selection in a risk haplotype. Here, we present a web-based tool to explore the relationship between population frequency and extended linkage disequilibrium measured as haplotype homozygosity of observed haplotypes within a specified candidate region. AU - Müller, J.C. AU - Andreoli, C. C1 - 10334 C2 - 21466 SP - 786-787 TI - Plotting haplotype-specific linkage disequilibrium patterns by extended haplotype homozygosity. JO - Bioinformatics VL - 19 IS - 5 PB - Oxford Univ. Press PY - 2003 ER - TY - JOUR AB - Summary: Phylogenetic Web Profiler (PWP) is a web-based service designed to perform phylogenetic profiling of proteins against genomes. The current version offers a selection of 63 completed genomes and available plasmids as annotated in the PEDANT genome database. Unlike currently available applications, this tool offers several choices of ortholog prediction parameters including E-value cutoff, percent length difference tolerance, and annotation similarity. Additional features include tight integration with the PEDANT database and tools to analyze properties of predicted proteins. PWP should prove very useful for the analysis of functional-linkage between proteins. AU - Wong, P.* AU - Kolesov, G. AU - Frishman, D. 
AU - Houry, W.A.* C1 - 22356 C2 - 21234 SP - 782-783 TI - Phylogenetic Web Profiler. JO - Bioinformatics VL - 19 IS - 6 PY - 2003 ER - TY - JOUR AB - Summary: Mitochondrial and Other Useful SEquences (MOUSE) is an integrated and comprehensive compilation of mtDNA from hypervariable regions I and II and of the low recombining nuclear loci Xq13.3 from about 11 200 humans and great apes, whose geographic and if applicable, linguistic classification is stored with their aligned sequences and publication details. The goal is to provide population geneticists and genetic epidemiologists with a comprehensive and user friendly repository of sequences and population information that is usually dispersed in a variety of other sources. AU - Burckhardt, F. C1 - 22085 C2 - 20737 SP - 890-891 TI - MOUSE (Mitochondrial and Other Useful SEquences) a compilation of population genetic markers. JO - Bioinformatics VL - 18 IS - 6 PY - 2002 ER - TY - JOUR AB - Summary: SNAPper is a network service for predicting gene function based on the conservation of gene order. AU - Kolesov, G. AU - Mewes, H.-W. AU - Frishman, D. C1 - 22364 C2 - 21242 SP - 1017-1019 TI - SNAPper : gene order predicts gene function. JO - Bioinformatics VL - 18 IS - 7 PY - 2002 ER - TY - JOUR AB - Motivation: During evolution, functional regions in genomic sequences tend to be more highly conserved than randomly mutating ‘junk DNA’ so local sequence similarity often indicates biological functionality. This fact can be used to identify functional elements in large eukaryotic DNA sequences by cross-species sequence comparison. In recent years, several gene-prediction methods have been proposed that work by comparing anonymous genomic sequences, for example from human and mouse. 
The main advantage of these methods is that they are based on simple and generally applicable measures of (local) sequence similarity; unlike standard gene-finding approaches they do not depend on species-specific training data or on the presence of cognate genes in data bases. As all comparative sequence-analysis methods, the new comparative gene-finding approaches critically rely on the quality of the underlying sequence alignments. Results: Herein, we describe a new implementation of the sequence-alignment program DIALIGN that has been developed for alignment of large genomic sequences. We compare our method to the alignment programs PipMaker, WABA and BLAST and we show that local similarities identified by these programs are highly correlated to protein-coding regions. In our test runs, PipMaker was the most sensitive method while DIALIGN was most specific. AU - Morgenstern, B. AU - Rinner, O.* AU - Abdeddaim, S.* AU - Haase, D. AU - Mayer, K.F.X. AU - Dress, A.W.M.* AU - Mewes, H.-W. C1 - 22365 C2 - 21243 SP - 777-787 TI - Exon discovery by genomic sequence alignment. JO - Bioinformatics VL - 18 IS - 6 PY - 2002 ER - TY - JOUR AB - MOTIVATION: Enormous demand for fast and accurate analysis of biological sequences is fuelled by the pace of genome analysis efforts. There is also an acute need in reliable up-to-date genomic databases integrating both functional and structural information. Here we describe the current status of the PEDANT software system for high-throughput analysis of large biological sequence sets and the genome analysis server associated with it. RESULTS: The principal features of PEDANT are: (i) completely automatic processing of data using a wide range of bioinformatics methods, (ii) manual refinement of annotation, (iii) automatic and manual assignment of gene products to a number of functional and structural categories, (iv) extensive hyperlinked protein reports, and (v) advanced DNA and protein viewers. 
The system is easily extensible and allows to include custom methods, databases, and categories with minimal or no programming effort. PEDANT is actively used as a collaborative environment to support several on-going genome sequencing projects. The main purpose of the PEDANT genome database is to quickly disseminate well-organized information on completely sequenced and unfinished genomes. It currently includes 80 genomic sequences and in many cases serves as the only source of exhaustive information on a given genome. The database also acts as a vehicle for a number of research projects in bioinformatics. Using SQL queries, it is possible to correlate a large variety of pre-computed properties of gene products encoded in complete genomes with each other and compare them with data sets of special scientific interest. In particular, the availability of structural predictions for over 300 000 genomic proteins makes PEDANT the most extensive structural genomics resource available on the web. AU - Frishman, D. AU - Albermann, K.* AU - Hani, J.* AU - Heumann, K.* AU - Metanomski, A.* AU - Zollner, A.* AU - Mewes, H.-W. C1 - 44444 C2 - 36837 SP - 44-57 TI - Functional and structural genomics using PEDANT. JO - Bioinformatics VL - 17 IS - 1 PY - 2001 ER - TY - JOUR AB - SUMMARY: The paper presents details of database construction, website installation and server architecture of the asthma and allergy gene database. AVAILABILITY: Database and server templates are available on request from the first author. SUPPLEMENTARY INFORMATION: The URL of the asthma and allergy gene database is http://cooke.gsf.de AU - Wjst, M. AU - Immervoll, T. C1 - 20928 C2 - 18973 SP - 827-828 TI - An internet linkage and mutation database for the complex phenotype asthma. JO - Bioinformatics VL - 14 IS - 9 PY - 1998 ER - TY - JOUR AB - Advanced sequencing techniques allow rapid deduction of individual amino acid sequences of highly related proteins. 
Due to their quasi-species nature, viral genomes (e.g. HIV-1) represent one of the most common sources of related proteins. Another example of related proteins are immunoglobulins. Local differences in amino acid conservation are useful indicators of potential domain structures and immunological or functional epitopes prior to structural analysis of proteins. Although variability indices can be calculated by several methods, delineation of boundaries between sequence stretches with similar variability indices is left to the user. We use scale-space filtering to delineate conserved and variable sequence stretches within a protein on an algorithmic basis, avoiding arbitrary assignments. Our method correctly identified variable regions for the human immunoglobulin λ-chain V-regions (subgroup I). Prediction of the variable regions of the HIV-1 gp120 env protein was in agreement with empirically derived definitions. These examples indicate that our method is useful for the regional assignment of protein variability solely on the basis of amino acid sequences. AU - Herrmann, G. AU - Schön, A. AU - Brack-Werner, R. AU - Werner, T. C1 - 33228 C2 - 35605 SP - 197-203 TI - CONRAD: A method for identification of variable and conserved regions within proteins by scale-space filtering. JO - Bioinformatics VL - 12 IS - 3 PY - 1996 ER -