TY - JOUR AB - BACKGROUND: Drug repurposing aims at finding new targets for already developed drugs. It becomes more relevant as the cost of discovering new drugs steadily increases. To find new potential targets for a drug, an abundance of methods and existing biomedical knowledge from different domains can be leveraged. Recently, knowledge graphs have emerged in the biomedical domain that integrate information about genes, drugs, diseases and other biological domains. Knowledge graphs can be used to predict new connections between compounds and diseases, leveraging the interconnected biomedical data around them. While real-world use cases such as drug repurposing are only interested in one specific relation type, widely used knowledge graph embedding models simultaneously optimize over all relation types in the graph. This can lead the models to underfit the data that is most relevant for the desired relation type. For example, if we want to learn embeddings to predict links between compounds and diseases but almost the entirety of relations in the graph is incident to other pairs of entity types, then the resulting embeddings are likely not optimized to predict links between compounds and diseases. We propose a method that leverages domain knowledge in the form of metapaths and uses them to filter two biomedical knowledge graphs (Hetionet and DRKG) for the purpose of improving performance on the prediction task of drug repurposing while simultaneously increasing computational efficiency. RESULTS: We find that our method reduces the number of entities by 60% on Hetionet and 26% on DRKG, while leading to an improvement in prediction performance of up to 40.8% on Hetionet and 14.2% on DRKG, with an average improvement of 20.6% on Hetionet and 8.9% on DRKG. Additionally, prioritization of antiviral compounds for SARS-CoV-2 improves after task-driven filtering is applied. 
CONCLUSION: Knowledge graphs contain facts that are counterproductive for specific tasks, in our case drug repurposing. We also demonstrate that these facts can be removed, resulting in improved performance on that task and a more efficient learning process. AU - Ratajczak, F. AU - Joblin, M.* AU - Ringsquandl, M.* AU - Hildebrandt, M.* C1 - 64516 C2 - 52245 CY - Campus, 4 Crinan St, London N1 9xw, England TI - Task-driven knowledge graph filtering improves prioritizing drugs for repurposing. JO - BMC Bioinformatics VL - 23 IS - 1 PB - Bmc PY - 2022 SN - 1471-2105 ER - TY - JOUR AB - Background: Tissues are often heterogeneous in their single-cell molecular expression, and this can govern the regulation of cell fate. For the understanding of development and disease, it is important to quantify heterogeneity in a given tissue. Results: We present the R package stochprofML which uses the maximum likelihood principle to parameterize heterogeneity from the cumulative expression of small random pools of cells. We evaluate the algorithm’s performance in simulation studies and present further application opportunities. Conclusion: Stochastic profiling outweighs the necessary demixing of mixed samples with a saving in experimental cost and effort and less measurement error. It offers possibilities for parameterizing heterogeneity, estimating underlying pool compositions and detecting differences in cell populations between samples. AU - Amrhein, L. AU - Fuchs, C. C1 - 61623 C2 - 50357 CY - Campus, 4 Crinan St, London N1 9xw, England TI - stochprofML: Stochastic profiling using maximum likelihood estimation in R. JO - BMC Bioinformatics VL - 22 IS - 1 PB - Bmc PY - 2021 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Deep learning contributes to uncovering molecular and cellular processes with highly performant algorithms. Convolutional neural networks have become the state-of-the-art tool to provide accurate and fast image data processing. 
However, published algorithms mostly solve only one specific problem and they typically require a considerable coding effort and machine learning background for their application. RESULTS: We have thus developed InstantDL, a deep learning pipeline for four common image processing tasks: semantic segmentation, instance segmentation, pixel-wise regression and classification. InstantDL enables researchers with a basic computational background to apply debugged and benchmarked state-of-the-art deep learning algorithms to their own data with minimal effort. To make the pipeline robust, we have automated and standardized workflows and extensively tested it in different scenarios. Moreover, it allows assessing the uncertainty of predictions. We have benchmarked InstantDL on seven publicly available datasets, achieving competitive performance without any parameter tuning. For customization of the pipeline to specific tasks, all code is easily accessible and well documented. CONCLUSIONS: With InstantDL, we hope to empower biomedical researchers to conduct reproducible image processing with a convenient and easy-to-use pipeline. AU - Waibel, D.J.E. AU - Shetab Boushehri, S. AU - Marr, C. C1 - 61458 C2 - 50268 CY - Campus, 4 Crinan St, London N1 9xw, England TI - InstantDL: An easy-to-use deep learning pipeline for image segmentation and classification. JO - BMC Bioinformatics VL - 22 IS - 1 PB - Bmc PY - 2021 SN - 1471-2105 ER - TY - JOUR AB - Background: Reverse engineering of gene regulatory networks from time series gene-expression data is a challenging problem, not only because of the vast sets of candidate interactions but also due to the stochastic nature of gene expression. We limit our analysis to nonlinear differential equation based inference methods. 
In order to avoid the computational cost of large-scale simulations, a two-step Gaussian process interpolation based gradient matching approach has been proposed to solve differential equations approximately. Results: We apply a gradient matching inference approach to a large number of candidate models, including parametric differential equations and their corresponding non-parametric representations, and evaluate the network inference performance under various settings for different inference objectives. We use model averaging, based on the Bayesian Information Criterion (BIC), to combine the different inferences. The performance of different inference approaches is evaluated using area under the precision-recall curves. Conclusions: We found that parametric methods can provide comparable, and often improved, inference compared to non-parametric methods; the latter, however, require no kinetic information and are computationally more efficient. AU - Dony, L. AU - He, F.* AU - Stumpf, M.P.H.* C1 - 55358 C2 - 46350 CY - Campus, 4 Crinan St, London N1 9xw, England TI - Parametric and non-parametric gradient matching for network inference: A comparison. JO - BMC Bioinformatics VL - 20 IS - 1 PB - Bmc PY - 2019 SN - 1471-2105 ER - TY - JOUR AB - Background: Although several studies have provided insights into the role of long non-coding RNAs (lncRNAs), the majority of them have unknown function. Recent evidence has shown the importance of both lncRNAs and chromatin interactions in transcriptional regulation. Although network-based methods, mainly exploiting gene-lncRNA co-expression, have been applied to characterize lncRNAs of unknown function by means of 'guilt-by-association', no strategy exists so far which identifies mRNA-lncRNA functional modules based on the 3D chromatin interaction graph. 
Results: To better understand the function of chromatin interactions in the context of lncRNA-mediated gene regulation, we have developed a multi-step graph analysis approach to examine the RNA polymerase II ChIA-PET chromatin interaction network in the K562 human cell line. We have annotated the network with gene and lncRNA coordinates, and chromatin states from the ENCODE project. We used centrality measures, as well as an adaptation of our previously developed Markov State Models (MSM) clustering method, to gain a better understanding of lncRNAs in transcriptional regulation. The novelty of our approach resides in the detection of fuzzy regulatory modules based on network properties and their optimization based on co-expression analysis between genes and gene-lncRNA pairs. This results in our method returning more bona fide regulatory modules than other state-of-the-art approaches for clustering on graphs. Conclusions: Interestingly, we find that lncRNA network hubs tend to be significantly enriched in evolutionarily conserved lncRNAs and enhancer-like functions. We validated regulatory functions for well-known lncRNAs, such as MALAT1 and the enhancer-like lncRNA FALEC. In addition, by investigating the modular structure of bigger components we mine putative regulatory functions for uncharacterized lncRNAs. AU - Thiel, D.* AU - Conrad, N.D.* AU - Ntini, E.* AU - Peschutter, R.X.* AU - Siebert, H.* AU - Marsico, A. C1 - 56192 C2 - 46886 CY - Campus, 4 Crinan St, London N1 9xw, England TI - Identifying lncRNA-mediated regulatory modules via ChIA-PET network analysis. JO - BMC Bioinformatics VL - 20 IS - 1 PB - Bmc PY - 2019 SN - 1471-2105 ER - TY - JOUR AB - Background: Genome annotation is of key importance in many research questions. The identification of protein-coding genes is often based on transcriptome sequencing data, ab-initio or homology-based prediction. 
Recently, it was demonstrated that intron position conservation improves homology-based gene prediction, and that experimental data improves ab-initio gene prediction. Results: Here, we present an extension of the gene prediction program GeMoMa that utilizes amino acid sequence conservation, intron position conservation and optionally RNA-seq data for homology-based gene prediction. We show on published benchmark data for plants, animals and fungi that GeMoMa performs better than the gene prediction programs BRAKER1, MAKER2, and CodingQuarry, and purely RNA-seq-based pipelines for transcript identification. In addition, we demonstrate that using multiple reference organisms may help to further improve the performance of GeMoMa. Finally, we apply GeMoMa to four nematode species and to the recently published barley reference genome indicating that current annotations of protein-coding genes may be refined using GeMoMa predictions. Conclusions: GeMoMa might be of great utility for annotating newly sequenced genomes but also for finding homologs of a specific gene or gene family. GeMoMa has been published under GNU GPL3 and is freely available at http://www.jstacs.de/index.php/GeMoMa. AU - Keilwagen, J.* AU - Hartung, F.* AU - Paulini, M.* AU - Twardziok, S.O. AU - Grau, J.* C1 - 53588 C2 - 44910 TI - Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi. JO - BMC Bioinformatics VL - 19 IS - 1 PY - 2018 SN - 1471-2105 ER - TY - JOUR AB - Background Genome-wide association studies allow us to understand the genetics of complex diseases. Human metabolism provides information about the disease-causing mechanisms, so it is usual to investigate the associations between genetic variants and metabolite levels. However, only considering genetic variants and their effects on one trait ignores the possible interplay between different “omics” layers. 
Existing tools only consider single-nucleotide polymorphism (SNP)–SNP interactions, and no practical tool is available for large-scale investigations of the interactions between pairs of arbitrary quantitative variables. Results We developed an R package called pulver to compute p-values for the interaction term in a very large number of linear regression models. Comparisons based on simulated data showed that pulver is much faster than the existing tools. This is achieved by using the correlation coefficient to test the null hypothesis, which avoids the costly computation of inversions. Further speed-ups come from rearranging the order of iteration through the different “omics” layers and from implementing the algorithm in the fast programming language C++. Furthermore, we applied our algorithm to data from the German KORA study to investigate a real-world problem involving the interplay among DNA methylation, genetic variants, and metabolite levels. Conclusions The pulver package is a convenient and rapid tool for screening huge numbers of linear regression models for significant interaction terms in arbitrary pairs of quantitative variables. pulver is written in R and C++, and can be downloaded freely from CRAN at https://cran.r-project.org/web/packages/pulver/. AU - Molnos, S. AU - Baumbach, C. AU - Wahl, S. AU - Müller-Nurasyid, M. AU - Strauch, K. AU - Wang-Sattler, R. AU - Waldenberger, M. AU - Meitinger, T. AU - Adamski, J. AU - Kastenmüller, G. AU - Suhre, K. AU - Peters, A. AU - Grallert, H. AU - Theis, F.J. AU - Gieger, C. C1 - 52023 C2 - 43686 CY - London TI - pulver: An R package for parallel ultra-rapid p-value computation for linear regression interaction terms. JO - BMC Bioinformatics VL - 18 IS - 1 PB - Biomed Central Ltd PY - 2017 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Networks or graphs play an important role in the biological sciences. 
Protein interaction networks and metabolic networks support the understanding of basic cellular mechanisms. In the human brain, networks of functional or structural connectivity model the information flow between cortex regions. In this context, measures of network properties are needed. We propose a new measure, Ndim, estimating the complexity of arbitrary networks. This measure is based on a fractal dimension, which is similar to recently introduced box-covering dimensions. However, box-covering dimensions are only applicable to fractal networks. The construction of these network-dimensions relies on concepts proposed to measure fractality or complexity of irregular sets in ℝ^n. RESULTS: The network measure Ndim grows with increasing network connectivity and is essentially determined by the cardinality of a maximum k-clique, where k is the characteristic path length of the network. Numerical applications to lattice-graphs and to fractal and non-fractal graph models, together with formal proofs, show that Ndim estimates a dimension of complexity for arbitrary graphs. Box-covering dimensions for fractal graphs rely on a linear log-log plot of minimum numbers of covering subgraph boxes versus the box sizes. We demonstrate the affinity between Ndim and the fractal box-covering dimensions but also that Ndim extends the concept of a fractal dimension to networks with non-linear log-log plots. Comparisons of Ndim with topological measures of complexity (cost and efficiency) show that Ndim has larger informative power. Three different methods to apply Ndim to weighted networks are finally presented and exemplified by comparisons of functional brain connectivity of healthy and depressed subjects. CONCLUSION: We introduce a new measure of complexity for networks. We show that Ndim has the properties of a dimension and overcomes several limitations of presently used topological and fractal complexity-measures. 
It allows the comparison of the complexity of networks of different types, e.g., between fractal graphs characterized by hub repulsion and small-world graphs with strong hub attraction. The large informative power and convenient CPU time for moderately sized networks may make Ndim a valuable tool for the analysis of biological networks. AU - Hahn, K.R. AU - Massopust, P. AU - Prigarin, S.M.* C1 - 47898 C2 - 39722 CY - London TI - A new method to measure complexity in binary or weighted networks and applications to functional connectivity in the human brain. JO - BMC Bioinformatics VL - 17 IS - 1 PB - Biomed Central Ltd PY - 2016 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: The underlying molecular processes representing stress responses to low-dose ionising radiation (LDIR) in mammals are just beginning to be understood. In particular, LDIR effects on the brain and their possible association with neurodegenerative disease are currently being explored using omics technologies. RESULTS: We describe a lightweight approach for the storage, analysis and distribution of relevant LDIR omics datasets. The data integration platform, called BRIDE, contains information from the literature as well as experimental information from transcriptomics and proteomics studies. It deploys a hybrid, distributed solution using both local storage and cloud technology. CONCLUSIONS: BRIDE can act as a knowledge broker for LDIR researchers, to facilitate molecular research on the systems biology of LDIR response in mammals. Its flexible design can capture a range of experimental information for genomics, epigenomics, transcriptomics, and proteomics. The data collection is available at: bride.azurewebsites.net. AU - Karapiperis, C.* AU - Kempf, S.J. AU - Quintens, R.* AU - Azimzadeh, O. AU - Vidal, V.L.* AU - Pazzaglia, S.* AU - Bazyka, D.* AU - Mastroberardino, P.G.* AU - Scouras, Z.G.* AU - Tapio, S. 
AU - Benotmane, M.A.* AU - Ouzounis, C.A.* C1 - 48599 C2 - 41200 CY - London TI - Brain Radiation Information Data Exchange (BRIDE): Integration of experimental data from low-dose ionising radiation research for pathway discovery. JO - BMC Bioinformatics VL - 17 IS - 1 PB - Biomed Central Ltd PY - 2016 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Interpreting non-targeted metabolomics data remains a challenging task. Signals from non-targeted metabolomics studies stem from a combination of biological causes, complex interactions between them and experimental bias/noise. The resulting data matrix usually contains a huge number of variables and only a few samples, and classical techniques using nonlinear mapping can lead to high computational complexity and overfitting. Independent Component Analysis (ICA) as a linear method could potentially bring more meaningful results than Principal Component Analysis (PCA). However, a major problem with most ICA algorithms is the variation of outputs between different runs, and the result of a single ICA run should be interpreted with caution. RESULTS: ICA was applied to simulated and experimental mass spectrometry (MS)-based non-targeted metabolomics data, under the hypothesis that underlying sources are mutually independent. Inspired by the Icasso algorithm, a new ICA method, MetICA, was developed to handle the instability of ICA on complex datasets. Like the original Icasso algorithm, MetICA evaluates the algorithmic and statistical reliability of ICA runs. In addition, MetICA suggests two ways to select the optimal number of model components and gives an order of interpretation for the components obtained. CONCLUSIONS: Correlating the components obtained with prior biological knowledge allows understanding how non-targeted metabolomics data reflect biological nature and technical phenomena. We could also extract mass signals related to this information. 
This novel approach provides meaningful components due to their independent nature. Furthermore, it provides an innovative concept on which to base model selection: that of optimizing the number of reliable components instead of trying to fit the data. The current version of MetICA is available at https://github.com/daniellyz/MetICA . AU - Liu, Y. AU - Smirnov, K. AU - Lucio, M. AU - Gougeon, R.D.* AU - Alexandre, H.* AU - Schmitt-Kopplin, P. C1 - 48037 C2 - 39869 CY - London TI - MetICA: Independent component analysis for high-resolution mass-spectrometry based non-targeted metabolomics. JO - BMC Bioinformatics VL - 17 IS - 1 PB - Biomed Central Ltd PY - 2016 SN - 1471-2105 ER - TY - JOUR AB - Background: The analysis of DNA methylation is a key component in the development of personalized treatment approaches. A common way to measure DNA methylation is the calculation of beta values, which are bounded variables of the form M/(M+U) that are generated by Illumina's 450k BeadChip array. The statistical analysis of beta values is considered to be challenging, as traditional methods for the analysis of bounded variables, such as M-value regression and beta regression, are based on regularity assumptions that are often too strong to adequately describe the distribution of beta values. Results: We develop a statistical model for the analysis of beta values that is derived from a bivariate gamma distribution for the signal intensities M and U. By allowing for possible correlations between M and U, the proposed model explicitly takes into account the data-generating process underlying the calculation of beta values. Using simulated data and a real sample of DNA methylation data from the Heinz Nixdorf Recall cohort study, we demonstrate that the proposed model fits our data significantly better than beta regression and M-value regression. 
Conclusion: The proposed model contributes to an improved identification of associations between beta values and covariates such as clinical variables and lifestyle factors in epigenome-wide association studies. It is as easy to apply to a sample of beta values as beta regression and M-value regression. AU - Weinhold, L.* AU - Wahl, S. AU - Pechlivanis, S.* AU - Hoffmann, P.* AU - Schmid, M.* C1 - 50055 C2 - 42183 CY - London TI - A statistical model for the analysis of beta values in DNA methylation studies. JO - BMC Bioinformatics VL - 17 PB - Biomed Central Ltd PY - 2016 SN - 1471-2105 ER - TY - JOUR AB - Background: Viruses are the most abundant and genetically diverse biological entities on Earth, yet the repertoire of viral proteins remains poorly explored. As the number of sequenced virus genomes grows into the thousands, and the number of viral proteins into the hundreds of thousands, we report a systematic computational analysis of the point of first contact between viruses and their hosts, namely viral transmembrane (TM) proteins. Results: The complement of α-helical TM proteins in double-stranded DNA viruses infecting bacteria and archaea reveals large-scale trends that differ from those of their hosts. Viruses typically encode a substantially lower fraction of TM proteins than archaea or bacteria, with the notable exception of viruses with virions containing a lipid component such as a lipid envelope, internal lipid core, or inner membrane vesicle. Compared to bacteriophages, archaeal viruses are substantially enriched in membrane proteins. However, this feature is not always stable throughout the evolution of a viral lineage; for example, TM proteins are not part of the common heritage shared between Lipothrixviridae and Rudiviridae. 
In contrast to bacteria and archaea, viruses almost completely lack proteins with complicated membrane topologies composed of more than 4 TM segments, with the few detected exceptions being obvious cases of relatively recent horizontal transfer from the host. Conclusions: The dramatic differences between the membrane proteomes of cells and viruses stem from the fact that viruses do not depend on essential membranes for energy transformation, ion homeostasis, nutrient transport and signaling. AU - Kristensen, D.M.* AU - Saeed, U. AU - Frishman, D. AU - Koonin, E.V.* C1 - 47346 C2 - 40516 TI - A census of α-helical membrane proteins in double-stranded DNA viruses infecting bacteria and archaea. JO - BMC Bioinformatics VL - 16 PY - 2015 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Biological data often originate from samples containing mixtures of subpopulations, corresponding e.g. to distinct cellular phenotypes. However, identification of distinct subpopulations may be difficult if biological measurements yield distributions that are not easily separable. RESULTS: We present Multiresolution Correlation Analysis (MCA), a method for visually identifying subpopulations based on the local pairwise correlation between covariates, without needing to define an a priori interaction scale. We demonstrate that MCA facilitates the identification of differentially regulated subpopulations in simulated data from a small gene regulatory network, followed by application to previously published single-cell qPCR data from mouse embryonic stem cells. We show that MCA recovers previously identified subpopulations, provides additional insight into the underlying correlation structure, reveals potentially spurious compartmentalizations, and provides insight into novel subpopulations. CONCLUSIONS: MCA is a useful method for the identification of subpopulations in low-dimensional expression data, as emerging from qPCR or FACS measurements. 
With MCA it is possible to investigate the robustness of covariate correlations with respect to subpopulations, graphically identify outliers, and identify factors contributing to differential regulation between pairs of covariates. MCA thus provides a framework for the investigation of expression correlations for genes of interest and for biological hypothesis generation. AU - Feigelman, J. AU - Theis, F.J. AU - Marr, C. C1 - 31757 C2 - 34719 CY - London TI - MCA: Multiresolution Correlation Analysis, a graphical tool for subpopulation identification in single-cell gene expression data. JO - BMC Bioinformatics VL - 15 IS - 1 PB - Biomed Central Ltd PY - 2014 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: With the help of epigenome-wide association studies (EWAS), increasing knowledge on the role of epigenetic mechanisms such as DNA methylation in disease processes is obtained. In addition, EWAS aid the understanding of behavioral and environmental effects on DNA methylation. In terms of statistical analysis, specific challenges arise from the characteristics of methylation data. First, methylation β-values represent proportions with skewed and heteroscedastic distributions. Thus, traditional modeling strategies assuming a normally distributed response might not be appropriate. Second, recent evidence suggests that not only mean differences but also variability in site-specific DNA methylation associates with diseases, including cancer. The purpose of this study was to compare different modeling strategies for methylation data in terms of model performance and performance of downstream hypothesis tests. Specifically, we used the generalized additive models for location, scale and shape (GAMLSS) framework to compare beta regression with Gaussian regression on raw, binary logit and arcsine square root transformed methylation data, with and without modeling a covariate effect on the scale parameter. 
RESULTS: Using simulated and real data from a large population-based study and an independent sample of cancer patients and healthy controls, we show that beta regression does not outperform competing strategies in terms of model performance. In addition, Gaussian models for location and scale showed an improved performance as compared to models for location only. The best performance was observed for the Gaussian model on binary logit transformed β-values, referred to as M-values. Our results further suggest that models for location and scale are specifically sensitive towards violations of the distribution assumption and towards outliers in the methylation data. Therefore, a resampling procedure is proposed as a mode of inference and shown to diminish type I error rate in practically relevant settings. We apply the proposed method in an EWAS of BMI and age and reveal strong associations of age with methylation variability that are validated in an independent sample. CONCLUSIONS: Models for location and scale are promising tools for EWAS that may help to understand the influence of environmental factors and disease-related phenotypes on methylation variability and its role during disease development. AU - Wahl, S. AU - Fenske, N.* AU - Zeilinger, S. AU - Suhre, K. AU - Gieger, C. AU - Waldenberger, M. AU - Grallert, H. AU - Schmid, M.* C1 - 31842 C2 - 34801 CY - London TI - On the potential of models for location and scale for genome-wide DNA methylation data. JO - BMC Bioinformatics VL - 15 PB - Biomed Central Ltd PY - 2014 SN - 1471-2105 ER - TY - JOUR AB - Background: In recent years, high-throughput microscopy has emerged as a powerful tool to analyze cellular dynamics in an unprecedentedly highly resolved manner. The amount of data that is generated, for example in long-term time-lapse microscopy experiments, requires automated methods for processing and analysis. 
Available software frameworks are well suited for high-throughput processing of fluorescence images, but they often do not perform well on bright field image data that varies considerably between laboratories, setups, and even single experiments. Results: In this contribution, we present a fully automated image processing pipeline that is able to robustly segment and analyze cells with ellipsoid morphology from bright field microscopy in a high-throughput, yet time-efficient manner. The pipeline comprises two steps: (i) Image acquisition is adjusted to obtain optimal bright field image quality for automatic processing. (ii) A concatenation of fast-performing image processing algorithms robustly identifies single cells in each image. We applied the method to a time-lapse movie consisting of ~315,000 images of differentiating hematopoietic stem cells over 6 days. We evaluated the accuracy of our method by comparing the number of identified cells with manual counts. Our method is able to segment images with varying cell density and different cell types without parameter adjustment and clearly outperforms a standard approach. By computing population doubling times, we were able to identify three growth phases in the stem cell population throughout the whole movie, and validated our result with cell cycle times from single cell tracking. Conclusions: Our method allows fully automated processing and analysis of high-throughput bright field microscopy data. The robustness of cell detection and fast computation time will support the analysis of high-content screening experiments, on-line analysis of time-lapse experiments as well as the development of methods to automatically track single-cell genealogies. AU - Buggenthin, F. AU - Marr, C. AU - Schwarzfischer, M. AU - Hoppe, P.S. AU - Hilsenbeck, O. AU - Schroeder, T. AU - Theis, F.J. C1 - 27970 C2 - 32883 TI - An automatic method for robust and fast cell detection in bright field images from high-throughput microscopy. 
JO - BMC Bioinformatics VL - 14 IS - 1 PB - Biomed Central PY - 2013 SN - 1471-2105 ER - TY - JOUR AB - Background: Diffusion is a key component of many biological processes such as chemotaxis, developmental differentiation and tissue morphogenesis. Recently, it has become possible to assess the spatial gradients caused by diffusion in vitro and in vivo using microscopy-based imaging techniques. The resulting time series of two-dimensional, high-resolution images in combination with mechanistic models enable the quantitative analysis of the underlying mechanisms. However, such a model-based analysis is still challenging due to measurement noise and sparse observations, which result in uncertainties of the model parameters. Methods: We introduce a likelihood function for image-based measurements with log-normally distributed noise. Based upon this likelihood function we formulate the maximum likelihood estimation problem, which is solved using PDE-constrained optimization methods. To assess the uncertainty and practical identifiability of the parameters we introduce profile likelihoods for diffusion processes. Results and conclusion: As proof of concept, we model certain aspects of the guidance of dendritic cells towards lymphatic vessels, an example of haptotaxis. Using a realistic set of artificial measurement data, we estimate the five kinetic parameters of this model and compute profile likelihoods. Our novel approach for the estimation of model parameters from image data as well as the proposed identifiability analysis approach is widely applicable to diffusion processes. The profile likelihood based method provides more rigorous uncertainty bounds in contrast to local approximation methods. AU - Hock, S. AU - Hasenauer, J. AU - Theis, F.J. C1 - 27731 C2 - 32823 TI - Modeling of 2D diffusion processes based on microscopy data: Parameter estimation and practical identifiability analysis. 
JO - BMC Bioinformatics VL - 14 IS - 10 PB - Biomed Central PY - 2013 SN - 1471-2105 ER - TY - JOUR AB - Background: Mathematical models are nowadays widely used to describe biochemical reaction networks. One of the main reasons for this is that models facilitate the integration of a multitude of different data and data types using parameter estimation. Thereby, models allow for a holistic understanding of biological processes. However, due to measurement noise and the limited amount of data, uncertainties in the model parameters should be considered when conclusions are drawn from estimated model attributes, such as reaction fluxes or transient dynamics of biological species. Methods and results: We developed the visual analytics system iVUN that supports uncertainty-aware analysis of static and dynamic attributes of biochemical reaction networks modeled by ordinary differential equations. The multivariate graph of the network is visualized as a node-link diagram, and statistics of the attributes are mapped to the color of nodes and links of the graph. In addition, the graph view is linked with several views, such as line plots, scatter plots, and correlation matrices, to support locating uncertainties and the analysis of their time dependencies. As demonstration, we use iVUN to quantitatively analyze the dynamics of a model for Epo-induced JAK2/STAT5 signaling. Conclusion: Our case study showed that iVUN can be used to perform an in-depth study of biochemical reaction networks, including attribute uncertainties, correlations between these attributes and their uncertainties as well as the attribute dynamics. In particular, the linking of different visualization options turned out to be highly beneficial for the complex analysis tasks that come with the biological systems as presented here. AU - Vehlow, C.* AU - Hasenauer, J. AU - Krämer, A.* AU - Raue, A. AU - Hug, S. AU - Timmer, J.* AU - Radde, N.* AU - Theis, F.J. 
AU - Weiskopf, D.* C1 - 28843 C2 - 32434 TI - iVUN: Interactive Visualization of Uncertain biochemical reaction Networks. JO - BMC Bioinformatics VL - 14 PB - Biomed Central Ltd PY - 2013 SN - 1471-2105 ER - TY - JOUR AB - ABSTRACT: BACKGROUND: Genome-wide association studies (GWAS) with metabolic traits and metabolome-wide association studies (MWAS) with traits of biomedical relevance are powerful tools to identify the contribution of genetic, environmental and lifestyle factors to the etiology of complex diseases. Hypothesis-free testing of ratios between all possible metabolite pairs in GWAS and MWAS has proven to be an innovative approach in the discovery of new biologically meaningful associations. The p-gain statistic was introduced as an ad-hoc measure to determine whether a ratio between two metabolite concentrations carries more information than the two corresponding metabolite concentrations alone. So far, only a rule of thumb was applied to determine the significance of the p-gain. RESULTS: Here we explore the statistical properties of the p-gain through simulation of its density and by sampling of experimental data. We derive critical values of the p-gain for different levels of correlation between metabolite pairs and show that B/(2*alpha) is a conservative critical value for the p-gain, where alpha is the level of significance and B the number of tested metabolite pairs. CONCLUSIONS: We show that the p-gain is a well defined measure that can be used to identify statistically significant metabolite ratios in association studies and provide a conservative significance cut-off for the p-gain for use in future association studies with metabolic traits. AU - Petersen, A.-K. AU - Krumsiek, J. AU - Wägele, B. AU - Theis, F.J. AU - Wichmann, H.-E. AU - Gieger, C. AU - Suhre, K. C1 - 10478 C2 - 30220 TI - On the hypothesis-free testing of metabolite ratios in genome-wide and metabolome-wide association studies. 
JO - BMC Bioinformatics VL - 13 IS - 1 PB - BioMed Central PY - 2012 SN - 1471-2105 ER - TY - JOUR AB - Background: An increasing number of genomic studies interrogating more than one molecular level are being published. Bioinformatics follows biological practice, and recent years have seen a surge in methodology for the integrative analysis of genomic data. Often such analyses require knowledge of which elements of one platform link to those of another. Although important, many integrative analyses do not, or only insufficiently, detail the matching of the platforms. Results: We describe, illustrate and discuss six matching procedures. They are implemented in the R-package sigaR (available from Bioconductor). The principles underlying the presented matching procedures are generic, and can be combined to form new matching approaches or be applied to the matching of other platforms. Illustration of the matching procedures on a variety of data sets reveals how the procedures differ in the use of the available data, and may even lead to different results for individual genes. Conclusions: Matching of data from multiple genomics platforms is an important preprocessing step for many integrative bioinformatic analyses, for which we present six generic procedures, both old and new. They have been implemented in the R-package sigaR, available from Bioconductor. AU - van Wieringen, W.N.* AU - Unger, K. AU - Leday, G.G.R.* AU - Krijgsman, O.* AU - de Menezes, R.X.* AU - Ylstra, B.* AU - van de Wiel, M.A.* C1 - 10893 C2 - 30437 TI - Matching of array CGH and gene expression microarray features for the purpose of integrative genomic analyses. JO - BMC Bioinformatics VL - 13 IS - 1 PB - Biomed Central Ltd PY - 2012 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Genome-wide association studies (GWAS) based on single nucleotide polymorphisms (SNPs) revolutionized our perception of the genetic regulation of complex traits and diseases. 
Copy number variations (CNVs) promise to shed additional light on the genetic basis of monogenic as well as complex diseases and phenotypes. Indeed, the number of detected associations between CNVs and certain phenotypes is constantly increasing. However, while several software packages support the determination of CNVs from SNP chip data, the downstream statistical inference of CNV-phenotype associations is still subject to complicated and inefficient in-house solutions, thus strongly limiting the performance of GWAS based on CNVs. RESULTS: CONAN is a freely available client-server software solution which provides an intuitive graphical user interface for categorizing, analyzing and associating CNVs with phenotypes. Moreover, CONAN assists the evaluation process by visualizing detected associations via Manhattan plots in order to enable a rapid identification of genome-wide significant CNV regions. Various file formats including the information on CNVs in population samples are supported as input data. CONCLUSIONS: CONAN facilitates the performance of GWAS based on CNVs and the visual analysis of calculated results. CONAN provides a rapid, valid and straightforward software solution to identify genetic variation underlying the 'missing' heritability for complex traits that remains unexplained by recent GWAS. The freely available software can be downloaded at http://genepi-conan.i-med.ac.at. AU - Forer, L.* AU - Schönherr, S.* AU - Weissensteiner, H.* AU - Haider, F.* AU - Kluckner, T.* AU - Gieger, C. AU - Wichmann, H.-E. AU - Specht, G.* AU - Kronenberg, F.* AU - Kloss-Brandstätter, A.* C1 - 2040 C2 - 27381 TI - CONAN: Copy number variation analysis software for genome-wide association studies. JO - BMC Bioinformatics VL - 11 PB - BioMed Central Ltd. PY - 2010 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Extensive and automated data integration in bioinformatics facilitates the construction of large, complex biological networks. 
However, the challenge lies in the interpretation of these networks. While most research focuses on the unipartite or bipartite case, we address the more general but common situation of k-partite graphs. These graphs contain k different node types and links are only allowed between nodes of different types. In order to reveal their structural organization and describe the contained information in a more coarse-grained fashion, we ask how to detect clusters within each node type. RESULTS: Since entities in biological networks regularly have more than one function and hence participate in more than one cluster, we developed a k-partite graph partitioning algorithm that allows for overlapping (fuzzy) clusters. It determines for each node a degree of membership to each cluster. Moreover, the algorithm estimates a weighted k-partite graph that connects the extracted clusters. Our method is fast and efficient, mimicking the multiplicative update rules commonly employed in algorithms for non-negative matrix factorization. It facilitates the decomposition of networks on a chosen scale and therefore allows for analysis and interpretation of structures on various resolution levels. Applying our algorithm to a tripartite disease-gene-protein complex network, we were able to structure this graph on a large scale into clusters that are functionally correlated and biologically meaningful. Locally, smaller clusters enabled reclassification or annotation of the clusters' elements. We exemplified this for the transcription factor MECP2. CONCLUSIONS: In order to cope with the overwhelming amount of information available from biomedical literature, we need to tackle the challenge of finding structures in large networks with nodes of multiple types. To this end, we presented a novel fuzzy k-partite graph partitioning algorithm that allows the decomposition of these objects in a comprehensive fashion. We validated our approach both on artificial and real-world data. 
It is readily applicable to any further problem. AU - Hartsperger, M.L. AU - Blöchl, F. AU - Stuempflen, V. AU - Theis, F.J. C1 - 4678 C2 - 27219 TI - Structuring heterogeneous biological information using fuzzy clustering of k-partite graphs. JO - BMC Bioinformatics VL - 11 PB - Biomed Central Ltd PY - 2010 SN - 1471-2105 ER - TY - JOUR AB - External stimulation of cells by hormones, cytokines or growth factors activates signal transduction pathways that subsequently induce a re-arrangement of cellular gene expression. The analysis of such changes is complicated, as they consist of multi-layered temporal responses. While classical analyses based on clustering or gene set enrichment only partly reveal this information, matrix factorization techniques are well suited for a detailed temporal analysis. In signal processing, factorization techniques incorporating data properties like spatial and temporal correlation structure have been shown to be robust and computationally efficient. However, such correlation-based methods have so far not been applied in bioinformatics, because large-scale biological data rarely imply a natural order that allows the definition of a delayed correlation function. We therefore develop the concept of graph-decorrelation. We encode prior knowledge like transcriptional regulation, protein interactions or metabolic pathways in a weighted directed graph. By linking features along this underlying graph, we introduce a partial ordering of the features (e.g. genes) and are thus able to define a graph-delayed correlation function. Using this framework as a constraint on the matrix factorization task allows us to set up the fast and robust graph-decorrelation algorithm (GraDe). To analyze alterations in the gene response in IL-6 stimulated primary mouse hepatocytes, we performed a time-course microarray experiment and applied GraDe. 
In contrast to standard techniques, the extracted time-resolved gene expression profiles showed that IL-6 activates genes involved in cell cycle progression and cell division. Genes linked to metabolic and apoptotic processes are down-regulated indicating that IL-6 mediated priming renders hepatocytes more responsive towards cell proliferation and reduces expenditures for the energy metabolism. GraDe provides a novel framework for the decomposition of large-scale 'omics' data. We were able to show that including prior knowledge into the separation task leads to a much more structured and detailed separation of the time-dependent responses upon IL-6 stimulation compared to standard methods. A Matlab implementation of the GraDe algorithm is freely available at http://cmb.helmholtz-muenchen.de/grade. AU - Kowarsch, A. AU - Blöchl, F. AU - Bohl, S.* AU - Saile, M.* AU - Gretz, N.* AU - Klingmüller, U.* AU - Theis, F.J. C1 - 1801 C2 - 28102 TI - Knowledge-based matrix factorization temporally resolves the cellular responses to IL-6 stimulation. JO - BMC Bioinformatics VL - 11 PB - BioMed Central Ltd. PY - 2010 SN - 1471-2105 ER - TY - JOUR AB - Phenomenological information about regulatory interactions is frequently available and can be readily converted to Boolean models. Fully quantitative models, on the other hand, provide detailed insights into the precise dynamics of the underlying system. In order to connect discrete and continuous modeling approaches, methods for the conversion of Boolean systems into systems of ordinary differential equations have been developed recently. As biological interaction networks have steadily grown in size and complexity, a fully automated framework for the conversion process is desirable. We present Odefy, a MATLAB- and Octave-compatible toolbox for the automated transformation of Boolean models into systems of ordinary differential equations. 
Models can be created from sets of Boolean equations or graph representations of Boolean networks. Alternatively, the user can import Boolean models from the CellNetAnalyzer toolbox, GINSim and the PBN toolbox. The Boolean models are transformed to systems of ordinary differential equations by multivariate polynomial interpolation and optional application of sigmoidal Hill functions. Our toolbox contains basic simulation and visualization functionalities for both the Boolean and the continuous models. For further analyses, models can be exported to SQUAD, GNA, MATLAB script files, the SB toolbox, SBML and R script files. Odefy contains a user-friendly graphical user interface for convenient access to the simulation and exporting functionalities. We illustrate the validity of our transformation approach as well as the usage and benefit of the Odefy toolbox for two biological systems: a mutual inhibitory switch known from stem cell differentiation and a regulatory network giving rise to a specific spatial expression pattern at the mid-hindbrain boundary. Odefy provides an easy-to-use toolbox for the automatic conversion of Boolean models to systems of ordinary differential equations. It can be efficiently connected to a variety of input and output formats for further analysis and investigations. The toolbox is open-source and can be downloaded at http://cmb.helmholtz-muenchen.de/odefy. AU - Krumsiek, J. AU - Pölsterl, S. AU - Wittmann, D.M. AU - Theis, F.J. C1 - 3037 C2 - 28104 SP - 1-10 TI - Odefy - from discrete to continuous models. JO - BMC Bioinformatics VL - 11 PB - BioMed Central Ltd. PY - 2010 SN - 1471-2105 ER - TY - JOUR AB - Background: Virtually all currently available microRNA target site prediction algorithms require the presence of a (conserved) seed match to the 5' end of the microRNA. Recently, however, it has been shown that this requirement might be too stringent, leading to a substantial number of missed target sites. 
Results: We developed TargetSpy, a novel computational approach for predicting target sites regardless of the presence of a seed match. It is based on machine learning and automatic feature selection using a wide spectrum of compositional, structural, and base pairing features covering current biological knowledge. Our model does not rely on evolutionary conservation, which allows the detection of species-specific interactions and makes TargetSpy suitable for analyzing unconserved genomic sequences. In order to allow for an unbiased comparison of TargetSpy to other methods, we classified all algorithms into three groups: I) no seed match requirement, II) seed match requirement, and III) conserved seed match requirement. TargetSpy predictions for classes II and III are generated by appropriate postfiltering. On a human dataset revealing fold-change in protein production for five selected microRNAs, our method shows superior performance in all classes. In Drosophila melanogaster, not only are our class II and III predictions on par with other algorithms, but notably the class I (no-seed) predictions are just marginally less accurate. We estimate that TargetSpy predicts between 26 and 112 functional target sites without a seed match per microRNA that are missed by all other currently available algorithms. Conclusion: Only a few algorithms can predict target sites without demanding a seed match, and TargetSpy demonstrates a substantial improvement in prediction accuracy in that class. Furthermore, when conservation and the presence of a seed match are required, the performance is comparable with state-of-the-art algorithms. TargetSpy was trained on mouse and performs well in human and Drosophila, suggesting that it may be applicable to a broad range of species. 
Moreover, we have demonstrated that the application of machine learning techniques in combination with upcoming deep sequencing data results in a powerful microRNA target site prediction tool, available at http://www.targetspy.org. AU - Sturm, M. AU - Hackenberg, M.* AU - Langenberger, D.* AU - Frishman, D. C1 - 2946 C2 - 28117 TI - TargetSpy: A supervised machine learning approach for microRNA target prediction. JO - BMC Bioinformatics VL - 11 PB - Biomed Central Ltd. PY - 2010 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Genes show different sensitivities in expression corresponding to various biological conditions. Systematic study of this concept is required because of its important implications in areas such as microarray analysis. J.H. Ohn et al. first studied this gene property with yeast transcriptional profiling data. RESULTS: Here we propose a calculation framework for gene expression sensitivity analysis. We also compared the functions, centralities and transcriptional regulations of the sensitive and robust genes. We found that the robust genes tended to be involved in essential cellular processes. In contrast, the sensitive genes perform diverse functions. Moreover, while genes from both groups show similar geometric centrality when mapped onto integrated protein networks, the robust genes have higher vertex degree and betweenness than the sensitive genes. Interestingly, unlike the sensitive genes, the robust genes shared fewer transcription factors as their regulators. CONCLUSION: Our study reveals different propensities of gene expression to external perturbations, demonstrates different roles of sensitive genes and robust genes in the cell and proposes the necessity of incorporating gene expression sensitivity into microarray analysis. AU - Hao, P.* AU - Zheng, S.* AU - Ping, J.* AU - Tu, K.* AU - Gieger, C. AU - Wang-Sattler, R. 
AU - Zhong, Y.* AU - Li, Y.* C1 - 82 C2 - 26384 TI - Human gene expression sensitivity according to large scale meta-analysis. JO - BMC Bioinformatics VL - 10 IS - SUPPL. 1 PB - BioMed Central PY - 2009 SN - 1471-2105 ER - TY - JOUR AB - Large-scale, comprehensive and standardized high-throughput mouse phenotyping has been established as a tool of functional genome research by the German Mouse Clinic and others. In all these projects, vast amounts of data are continuously generated and need to be stored, prepared for data-mining procedures and eventually be made publicly available. Thus, central storage and integrated management of mouse phenotype data, genotype data, metadata and linked external data are highly important. Requirements most probably depend on the individual mouse housing unit or project and the demand for either very specific individual database solutions or very flexible solutions that can be easily adapted to local demands. Not every group has the resources and/or the know-how to develop software for this purpose. A database application has been developed for the German Mouse Clinic in order to meet all requirements mentioned above. RESULTS: We present MausDB, the German Mouse Clinic web-based database application that integrates standard mouse colony management, phenotyping workflow scheduling features and mouse phenotyping result data management. It links mouse phenotype data with genotype data, metadata and external data such as public web databases, which is a prerequisite for comprehensive data analysis and mining. We describe how this can be achieved with a lean and user-friendly system built on open standards. CONCLUSION: MausDB is suited for large-scale, high-throughput phenotyping facilities but can also be used exclusively for mouse colony management within smaller units or projects. The system is successfully used as the primary mouse and data management tool of the German Mouse Clinic and other mouse facilities. 
We offer MausDB to the scientific community as open source software to provide a system for storage of data from functional genomics projects in a well-structured, easily accessible form. AU - Maier, H. AU - Lengger, C. AU - Simic, B.* AU - Fuchs, H. AU - Gailus-Durner, V. AU - Hrabě de Angelis, M. C1 - 2006 C2 - 25197 TI - MausDB: An open source application for phenotype data and mouse colony management in large-scale mouse phenotyping projects. JO - BMC Bioinformatics VL - 9 PB - BioMed Central PY - 2008 SN - 1471-2105 ER - TY - JOUR AB - Unsupervised annotation of proteins by software pipelines suffers from very high error rates. Spurious functional assignments are usually caused by unwarranted homology-based transfer of information from existing database entries to the new target sequences. We have previously demonstrated that data mining in large sequence annotation databanks can help identify annotation items that are strongly associated with each other, and that exceptions from strong positive association rules often point to potential annotation errors. Here we investigate the applicability of negative association rule mining to revealing erroneously assigned annotation items. RESULTS: Almost all exceptions from strong negative association rules are connected to at least one wrong attribute in the feature combination making up the rule. The fraction of annotation features flagged by this approach as suspicious is strongly enriched in errors and constitutes about 0.6% of the whole body of the similarity-transferred annotation in the PEDANT genome database. Positive rule mining does not identify two thirds of these errors. The approach based on exceptions from negative rules is much more specific than positive rule mining, but its coverage is significantly lower. CONCLUSION: Mining of both negative and positive association rules is a potent tool for finding significant trends in protein annotation and flagging doubtful features for further inspection. 
AU - Artamonova, I.I. AU - Frishman, G. AU - Frishman, D. C1 - 4144 C2 - 24906 TI - Applying negative rule mining to improve genome annotation. JO - BMC Bioinformatics VL - 8 PB - Biomed Central PY - 2007 SN - 1471-2105 ER - TY - JOUR AB - Apollo, a genome annotation viewer and editor, has become a widely used genome annotation and visualization tool for distributed genome annotation projects. When using Apollo for annotation, database updates are carried out by uploading intermediate annotation files into the respective database. This non-direct database upload is laborious and evokes problems of data synchronicity. RESULTS: To overcome these limitations we extended the Apollo data adapter with a generic, configurable web service client that is able to retrieve annotation data in a GAME-XML-formatted string and pass it on to Apollo's internal input routine. CONCLUSION: This Apollo web service adapter, Apollo2Go, simplifies the data exchange in distributed projects and aims to render the annotation process more comfortable. AU - Klee, K. AU - Ernst, R. AU - Spannagl, M. AU - Mayer, K.F.X. C1 - 5477 C2 - 24690 TI - Apollo2Go: A web service adapter for the Apollo genome viewer to enable distributed genome annotation. JO - BMC Bioinformatics VL - 8 PB - BioMed Central PY - 2007 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Alternative splicing is a major mechanism of generating protein diversity in higher eukaryotes. Although at least half, and probably more, of mammalian genes are alternatively spliced, it was not clear, whether the frequency of alternative splicing is the same in different functional categories. The problem is obscured by uneven coverage of genes by ESTs and a large number of artifacts in the EST data. 
RESULTS: We have developed a method that generates possible mRNA isoforms for human genes contained in the EDAS database, taking into account the effects of nonsense-mediated decay and translation initiation rules, and a procedure for offsetting the effects of uneven EST coverage. Then we computed the number of mRNA isoforms for genes from different functional categories. Genes encoding ribosomal proteins and genes in the category "Small GTPase-mediated signal transduction" tend to have fewer isoforms than the average, whereas the genes in the category "DNA replication and chromosome cycle" have more isoforms than the average. Genes encoding proteins involved in protein-protein interactions tend to be alternatively spliced more often than genes encoding non-interacting proteins, although there is no significant difference in the number of isoforms of alternatively spliced genes. CONCLUSION: Filtering for functional isoforms satisfying biological constraints and accounting for uneven EST coverage allowed us to describe differences in alternative splicing of genes from different functional categories. The observations seem to be consistent with expectations based on current biological knowledge: fewer isoforms for ribosomal and signal transduction proteins, and more alternative splicing of interacting and cell cycle proteins. AU - Neverov, A.D.* AU - Artamonova, I.I. AU - Nurtdinov, R.N.* AU - Frishman, D. AU - Gelfand, M.S.* AU - Mironov, A.A.* C1 - 5526 C2 - 23349 TI - Alternative splicing and protein function. JO - BMC Bioinformatics VL - 6 PB - BioMed Central PY - 2005 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: Detection of sequence homologues represents a challenging task that is important for the discovery of protein families and the reliable application of automatic annotation methods. 
The presence of domains in protein families of diverse function, inhomogeneity and different sizes of protein families create considerable difficulties for the application of published clustering methods. RESULTS: Our work analyses the Super Paramagnetic Clustering (SPC) algorithm and its extension, global SPC (gSPC). These algorithms cluster input data based on a method that is analogous to the treatment of an inhomogeneous ferromagnet in physics. For the SwissProt and SCOP databases we show that the gSPC improves the specificity and sensitivity of clustering over the original SPC and Markov Cluster algorithm (TRIBE-MCL) by up to 30%. The three algorithms provided similar results for the MIPS FunCat 1.3 annotation of four bacterial genomes, Bacillus subtilis, Helicobacter pylori, Listeria innocua and Listeria monocytogenes. However, the gSPC covered about 12% more sequences compared to the other methods. The SPC algorithm was programmed in-house using C++ and it is available at http://mips.gsf.de/proj/spc. The FunCat annotation is available at http://mips.gsf.de. CONCLUSION: The gSPC clustered with higher accuracy or covered a larger number of sequences than the TRIBE-MCL algorithm. Thus it is a useful approach for automatic detection of protein families and unsupervised annotation of full genomes. AU - Tetko, I.V. AU - Facius, A. AU - Ruepp, A. AU - Mewes, H.-W. C1 - 5525 C2 - 23348 TI - Super paramagnetic clustering of protein sequences. JO - BMC Bioinformatics VL - 6 PB - Biomed Central Ltd PY - 2005 SN - 1471-2105 ER - TY - JOUR AB - BACKGROUND: The massive amount of SNP data stored at public internet sites provides unprecedented access to human genetic variation. Selecting target SNPs for disease-gene association studies is currently done more or less randomly as decision rules for the selection of functionally relevant SNPs are not available. 
RESULTS: We implemented a computational pipeline that retrieves the genomic sequence of target genes, collects information about sequence variation and selects functional motifs containing SNPs. Motifs being considered are gene promoter, exon-intron structure, AU-rich mRNA elements, transcription factor binding motifs, cryptic and enhancer splice sites together with expression in target tissue. As a case study, 396 genes on chromosome 6p21 in the extended HLA region were selected that contributed nearly 20,000 SNPs. By computer annotation ~2,500 SNPs in functional motifs could be identified. Most of these SNPs are disrupting transcription factor binding sites but only those introducing new sites had a significant depressing effect on SNP allele frequency. Other decision rules concern position within motifs, the validity of SNP database entries, the unique occurrence in the genome and conserved sequence context in other mammalian genomes. CONCLUSION: Only 10% of all gene-based SNPs have sequence-predicted functional relevance making them a primary target for genotyping in association studies. AU - Wjst, M. C1 - 4632 C2 - 22302 TI - Target SNP selection in complex disease association studies. JO - BMC Bioinformatics VL - 5 PY - 2004 SN - 1471-2105 ER -