TY - JOUR AB - The efficiency of machine learning (ML) models is crucial to minimize inference times and reduce the carbon footprints of models deployed in production environments. Current models employed in retrosynthesis to generate a synthesis route from a target molecule to purchasable compounds are prohibitively slow. The model operates in a single-step fashion in a tree search algorithm by predicting reactant molecules given a product molecule as input. In this study, we investigate the ability of alternative transformer architectures, knowledge distillation (KD), and simple hyper-parameter optimization to decrease inference times of the Chemformer model. Initially, we assess the ability of closely related transformer architectures and conclude that these models under-performed when using KD. Additionally, we investigate the effects of feature-based and response-based KD together with hyper-parameters optimized based on inference sample time and model accuracy. We find that although reducing model size and improving single-step speed are important, our results indicate that multi-step search efficiency is more significantly influenced by the diversity and confidence of single-step models. Based on this work, further research should use KD in combination with other techniques, as multi-step speed continues to prevent proper integration of synthesis planning. However, in Monte Carlo-based (MC) multi-step retrosynthesis, other factors play a crucial role in balancing exploration and exploitation during the search process, often outweighing the direct impact of single-step model speed and carbon footprints. AU - Hartog, P. AU - Westerlund, A.M.* AU - Tetko, I.V. AU - Genheden, S.* C1 - 73237 C2 - 56967 CY - 1155 16th St, Nw, Washington, Dc 20036 Usa SP - 1771-1781 TI - Investigations into the efficiency of computer-aided synthesis planning. JO - J. Chem. Inf. Model. 
VL - 65 IS - 4 PB - Amer Chemical Soc PY - 2025 SN - 0021-9576 ER - TY - JOUR AB - Machine Learning (ML) techniques face significant challenges when predicting advanced chemical properties, such as yield, feasibility of chemical synthesis, and optimal reaction conditions. These challenges stem from the high-dimensional nature of the prediction task and the myriad essential variables involved, ranging from reactants and reagents to catalysts, temperature, and purification processes. Successfully developing a reliable predictive model not only holds the potential for optimizing high-throughput experiments but can also elevate existing retrosynthetic predictive approaches and bolster a plethora of applications within the field. In this review, we systematically evaluate the efficacy of current ML methodologies in chemoinformatics, shedding light on their milestones and inherent limitations. Additionally, a detailed examination of a representative case study provides insights into the prevailing issues related to data availability and transferability in the discipline. AU - Voinarovska, V.* AU - Kabeshov, M.* AU - Dudenko, D.* AU - Genheden, S.* AU - Tetko, I.V. C1 - 69060 C2 - 53837 CY - 1155 16th St, Nw, Washington, Dc 20036 Usa SP - 42-56 TI - When yield prediction does not yield prediction: An overview of the current challenges. JO - J. Chem. Inf. Model. VL - 64 IS - 1 PB - Amer Chemical Soc PY - 2024 SN - 0021-9576 ER - TY - JOUR AB - We introduce ULYSSES, a user-friendly and robust C++ library for semiempirical quantum chemical calculations. In its current version, ULYSSES is equipped with a large set of different semiempirical models, most of which are based on the Neglect of Diatomic Differential Overlap (NDDO) approximation. Empirical corrections for dispersion and hydrogen bonding are available for most methods, so that higher quality is achieved in the calculation of energies of nonbonded complexes. 
The library is furthermore equipped with geometry optimization, as well as modules for calculating molecular properties of general interest. Ideal gas thermodynamics is available and allows single structure as well as conformer (multistructure) averaged properties to be calculated. We offer the possibility to use several vibrational partition functions according to the nature of interactions being studied: for covalent systems, the traditional harmonic oscillator approximation is available; for nonbonded complexes, we systematically extended the partition function proposed by Grimme for all thermodynamic functions. The library is also capable of running Born-Oppenheimer molecular dynamics. AU - Cardoso Micu Menezes, F.M. AU - Popowicz, G.M. C1 - 66025 C2 - 53058 SP - 3685-3694 TI - ULYSSES: An efficient and easy to use semiempirical library for C++. JO - J. Chem. Inf. Model. VL - 62 IS - 16 PY - 2022 SN - 0021-9576 ER - TY - JOUR AB - African and American trypanosomiases are estimated to affect several million people across the world, with effective treatments distinctly lacking. New, ideally oral, treatments with higher efficacy against these diseases are desperately needed. Peroxisomal import matrix (PEX) proteins represent a very interesting target for structure- and ligand-based drug design. The PEX5-PEX14 protein-protein interface in particular has been highlighted as a target, with inhibitors shown to disrupt essential cell processes in trypanosomes, leading to cell death. In this work, we present a drug development campaign that utilizes the synergy between structural biology, computer-aided drug design, and medicinal chemistry in the quest to discover and develop new potential compounds to treat trypanosomiasis by targeting the PEX14-PEX5 interaction. Using the structure of the known lead compounds discovered by Dawidowski et al. 
as the template for a chemically advanced template search (CATS) algorithm, we performed scaffold-hopping to obtain a new class of compounds with trypanocidal activity, based on 2,3,4,5-tetrahydrobenzo[f][1,4]oxazepines chemistry. The initial compounds obtained were taken forward to a first round of hit-to-lead optimization by synthesis of derivatives, which show activities in the range of low- to high-digit micromolar IC50 in the in vitro tests. The NMR measurements confirm binding to PEX14 in solution, while immunofluorescent microscopy indicates disruption of protein import into the glycosomes, indicating that the PEX14-PEX5 protein-protein interface was successfully disrupted. These studies result in development of a novel scaffold for future lead optimization, while ADME testing gives an indication of further areas of improvement in the path from lead molecules toward a new drug active against trypanosomes. AU - Fino, R. AU - Lenhart, D. AU - Kalel, V.C.* AU - Softley, C. AU - Napolitano, V. AU - Byrne, R.* AU - Schliebs, W.* AU - Dawidowski, M.* AU - Erdmann, R.* AU - Sattler, M. AU - Schneider, G.* AU - Plettenburg, O. AU - Popowicz, G.M. C1 - 63182 C2 - 51404 CY - 1155 16th St, Nw, Washington, Dc 20036 Usa SP - 5256-5268 TI - Computer-aided design and synthesis of a new class of PEX14 inhibitors: substituted 2,3,4,5-tetrahydrobenzo[F][1,4]oxazepines as potential new trypanocidal agents. JO - J. Chem. Inf. Model. VL - 61 IS - 10 PB - Amer Chemical Soc PY - 2021 SN - 0021-9576 ER - TY - JOUR AU - Tetko, I.V. AU - Tropsha, A.* C1 - 58717 C2 - 48279 SP - 1069-1071 TI - Joint virtual special issue on computational toxicology. JO - J. Chem. Inf. Model. VL - 60 IS - 3 PY - 2020 SN - 0021-9576 ER - TY - JOUR AB - Acute toxicity is one of the most challenging properties to predict purely with computational methods due to its direct relationship to biological interactions. 
Moreover, toxicity can be represented by different end points: it can be measured for different species using different types of administration, etc., and it is questionable if the knowledge transfer between end points is possible. We performed a comparative study of prediction multitask toxicity for a broad chemical space using different descriptors and modeling algorithms and applied multitask learning for a large toxicity data set extracted from the Registry of Toxic Effects of Chemical Substances (RTECS). We demonstrated that multitask modeling provides significant improvement over single-output models and other machine learning methods. Our research reveals that multitask learning can be very useful to improve the quality of acute toxicity modeling and raises a discussion about the usage of multitask approaches for regulation purposes. AU - Sosnin, S.* AU - Karlov, D.* AU - Tetko, I.V. AU - Fedorov, M.V.* C1 - 55347 C2 - 46110 CY - 1155 16th St, Nw, Washington, Dc 20036 Usa SP - 1062-1072 TI - Comparative study of multitask toxicity modeling on a broad chemical space. JO - J. Chem. Inf. Model. VL - 59 IS - 3 PB - Amer Chemical Soc PY - 2019 SN - 0021-9576 ER - TY - JOUR AB - Firefly luciferase is an enzyme that has found ubiquitous use in biological assays in high-throughput screening (HTS) campaigns. The inhibition of luciferase in such assays could lead to a false positive result. This issue has been known for a long time, and there have been significant efforts to identify luciferase inhibitors in order to enhance recognition of false positives in screening assays. However, although a large amount of publicly accessible luciferase counterscreen data is available, to date little effort has been devoted to building a chemoinformatic model that can identify such molecules in a given data set. 
In this study we developed models to identify these molecules using various methods, such as molecular docking, SMARTS screening, pharmacophores, and machine learning methods. Among the structure-based methods, the pharmacophore-based method showed promising results, with a balanced accuracy of 74.2%. However, machine-learning approaches using associative neural networks outperformed all of the other methods explored, producing a final model with a balanced accuracy of 89.7%. The high predictive accuracy of this model is expected to be useful for advising which compounds are potential luciferase inhibitors present in luciferase HTS assays. The models developed in this work are freely available at the OCHEM platform at http://ochem.eu. AU - Ghosh, D. AU - Koch, U.* AU - Hadian, K. AU - Sattler, M.* AU - Tetko, I.V. C1 - 53432 C2 - 44799 SP - 933-942 TI - Luciferase advisor: High-accuracy model to flag false positive hits in luciferase HTS assays. JO - J. Chem. Inf. Model. VL - 58 IS - 5 PY - 2018 SN - 0021-9576 ER - TY - JOUR AB - The CD154-CD40 receptor complex plays a pivotal role in several inflammatory pathways. Attempts to inhibit the formation of this complex have resulted in systemic side effects. Downstream inhibition of the CD40 signaling pathway therefore seems a better way to ameliorate inflammatory disease. To relay a signal, the CD40 receptor recruits adapter proteins called tumor necrosis factor receptor-associated factors (TRAFs). CD40-TRAF6 interactions are known to play an essential role in several inflammatory diseases. We used in silico, in vitro, and in vivo experiments to identify and characterize compounds that block CD40-TRAF6 interactions. We present in detail our drug docking and optimization pipeline and show how we used it to find lead compounds that reduce inflammation in models of peritonitis and sepsis. 
These compounds appear to be good leads for drug development, given the observed absence of side effects and their demonstrated efficacy for peritonitis and sepsis in mouse models. AU - Zarzycka, B.* AU - Seijkens, T.* AU - Nabuurs, S.B.* AU - Ritschel, T.* AU - Grommes, J.* AU - Soehnlein, O.* AU - Schrijver, R.S.* AU - van Tiel, C.M.* AU - Hackeng, T.M.* AU - Weber, C.* AU - Giehler, F. AU - Kieser, A. AU - Lutgens, E.* AU - Vriend, G.* AU - Nicolaes, G.A.F.* C1 - 43426 C2 - 36367 CY - Washington SP - 294-307 TI - Discovery of small molecule CD40-TRAF6 inhibitors. JO - J. Chem. Inf. Model. VL - 55 IS - 2 PB - Amer Chemical Soc PY - 2015 SN - 0021-9576 ER - TY - JOUR AB - In this study, we propose a novel approach to evaluate virtual screening (VS) experiments based on the analysis of docking output data. This approach, which we refer to as docking data feature analysis (DDFA), consists of two steps. First, a set of features derived from the docking output data is computed and assigned to each molecule in the virtually screened library. Second, an artificial neural network (ANN) analyzes the molecule's docking features and estimates its activity. Given the simple architecture of the ANN, DDFA can be easily adapted to deal with information from several docking programs simultaneously. We tested our approach on the Directory of Useful Decoys (DUD), a well-established and highly accepted VS benchmark. Outstanding results were obtained by DDFA not only in comparison with the conventional rankings of the docking programs used in this work but also with respect to other methods found in the literature. Our approach performs with similar good results as the best available methods, which, however, also require substantially more computing time, economic resources, and/or expert intervention. Taken together, DDFA represents an automatic and highly attractive methodology for VS. AU - Arciniega, M.* AU - Lange, O.F. 
C1 - 31529 C2 - 34510 CY - Washington SP - 1401-1411 TI - Improvement of virtual screening results by docking data feature analysis. JO - J. Chem. Inf. Model. VL - 54 IS - 5 PB - Amer Chemical Soc PY - 2014 SN - 0021-9576 ER - TY - JOUR AB - This article contributes a highly accurate model for predicting the melting points (MPs) of medicinal chemistry compounds. The model was developed using the largest published data set, comprising more than 47k compounds. The distributions of MPs in drug-like and drug lead sets showed that >90% of molecules melt within [50,250]°C. The final model calculated an RMSE of less than 33 °C for molecules from this temperature interval, which is the most important for medicinal chemistry users. This performance was achieved using a consensus model that performed calculations to a significantly higher accuracy than the individual models. We found that compounds with reactive and unstable groups were overrepresented among outlying compounds. These compounds could decompose during storage or measurement, thus introducing experimental errors. While filtering the data by removing outliers generally increased the accuracy of individual models, it did not significantly affect the results of the consensus models. The three analyzed distance-to-model measures did not allow us to flag molecules whose MP values fell outside the applicability domain of the model. We believe that this negative result and the public availability of data from this article will encourage future studies to develop better approaches to define the applicability domain of models. The final model, MP data, and identified reactive groups are available online at http://ochem.eu/article/55638 . AU - Tetko, I.V. AU - Sushko, Y.* AU - Novotarskyi, S.* AU - Patiny, L.* AU - Kondratov, I.* AU - Petrenko, A.E.* AU - Charochkina, L.* AU - Asiri, A.M.* C1 - 42927 C2 - 35875 SP - 3320-3329 TI - How accurately can we predict the melting points of drug-like compounds? JO - J. Chem. Inf. Model. 
VL - 54 IS - 12 PY - 2014 SN - 0021-9576 ER - TY - JOUR AB - The dimethyl sulfoxide (DMSO) solubility data from Enamine and two UCB pharma compound collections were analyzed using 8 different machine learning methods and 12 descriptor sets. The analyzed data sets were highly imbalanced with 1.7-5.8% nonsoluble compounds. The libraries' enrichment by soluble molecules from the set of 10% of the most reliable predictions was used to compare prediction performances of the methods. The highest accuracies were calculated using a C4.5 decision classification tree, random forest, and associative neural networks. The performances of the methods developed were estimated on individual data sets and their combinations. The developed models provided on average a 2-fold decrease of the number of nonsoluble compounds amid all compounds predicted as soluble in DMSO. However, a 4-9-fold enrichment was observed if only 10% of the most reliable predictions were considered. The structural features influencing compounds to be soluble or nonsoluble in DMSO were also determined. The best models developed with the publicly available Enamine data set are freely available online at http://ochem.eu/article/33409 . AU - Tetko, I.V. AU - Novotarskyi, S.* AU - Sushko, I.* AU - Ivanov, V.* AU - Petrenko, A.E.* AU - Dieden, R.* AU - Lebon, F.* AU - Mathieu, B.* C1 - 27475 C2 - 32688 SP - 1990-2000 TI - Development of dimethyl sulfoxide solubility models using 163 000 molecules: Using a domain applicability metric to select more reliable predictions. JO - J. Chem. Inf. Model. VL - 53 IS - 8 PB - Amer. Chemical Soc. PY - 2013 SN - 0021-9576 ER - TY - JOUR AB - Several applications, such as risk assessment within REACH or drug discovery, require reliable methods for the design of experiments and efficient testing strategies. 
Keeping the number of experiments as low as possible is important from both a financial and an ethical point of view, as exhaustive testing of compounds requires significant financial resources and animal lives. With a large initial set of compounds, experimental design techniques can be used to select a representative subset for testing. Once measured, these compounds can be used to develop quantitative structure activity relationship models to predict properties of the remaining compounds. This reduces the required resources and time. D-Optimal design is frequently used to select an optimal set of compounds by analyzing data variance. We developed a new sequential approach to apply a D-Optimal design to latent variables derived from a partial least squares (PLS) model instead of principal components. The stepwise procedure selects a new set of molecules to be measured after each previous measurement cycle. We show that application of the D-Optimal selection generates models with a significantly improved performance on four different data sets with end points relevant for REACH. Compared to those derived from principal components, PLS models derived from the selection on latent variables had a lower root-mean-square error and a higher Q2 and R2. This improvement is statistically significant, especially for the small number of compounds selected. AU - Brandmaier, S. AU - Sahlin, U.* AU - Tetko, I.V. AU - Öberg, T.* C1 - 8009 C2 - 29983 SP - 975-983 TI - PLS-optimal: A stepwise D-optimal design based on latent variables. JO - J. Chem. Inf. Model. VL - 52 IS - 4 PB - Amer. Chemical Soc. PY - 2012 SN - 0021-9576 ER - TY - JOUR AB - The article presents a Web-based platform for collecting and storing toxicological structural alerts from literature and for virtual screening of chemical libraries to flag potentially toxic chemicals and compounds that can cause adverse side effects. 
An alert is uniquely identified by a SMARTS template, a toxicological endpoint, and a publication where the alert was described. Additionally, the system allows storing complementary information such as name, comments, and mechanism of action, as well as other data. Most importantly, the platform can be easily used for fast virtual screening of large chemical datasets, focused libraries, or newly designed compounds against the toxicological alerts, providing a detailed profile of the chemicals grouped by structural alerts and endpoints. Such a facility can be used for decision making regarding whether a compound should be tested experimentally, validated with available QSAR models, or eliminated from consideration altogether. The alert-based screening can also be helpful for an easier interpretation of more complex QSAR models. The system is publicly accessible and tightly integrated with the Online Chemical Modeling Environment (OCHEM, http://ochem.eu). The system is open and expandable: any registered OCHEM user can introduce new alerts, browse, edit alerts introduced by other users, and virtually screen his/her data sets against all or selected alerts. The user sets being passed through the structural alerts can be used at OCHEM for other typical tasks: exporting in a wide variety of formats, development of QSAR models, additional filtering by other criteria, etc. The database already contains almost 600 structural alerts for such endpoints as mutagenicity, carcinogenicity, skin sensitization, compounds that undergo metabolic activation, and compounds that form reactive metabolites and, thus, can cause adverse reactions. The ToxAlerts platform is accessible on the Web at http://ochem.eu/alerts, and it is constantly growing. AU - Sushko, I.* AU - Salmina, E.* AU - Potemkin, V.A.* AU - Poda, G.* AU - Tetko, I.V. C1 - 11845 C2 - 30830 SP - 2310-2316 TI - ToxAlerts: A web server of structural alerts for toxic chemicals and compounds with potential adverse reactions. 
JO - J. Chem. Inf. Model. VL - 52 IS - 8 PB - American Chemical Society PY - 2012 SN - 0021-9576 ER - TY - JOUR AB - Prediction of CYP450 inhibition activity of small molecules poses an important task due to high risk of drug-drug interactions. CYP1A2 is an important member of CYP450 superfamily and accounts for 15% of total CYP450 presence in human liver. This article compares 80 in-silico QSAR models that were created by following the same procedure with different combinations of descriptors and machine learning methods. The training and test sets consist of 3745 and 3741 inhibitors and noninhibitors from PubChem BioAssay database. A heterogeneous external test set of 160 inhibitors was collected from literature. The studied descriptor sets involve E-state, Dragon and ISIDA SMF descriptors. Machine learning methods involve Associative Neural Networks (ASNN), K Nearest Neighbors (kNN), Random Tree (RT), C4.5 Tree (J48), and Support Vector Machines (SVM). The influence of descriptor selection on model accuracy was studied. The benefits of "bagging" modeling approach were shown. Applicability domain approach was successfully applied in this study and ways of increasing model accuracy through use of applicability domain measures were demonstrated as well as fragment-based model interpretation was performed. The most accurate models in this study achieved values of 83% and 68% correctly classified instances on the internal and external test sets, respectively. The applicability domain approach allowed increasing the prediction accuracy to 90% for 78% of the internal and 17% of the external test sets, respectively. The most accurate models are available online at http://ochem.eu/models/Q5747 . AU - Novotarskyi, S. AU - Sushko, I. AU - Körner, R. AU - Pandey, A.K. AU - Tetko, I.V. C1 - 5449 C2 - 29072 SP - 1271-1280 TI - A comparison of different QSAR approaches to modeling CYP450 1A2 inhibition. JO - J. Chem. Inf. Model. VL - 51 IS - 6 PB - Am. Chemical Soc. 
PY - 2011 SN - 0021-9576 ER - TY - JOUR AB - The estimation of accuracy and applicability of QSAR and QSPR models for biological and physicochemical properties represents a critical problem. The developed parameter of "distance to model" (DM) is defined as a metric of similarity between the training and test set compounds that have been subjected to QSAR/QSPR modeling. In our previous work, we demonstrated the utility and optimal performance of DM metrics that have been based on the standard deviation within an ensemble of QSAR models. The current study applies such analysis to 30 QSAR models for the Ames mutagenicity data set that were previously reported within the 2009 QSAR challenge. We demonstrate that the DMs based on an ensemble (consensus) model provide systematically better performance than other DMs. The presented approach identifies 30-60% of compounds having an accuracy of prediction similar to the interlaboratory accuracy of the Ames test, which is estimated to be 90%. Thus, the in silico predictions can be used to halve the cost of experimental measurements by providing a similar prediction accuracy. The developed model has been made publicly available at http://ochem.eu/models/1 . AU - Sushko, I. AU - Novotarskyi, S. AU - Körner, R. AU - Pandey, A.K. AU - Cherkasov, A.* AU - Li, J.* AU - Gramatica, P.* AU - Hansen, K.* AU - Schroeter, T.* AU - Müller, K.R.* AU - Xi, L.* AU - Liu, H.* AU - Yao, X.* AU - Öberg, T.* AU - Hormozdiari, F.* AU - Dao, P.* AU - Sahinalp, C.* AU - Todeschini, R.* AU - Polishchuk, P.* AU - Artemenko, A.* AU - Kuz'min, V.* AU - Martin, T.M.* AU - Young, D.M.* AU - Fourches, D.* AU - Muratov, E.* AU - Tropsha, A.* AU - Baskin, I.* AU - Horvath, D.* AU - Marcou, G.* AU - Müller, C.* AU - Varnek, A.* AU - Prokopenko, V.V.* AU - Tetko, I.V. C1 - 5165 C2 - 27865 SP - 2094-2111 TI - Applicability domains for classification problems: Benchmarking of distance to models for Ames mutagenicity set. JO - J. Chem. Inf. Model. 
VL - 50 IS - 12 PB - American Chemical Society PY - 2010 SN - 0021-9576 ER - TY - JOUR AB - Two inductive knowledge transfer approaches - multitask learning (MTL) and Feature Net (FN) - have been used to build predictive neural networks (ASNN) and PLS models for 11 types of tissue-air partition coefficients (TAPC). Unlike conventional single-task learning (STL) modeling focused only on a single target property without any relations to other properties, in the framework of inductive transfer approach, the individual models are viewed as nodes in the network of interrelated models built in parallel (MTL) or sequentially (FN). It has been demonstrated that MTL and FN techniques are extremely useful in structure-property modeling on small and structurally diverse data sets, when conventional STL modeling is unable to produce any predictive model. The predictive STL individual models were obtained for 4 out of 11 TAPC, whereas application of inductive knowledge transfer techniques resulted in models for 9 TAPC. Differences in prediction performances of the models as a function of the machine-learning method, and of the number of properties simultaneously involved in the learning, have been discussed. AU - Varnek, A.* AU - Gaudin, C.* AU - Marcou, G.* AU - Baskin, I.* AU - Pandey, A.K. AU - Tetko, I.V. C1 - 1271 C2 - 26848 SP - 133-144 TI - Inductive transfer of knowledge: Application of multi-task learning and feature net approaches to model tissue-air partition coefficients. JO - J. Chem. Inf. Model. VL - 49 IS - 1 PB - Amer Chemical Soc PY - 2009 SN - 0021-9576 ER - TY - JOUR AB - We present the application of a Java remote method invocation (RMI) based open source architecture to distributed chemical computing. This architecture was previously employed for distributed data harvesting of chemical information from the Internet via the Google application programming interface (API; ChemXtreme). 
Due to its open source character and its flexibility, the underlying server/client framework can be quickly adapted to virtually every computational task that can be parallelized. Here, we present the server/client communication framework as well as an application to distributed computing of chemical properties on a large scale (currently the size of PubChem; about 18 million compounds), using both the Marvin toolkit as well as the open source JOELib package. As an application, for this set of compounds, the agreement of log P and TPSA between the packages was compared. Outliers were found to be mostly non-druglike compounds and differences could usually be explained by differences in the underlying algorithms. ChemStar is the first open source distributed chemical computing environment built on Java RMI, which is also easily adaptable to user demands due to its "plug-in architecture". The complete source codes as well as calculated properties along with links to PubChem resources are available on the Internet via a graphical user interface at http://moltable.ncl.res.in/chemstar/. AU - Karthikeyan, M.* AU - Krishnan, S.* AU - Pandey, A.K. AU - Bender, A.* AU - Tropsha, A.* C1 - 1491 C2 - 25529 SP - 691-703 TI - Distributed chemical computing using ChemStar: An open source Java remote method invocation architecture applied to large scale molecular data from PubChem. JO - J. Chem. Inf. Model. VL - 48 IS - 4 PB - American Chemical Society PY - 2008 SN - 0021-9576 ER - TY - JOUR AB - The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that defines the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It could be expressed in many different ways, e.g., using Tanimoto coefficient, leverage, correlation in space of models, etc. 
In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with low and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at the http://www.qspr.org site. AU - Tetko, I.V. AU - Sushko, I. AU - Pandey, A.K. AU - Zhu, H. AU - Tropsha, A.* AU - Papa, E.* AU - Öberg, T.* AU - Todeschini, R.* AU - Fourches, D.* AU - Varnek, A.* C1 - 139 C2 - 25947 SP - 1733-1746 TI - Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: Focusing on applicability domain and overfitting by variable selection. JO - J. Chem. Inf. Model. 
VL - 48 IS - 9 PB - American Chemical Society PY - 2008 SN - 0021-9576 ER - TY - JOUR AB - Selecting the most rigorous quantitative structure-activity relationship (QSAR) approaches is of great importance in the development of robust and predictive models of chemical toxicity. To address this issue in a systematic way, we have formed an international virtual collaboratory consisting of six independent groups with shared interests in computational chemical toxicology. We have compiled an aqueous toxicity data set containing 983 unique compounds tested in the same laboratory over a decade against Tetrahymena pyriformis. A modeling set including 644 compounds was selected randomly from the original set and distributed to all groups that used their own QSAR tools for model development. The remaining 339 compounds in the original set (external set I) as well as 110 additional compounds (external set II) published recently by the same laboratory (after this computational study was already in progress) were used as two independent validation sets to assess the external predictive power of individual models. In total, our virtual collaboratory has developed 15 different types of QSAR models of aquatic toxicity for the training set. The internal prediction accuracy for the modeling set ranged from 0.76 to 0.93 as measured by the leave-one-out cross-validation correlation coefficient (Q^2_abs). The prediction accuracy for the external validation sets I and II ranged from 0.71 to 0.85 (linear regression coefficient R^2_absI) and from 0.38 to 0.83 (linear regression coefficient R^2_absII), respectively. The use of an applicability domain threshold implemented in most models generally improved the external prediction accuracy but at the same time led to a decrease in chemical space coverage. 
Finally, several consensus models were developed by averaging the predicted aquatic toxicity for every compound using all 15 models, with or without taking into account their respective applicability domains. We find that consensus models afford higher prediction accuracy for the external validation data sets with the highest space coverage as compared to individual constituent models. Our studies prove the power of a collaborative and consensual approach to QSAR model development. The best validated models of aquatic toxicity developed by our collaboratory (both individual and consensus) can be used as reliable computational predictors of aquatic toxicity and are available from any of the participating laboratories. AU - Zhu, H.* AU - Tropsha, A.* AU - Fourches, D.* AU - Varnek, A.* AU - Papa, E.* AU - Gramatica, P.* AU - Öberg, T.* AU - Dao, P. AU - Cherkasov, A.* AU - Tetko, I.V. C1 - 1785 C2 - 25523 SP - 766-784 TI - Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis. JO - J. Chem. Inf. Model. VL - 48 IS - 4 PB - American Chemical Society PY - 2008 SN - 0021-9576 ER - TY - JOUR AB - Several popular machine learning methods--Associative Neural Networks (ANN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), modified version of the partial least-squares analysis (PLSM), backpropagation neural network (BPNN), and Multiple Linear Regression Analysis (MLR)--implemented in ISIDA, NASAWIN, and VCCLAB software have been used to perform QSPR modeling of melting point of structurally diverse data set of 717 bromides of nitrogen-containing organic cations (FULL) including 126 pyridinium bromides (PYR), 384 imidazolium and benzoimidazolium bromides (IMZ), and 207 quaternary ammonium bromides (QUAT). Several types of descriptors were tested: E-state indices, counts of atoms determined for E-state atom types, molecular descriptors generated by the DRAGON program, and different types of substructural molecular fragments. 
Predictive ability of the models was analyzed using a 5-fold external cross-validation procedure in which every compound in the parent set was included in one of five test sets. Among the 16 types of developed structure--melting point models, nonlinear SVM, ASNN, and BPNN techniques demonstrate slightly better performance over other methods. For the full set, the accuracy of predictions does not significantly change as a function of the type of descriptors. For other sets, the performance of descriptors varies as a function of method and data set used. The root-mean squared error (RMSE) of prediction calculated on independent test sets is in the range of 37.5-46.4 degrees C (FULL), 26.2-34.8 degrees C (PYR), 38.8-45.9 degrees C (IMZ), and 34.2-49.3 degrees C (QUAT). The moderate accuracy of predictions can be related to the quality of the experimental data used for obtaining the models as well as to difficulties to take into account the structural features of ionic liquids in the solid state (polymorphic effects, eutectics, glass formation). AU - Varnek, A.* AU - Kireeva, N.* AU - Tetko, I.V. AU - Baskin, I.I.* AU - Solov'ev, V.P.* C1 - 2733 C2 - 24910 SP - 1111-1122 TI - Exhaustive QSPR studies of a large diverse set of ionic liquids: How accurately can we predict melting points? JO - J. Chem. Inf. Model. VL - 47 IS - 3 PB - American Chemical Society PY - 2007 SN - 0021-9576 ER - TY - JOUR AU - Brüggemann, R.* AU - Restrepo, G.* AU - Voigt, K. C1 - 2174 C2 - 23499 SP - 894-902 TI - Structure-fate relationships of organic chemicals derived from the software packages E4CHEM and WHASSE. JO - J. Chem. Inf. Model. 
VL - 46 PY - 2006 SN - 0021-9576 ER - TY - JOUR AB - A benchmark of several popular methods, Associative Neural Networks (ANN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), Maximal Margin Linear Programming (MMLP), Radial Basis Function Neural Network (RBFNN), and Multiple Linear Regression (MLR), is reported for quantitative structure-property relationships (QSPR) of stability constants logK1 for the 1:1 (M:L) and logbeta2 for 1:2 complexes of metal cations Ag+ and Eu3+ with diverse sets of organic molecules in water at 298 K and ionic strength 0.1 M. The methods were tested on three types of descriptors: molecular descriptors including E-state values, counts of atoms determined for E-state atom types, and substructural molecular fragments (SMF). Comparison of the models was performed using a 5-fold external cross-validation procedure. Robust statistical tests (bootstrap and Kolmogorov-Smirnov statistics) were employed to evaluate the significance of calculated models. The Wilcoxon signed-rank test was used to compare the performance of methods. Individual structure-complexation property models obtained with nonlinear methods demonstrated a significantly better performance than the models built using multilinear regression analysis (MLRA). However, the averaging of several MLRA models based on SMF descriptors provided as good a prediction as the most efficient nonlinear techniques. Support Vector Machines and Associative Neural Networks contributed to the largest number of significant models. Models based on fragments (SMF descriptors and E-state counts) had higher prediction ability than those based on E-state indices. The use of SMF descriptors and E-state counts provided similar results, whereas E-state indices led to less significant models. The current study illustrates the difficulties of quantitative comparison of different methods: conclusions based only on one data set without appropriate statistical tests could be wrong. 
AU - Tetko, I.V.* AU - Solov'ev, V.P.* AU - Antonov, A.V. AU - Yao, X.* AU - Doucet, J.P.* AU - Fan, B.* AU - Hoonakker, F.* AU - Fourches, D.* AU - Jost, P.* AU - Lachiche, N.* AU - Varnek, A.* C1 - 5394 C2 - 24095 SP - 808-819 TI - Benchmarking of linear and nonlinear approaches for quantitative structure-property relationship studies of metal complexation with ionophores. JO - J. Chem. Inf. Model. VL - 46 IS - 2 PY - 2006 SN - 0021-9576 ER - TY - JOUR AB - The mathematical and statistical evaluation of environmental data gains an increasing importance in environmental chemistry as the data sets become more complex. It is inarguable that different mathematical and statistical methods should be applied in order to compare results and to enhance the possible interpretation of the data. Very often several aspects have to be considered simultaneously, for example, several chemicals entailing a data matrix with objects (rows) and variables (columns). In this paper a data set is given concerning the pollution of 58 regions in the state of Baden-Württemberg, Germany, which are polluted with the metals lead, cadmium, and zinc, and with sulfur. For pragmatic reasons the evaluation is performed with the dichotomized data matrix. First this dichotomized 58 x 13 data matrix is evaluated by the Hasse diagram technique, a multicriteria evaluation method which has its scientific origin in Discrete Mathematics. Then the Partially Ordered Scalogram Analysis with Coordinates (POSAC) method is applied. It reduces the data matrix by plotting it in a two-dimensional space. A small given percentage of information is lost in this method. Important priority objects, like maximal and minimal objects (high and low polluted regions), can easily be detected by the Hasse diagram technique and POSAC. 
Two variables attained exceptional importance by the data analysis shown here: TLS, Sulfur found in Tree Layer, is difficult to interpret and needs further investigations, whereas LRPB, Lead in Lumbricus Rubellus, seems to be a satisfying result because the earthworm is commonly discussed in the ecotoxicological literature as a specific and highly sensitive bioindicator. AU - Brüggemann, R.* AU - Welzl, G. AU - Voigt, K. C1 - 22790 C2 - 31080 SP - 1771-1779 TI - Order theoretical tools for the evaluation of complex regional pollution patterns. JO - J. Chem. Inf. Model. VL - 43 IS - 6 PB - ACS Publishing PY - 2003 SN - 0021-9576 ER - TY - JOUR AU - Brüggemann, R.* AU - Halfon, E.* AU - Welzl, G. AU - Voigt, K. AU - Steinberg, C.E.W.* C1 - 21704 C2 - 19896 SP - 918-925 TI - Applying the Concept of Partially Ordered Sets on the Ranking of Near-Shore Sediments by a Battery of Tests. JO - J. Chem. Inf. Model. VL - 41 PY - 2001 SN - 0021-9576 ER - TY - JOUR AU - Voigt, K. AU - Gasteiger, J.* AU - Brüggemann, R.* C1 - 21244 C2 - 19355 SP - 44-49 TI - Comparative Evaluation of Chemical and Environmental Online and CD-ROM Databases. JO - J. Chem. Inf. Model. VL - 40 PY - 2000 SN - 0021-9576 ER -