Tetko, I.V. ; Sushko, I. ; Pandey, A.K. ; Zhu, H. ; Tropsha, A.* ; Papa, E.* ; Öberg, T.* ; Todeschini, R.* ; Fourches, D.* ; Varnek, A.*
     
 
    
        
Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: Focusing on applicability domain and overfitting by variable selection.
    
    
        
    
    
        
        J. Chem. Inf. Model. 48, 1733-1746 (2008)
    
    
		
		
		
		
		
		
		  
		
		
			Green Open Access is possible as soon as the postprint has been submitted to the ZB.
		
     
    
		
		
			
				The estimation of the accuracy of predictions is a critical problem in QSAR modeling. The "distance to model" can be defined as a metric that measures the similarity between the training set molecules and the test set compound for the given property in the context of a specific model. It can be expressed in many different ways, e.g., using the Tanimoto coefficient, leverage, correlation in the space of models, etc. In this paper we have used mixtures of Gaussian distributions as well as statistical tests to evaluate six types of distances to models with respect to their ability to discriminate compounds with small and large prediction errors. The analysis was performed for twelve QSAR models of aqueous toxicity against T. pyriformis obtained with different machine-learning methods and various types of descriptors. The distances to model based on the standard deviation of predicted toxicity calculated from the ensemble of models afforded the best results. This distance also successfully discriminated molecules with small and large prediction errors for a mechanism-based model developed using log P and the Maximum Acceptor Superdelocalizability descriptors. Thus, the distance to model metric could also be used to augment mechanistic QSAR models by estimating their prediction errors. Moreover, the accuracy of prediction is mainly determined by the training set data distribution in the chemistry and activity spaces but not by the QSAR approaches used to develop the models. We have shown that incorrect validation of a model may result in the wrong estimation of its performance and suggested how this problem could be circumvented. The toxicity of 3182 and 48774 molecules from the EPA High Production Volume (HPV) Challenge Program and EINECS (European chemical Substances Information System), respectively, was predicted, and the accuracy of prediction was estimated. The developed models are available online at the http://www.qspr.org site.
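				The ensemble-based distance to model mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of the population standard deviation, and the single threshold used to split compounds into "near" and "far" groups are all assumptions made for illustration, and the abstract does not specify any of them.

```python
from statistics import mean, pstdev


def ensemble_std_distance(predictions_per_model):
    """Return (mean prediction, standard deviation) for each compound.

    predictions_per_model: one list per ensemble member, each holding the
    predicted toxicity for the same compounds in the same order.
    The standard deviation across ensemble members serves as a
    hypothetical "distance to model" (population form, an assumption).
    """
    per_compound = zip(*predictions_per_model)  # regroup by compound
    return [(mean(p), pstdev(p)) for p in per_compound]


def mean_abs_error_by_distance(distances, abs_errors, threshold):
    """Average absolute error for compounds at or below the threshold
    distance and for compounds above it. Returns (near, far); a group
    with no compounds yields None.
    """
    near = [e for d, e in zip(distances, abs_errors) if d <= threshold]
    far = [e for d, e in zip(distances, abs_errors) if d > threshold]
    return (mean(near) if near else None, mean(far) if far else None)
```

				If the distance is informative, the average error of the "far" group should exceed that of the "near" group on an external test set; the threshold would have to be chosen from the data.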
			
			
				
			
		 
		
			
				
					
					
				 
				
			 
		 
		
     
    
        Publication type
        Article: Journal article
    
 
    
        Document type
        Scientific article
    
 
    
    
    
        Keywords
        neural-networks; QSPR models; error estimation; validation; prediction; solubility; confidence; regression; molecules; accuracy
    
 
    
    
 
    
        Publication year
        2008
    
 
    
    
 
    
        HGF reporting year
        2008
    
 
    
    
        ISSN (print) / ISBN
        0021-9576
    
 
    
        e-ISSN
        1520-5142
    
 
    
    
 
     
		
    
        Source details
	    Volume: 48,
	    Issue: 9,
	    Pages: 1733-1746
	
    
 
  
        
        
 
        
            Publisher
            American Chemical Society (ACS)
        
 
        
        
 
    
        Review status
        Peer reviewed
    
 
     
    
        POF Topic(s)
        30505 - New Technologies for Biomedical Discoveries
    
 
    
        Research field(s)
        Enabling and Novel Technologies
    
 
    
        PSP element(s)
        G-503700-001
    
 
    
    
 	
    
    
    
        Date of entry
        2008-12-31