Abstract
Ligand-based computational models could be more readily shared between researchers and organizations if they were generated with open source molecular descriptors [e.g., chemistry development kit (CDK)] and modeling algorithms, because this would remove the requirement for proprietary commercial software. We initially evaluated open source descriptors and model building algorithms using a training set of approximately 50,000 molecules and a test set of approximately 25,000 molecules with human liver microsomal metabolic stability data. A C5.0 decision tree model built with CDK descriptors together with a set of SMILES Arbitrary Target Specification (SMARTS) keys had good statistics [κ = 0.43, sensitivity = 0.57, specificity = 0.91, and positive predictive value (PPV) = 0.64], equivalent to those of models built with commercial Molecular Operating Environment 2D (MOE2D) descriptors and the same set of SMARTS keys (κ = 0.43, sensitivity = 0.58, specificity = 0.91, and PPV = 0.63). Extending the dataset to ∼193,000 molecules and generating a continuous model using Cubist with a combination of CDK and SMARTS keys or MOE2D and SMARTS keys confirmed this observation. When the continuous predictions and actual values were binned to obtain a categorical score, we observed a similar κ statistic (0.42). The same combination of descriptor set and modeling method was applied to passive permeability and P-glycoprotein efflux data and gave similar model testing statistics. In summary, open source tools demonstrated predictive results comparable to those of commercial software, with attendant cost savings. We discuss the advantages and disadvantages of open source descriptors and the opportunity for their use as a tool for organizations to share data precompetitively, avoiding repetition and assisting drug discovery.
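The categorical statistics quoted above (κ, sensitivity, specificity, and PPV) all follow from a two-class confusion matrix, and the binning step for a continuous model amounts to thresholding both the predicted and the actual values. The following is a minimal sketch of that arithmetic, not the authors' implementation: the function names, the threshold, and the example numbers are hypothetical and are not taken from the study.

```python
def bin_values(values, threshold):
    """Convert continuous values to binary classes.

    Assumption: a single cutoff defines the positive class; the cutoff
    used in the study is not given in the abstract.
    """
    return [v >= threshold for v in values]


def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for two boolean lists."""
    tp = sum(a and p for a, p in zip(actual, predicted))
    fp = sum((not a) and p for a, p in zip(actual, predicted))
    tn = sum((not a) and (not p) for a, p in zip(actual, predicted))
    fn = sum(a and (not p) for a, p in zip(actual, predicted))
    return tp, fp, tn, fn


def binary_stats(tp, fp, tn, fn):
    """Return kappa, sensitivity, specificity, and PPV for a 2x2 table."""
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)   # fraction of actual positives found
    specificity = tn / (tn + fp)   # fraction of actual negatives found
    ppv = tp / (tp + fp)           # fraction of predicted positives that are correct
    observed = (tp + tn) / n       # observed agreement
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return {
        "kappa": kappa,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
    }


if __name__ == "__main__":
    # Hypothetical continuous values and cutoff, for illustration only.
    actual = bin_values([1.0, 5.0, 10.0, 20.0], threshold=8.0)
    predicted = bin_values([2.0, 12.0, 9.0, 25.0], threshold=8.0)
    print(binary_stats(*confusion_counts(actual, predicted)))
```

Because κ corrects the observed agreement for the agreement expected by chance, it can be modest even when specificity is high, which is the pattern seen in the values reported above.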
Footnotes
M.H., B.A.B., and S.E. are employees or consultants of Collaborative Drug Discovery, Inc.
This work was supported by Collaborative Drug Discovery, Inc. funding from the Bill and Melinda Gates Foundation [Grant 49852] (“Collaborative Drug Discovery for TB through a Novel Database of SAR Data Optimized to Promote Data Archiving and Sharing”).
Article, publication date, and citation information can be found at http://dmd.aspetjournals.org.
doi:10.1124/dmd.110.034918.
ABBREVIATIONS:
- ADME/Tox: absorption, distribution, metabolism, excretion, and toxicity
- QSAR: quantitative structure-activity relationship
- PDM: Pharmacokinetics, Dynamics and Metabolism
- HLM: human liver microsomes
- RRCK: Russ Ralph canine kidney
- P-gp: P-glycoprotein
- MDR: multidrug resistance
- Clint: intrinsic clearance
- Papp: passive apparent permeability
- A: apical
- B: basolateral
- MOE2D: Molecular Operating Environment 2D
- CDK: chemistry development kit
- SVM: support vector machine
- RP: Recursive Partitioning
- RMSE: root mean squared error
- PCA: principal component analysis
- PC: principal component
- SMARTS: SMILES Arbitrary Target Specification
Received June 9, 2010.
Accepted August 3, 2010.
Copyright © 2010 by The American Society for Pharmacology and Experimental Therapeutics