Abstract
Ligand-based computational models could be more readily shared between researchers and organizations if they were generated with open source molecular descriptors [e.g., chemistry development kit (CDK)] and modeling algorithms, because this would negate the requirement for proprietary commercial software. We initially evaluated open source descriptors and model building algorithms using a training set of approximately 50,000 molecules and a test set of approximately 25,000 molecules with human liver microsomal metabolic stability data. A C5.0 decision tree model demonstrated that CDK descriptors together with a set of Smiles Arbitrary Target Specification (SMARTS) keys had good statistics [κ = 0.43, sensitivity = 0.57, specificity = 0.91, and positive predicted value (PPV) = 0.64], equivalent to those of models built with commercial Molecular Operating Environment 2D (MOE2D) and the same set of SMARTS keys (κ = 0.43, sensitivity = 0.58, specificity = 0.91, and PPV = 0.63). Extending the dataset to ∼193,000 molecules and generating a continuous model using Cubist with a combination of CDK and SMARTS keys or MOE2D and SMARTS keys confirmed this observation. When the continuous predictions and actual values were binned to get a categorical score we observed a similar κ statistic (0.42). The same combination of descriptor set and modeling method was applied to passive permeability and P-glycoprotein efflux data with similar model testing statistics. In summary, open source tools demonstrated predictive results comparable to those of commercial software with attendant cost savings. We discuss the advantages and disadvantages of open source descriptors and the opportunity for their use as a tool for organizations to share data precompetitively, avoiding repetition and assisting drug discovery.
Introduction
Problems associated with late-stage failures of potent lead compounds in the pharmaceutical industry due to undesirable physicochemical properties have led to a shift in drug discovery protocols for well over a decade. Pharmaceutical companies increasingly evaluate lead compounds for drug-like properties very early in the discovery process using computational prediction methods that are based on statistical techniques applied to experimental data from in vitro or physicochemical property assays (Ekins et al., 2000a). Well validated ligand-based in silico approaches are important and exist in the large pharmaceutical companies because these organizations have large, diverse proprietary datasets, the financial resources for expensive commercial software, and access to in-house computational, medicinal chemistry, and high-throughput screening expertise. All of these enablers are generally or in part lacking in academia, small biotechnology companies, and nonprofit neglected disease foundations.
Screening molecules for absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties can be done using in vitro and in vivo methods, but these are not cost-effective for very large numbers of compounds. Instead, in silico techniques to predict these properties can be used, and only those compounds that look likely to advance as lead molecules can be screened using in vitro and in vivo techniques. This approach can also lead to implementation of an active learning paradigm (Gupta and Gifford, 2009), in which a computational model is used to decide whether each compound needs to be screened (Fig. 1). The primary limitation of such computational methods today is the absence of optimal training sets that adequately cover chemical space (because they use small literature datasets or low-quality datasets or combine disparate datasets). When large proprietary datasets are used, the derived models are not publicly available. The computational approaches are inherently only as good as the underlying data from which they are derived. If the models could be improved by leveraging more quality in vitro data and the methods could be widely used and understood by experimentalists as well as by computational scientists, it is evident that the results would be of enormous value to the entire drug discovery ecosystem, both industrial and academic.
Several ADME/Tox methods were proposed at least a decade ago, and the application or comparison of these programs has been extensively studied (Ekins et al., 2000b, 2001a; Ekins, 2007; Villoutreix et al., 2007; Lagorce et al., 2008). The challenge is not only that sizable drug discovery data (training sets) are lacking for model building but also that there has been no mechanism to bring together isolated training sets, especially the very large proprietary datasets from different companies. If the sensitive intellectual property contained in the training sets could be obfuscated, pharmaceutical organizations would often be willing to share these models with collaborators and academics working on important neglected diseases, for example. There have been some efforts in understanding how chemical information can be shared without directly sharing structures; for example, fingerprints should be avoided, and low levels of precision in numeric descriptors and feature count descriptors may be "fuzzy" enough to protect the structure identity (Masek et al., 2008).
Software developed under the open source license provides important visibility into the implementation of descriptors and algorithms, so that computational chemists can verify the algorithm and suggest or actually contribute improvements (Guha et al., 2006). There are a number of open source software packages that calculate molecular descriptors (Melville and Hirst, 2007; Sykora and Leahy, 2008) or implement modeling algorithms (e.g., R). Some groups have also used open descriptors and open modeling algorithms to build quantitative structure-activity relationship (QSAR) models (Guangli and Yiyu, 2006; Melville and Hirst, 2007; Guha, 2008) for mutagenicity, cytotoxicity, and Caco-2 data as well as some drug targets. The datasets used have been relatively small to date (low thousands of molecules). Although there are some open toolkits for cheminformatics and bioinformatics (Guha et al., 2006; Steinbeck et al., 2006; Spjuth et al., 2007, 2009) as well as proposed Web services (Dong et al., 2007), no integrated open toolkit exists at the time of writing. In the current study, we evaluate how some open descriptors and algorithms perform versus commercial software for generating ADME/Tox models with very large datasets produced at Pfizer.
Materials and Methods
Datasets.
All compounds tested were synthesized at Pfizer as part of drug discovery projects. The dataset size generally exceeded 60,000 compounds for each assay. The datasets were binned as per the guidance provided by experts in the Pharmacokinetics, Dynamics and Metabolism (PDM) business unit. In the work presented here, we have primarily evaluated multiple datasets, all derived from Pfizer in-house compound screening. The datasets described in this work are human liver microsomal stability (HLM), passive permeability (RRCK), and P-glycoprotein (P-gp) efflux activity (MDR). We also briefly present one case using literature solubility data.
The human liver microsomal stability dataset has more than 200,000 compounds and covers a diverse range of chemistry as well as the range of therapeutic areas in which these compounds have been developed. This assay allows the measurement of the apparent intrinsic clearance (Clint) of a compound in human liver microsomes. Clearance is binned as low [Clint < 13 μl/(min · mg)], moderate [13 < Clint < 50 μl/(min · mg)], and high [Clint > 50 μl/(min · mg)] (Table 1). A three-bin classification model as well as a continuous model on the full dataset were built. The distribution of the data in each class in this and the other datasets is shown in Table 2.
The permeability dataset, which contains more than 70,000 compounds, derives from a cell-based assay with cellular passive apparent permeability (Papp) as the endpoint. The Papp values are rates (expressed in units of 10−6 cm/s); the higher the value, the faster the compound crosses the cell monolayer. The dynamic range for Papp (in units of 10−6 cm/s) is described as follows: Papp < 2.5 (low), 2.5 < Papp < 10 (moderate), and Papp > 10 (high) (Table 1). For passive permeability, a three-bin classification model was built.
The P-gp efflux activity dataset has more than 60,000 diverse compounds. This is also a cell-based assay, used to assess P-gp efflux activity in cell lines transfected with the human MDR-1 gene. The cell line is used in a bidirectional evaluation of permeability [apical to basolateral (A to B) and basolateral to apical (B to A)], generating a final (B to A)/(A to B) ratio that can be used to determine whether there is asymmetry in the flux due to transporter activity. A compound is considered to be effluxed if its (B to A)/(A to B) ratio is 2.5 or greater in any of the individual cell lines (Table 1). For this endpoint, a two-bin classification model was derived. Thus, these three rich datasets provide a very good starting point with which to test the model prediction metrics.
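The binning rules described for the three endpoints can be sketched as simple threshold functions (a minimal Python illustration; the function names are ours, and the assignment of values falling exactly on a bin boundary is an assumption, because the text leaves the boundaries open):

```python
def bin_hlm_clint(clint):
    """Three-bin HLM stability class from apparent intrinsic
    clearance in uL/(min * mg); boundary values go to the higher bin
    (an assumption, since the text uses strict inequalities)."""
    if clint < 13:
        return "low"
    elif clint <= 50:
        return "moderate"
    else:
        return "high"

def bin_papp(papp):
    """Three-bin passive permeability class; papp in units of 1e-6 cm/s."""
    if papp < 2.5:
        return "low"
    elif papp <= 10:
        return "moderate"
    else:
        return "high"

def bin_efflux(ratio_ba_ab):
    """Two-bin P-gp call: effluxed if (B to A)/(A to B) >= 2.5."""
    return "effluxed" if ratio_ba_ab >= 2.5 else "not effluxed"
```

The categorical models described later predict these class labels directly; the continuous HLM model predicts log10 Clint, which can then be binned with the same thresholds for comparison.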
As a test of our strategy to demonstrate the applicability and equivalence of models built with open source descriptors and modeling methods, we also applied the same descriptors and modeling methods as used for the above-mentioned datasets to a public domain dataset, namely the aqueous solubility dataset of Huuskonen (2000). This publication described the application of neural networks and linear regression with topological indices and E-state descriptors on a set of ∼1300 diverse organic compounds.
Descriptors.
We used different descriptors such as the Pfizer modified Molecular Operating Environment 2D (MOE2D) set (2008) and CDK (http://cdk.sourceforge.net/) (Steinbeck et al., 2006) fingerprints and others. For each of the datasets the same descriptors were calculated, i.e., for RRCK, HLM, and others, we always calculated Pfizer modified MOE2D descriptors (463), CDK descriptors (195), and SMARTS keys (355). There was no descriptor pruning or feature selection performed because the in silico QSAR models discussed here are routinely updated with new screening data, and recalculating descriptors for ∼100,000 compounds each time would be too computationally expensive. To avoid situations in which new descriptors are needed to capture new chemotypes, we routinely calculate and test new descriptor sets and add them to the list as necessary.
The 355 SMARTS keys (a set of SMARTS strings used as count-based descriptors, assembled very carefully by scientists at Pfizer) cover a wide range of SMARTS-encoded substructural fragment/feature descriptors. Multiple internal studies at Pfizer have shown that adding these types of descriptors to physicochemical property-based descriptors yields better models than those built solely on physicochemical property-based descriptors. We are also constantly evaluating new substructural fragments that could be added to the existing set of SMARTS keys.
Model Building.
We have made this study as exhaustive and comprehensive as possible by comparing results for a variety of modeling methods such as Random Forest (Liaw and Wiener, 2002), SVM (Chang and Lin, http://www.csie.ntu.edu.tw/∼cjlin/papers/guide/guide.pdf), Recursive Partitioning (RP) Forest (SciTegic Pipeline Pilot version 7.5.2; Accelrys, San Diego, CA), and others.
The key component of this work was to first build either a robust continuous or categorical model that would be able to deal with the diverse datasets. For continuous models, the endpoint data were scaled by taking a base 10 logarithm (log10) of the data. By using the full data and descriptor matrix, a regression model was built using the Rulequest Cubist (Quinlan, 1991) modeling algorithm, or a categorical model was built using the Rulequest C5.0 (Quinlan, 1993) modeling method or other modeling methods such as SVM, Random Forest, or RP Forest.
The Rulequest Cubist algorithm can be defined as a piecewise linear modeling method with boosted trees (with the exception that the rules can overlap). It can also construct multiple models (committees) and can combine "rule-based" models with "instance-based" (nearest neighbor) models (http://rulequest.com/cubist-unix.html). The committee models are made up of several rule-based models. Each member of the committee predicts the target value for a case, and the members' predictions are averaged to give a final prediction. In the present work we used five instances, i.e., the algorithm would look for the five closest neighbors to our test compound in the dataset, 20 committees, and the default setting of the rules (1000). It is important to note that Rulequest Cubist performs the nearest neighbor search using the Manhattan distance (also known as city block distance) in the descriptor space. This combination of parameters was found to be the optimal balance of prediction quality and computational efficiency in multiple in-house studies (results not shown).
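Cubist itself is proprietary, but the two ingredients described here, averaging committee predictions and correcting with the Manhattan-nearest neighbors, can be sketched as follows (a simplified illustration; the equal-weight blend of the rule-based and instance-based predictions is our assumption, not Cubist's actual composite formula):

```python
def manhattan(a, b):
    # City block distance in descriptor space
    return sum(abs(x - y) for x, y in zip(a, b))

def nearest_neighbors(query, training, k=5):
    """Return the k training rows (descriptors, activity) closest to
    `query` by Manhattan distance."""
    return sorted(training, key=lambda row: manhattan(query, row[0]))[:k]

def committee_predict(models, query):
    """Average the predictions of the committee members
    (each member stands in for one rule-based model)."""
    preds = [m(query) for m in models]
    return sum(preds) / len(preds)

def instance_corrected_predict(models, training, query, k=5):
    """Blend the committee prediction with the mean activity of the
    k Manhattan-nearest training compounds (a rough stand-in for
    Cubist's combined rule/instance prediction)."""
    rule_pred = committee_predict(models, query)
    neighbors = nearest_neighbors(query, training, k)
    nn_pred = sum(y for _, y in neighbors) / len(neighbors)
    return 0.5 * (rule_pred + nn_pred)  # equal weighting is an assumption
```

In the study's configuration this would correspond to k = 5 instances and 20 committee members.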
For each of the categorical models discussed in this study, the training and test sets were assembled (Table 2) using a maximum dissimilarity algorithm, which produced representative subsets of the larger datasets. It is always beneficial to test new techniques with a variety of datasets to ensure that the metric is not assay- or endpoint-dependent, succeeding in some and failing in others.
Model Testing and Evaluation.
As a standard for determining the quality of the classification models built on the datasets described above, the κ statistic (κ > 0.4) was used as a measure of the "predictability" of the model. The κ statistic (Carletta, 1996; Cohen, 2003) can be defined as an index that compares the agreement against that which might be expected by chance (eq. 1). κ can be thought of as the chance-corrected proportional agreement, and possible values range from +1 (perfect agreement) via 0 (no agreement above that expected by chance) to −1 (complete disagreement): κ = [Pr(a) − Pr(e)]/[1 − Pr(e)] (eq. 1), where Pr(a) is the relative observed agreement among raters and Pr(e) is the hypothetical probability of chance agreement, using the observed data to calculate the probabilities of each observer randomly saying each category. If the raters are in complete agreement, then κ = 1. If there is no agreement among the raters (other than what would be expected by chance), then κ ≤ 0.
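Under this definition, κ can be computed from the observed and predicted class labels as follows (a minimal sketch; it assumes the raters do not each use only a single identical class, which would make the denominator zero):

```python
from collections import Counter

def cohens_kappa(observed, predicted):
    """Chance-corrected agreement between two label sequences:
    kappa = (Pr(a) - Pr(e)) / (1 - Pr(e))."""
    n = len(observed)
    # Pr(a): observed proportion of exact agreement
    pr_a = sum(o == p for o, p in zip(observed, predicted)) / n
    # Pr(e): expected chance agreement from each rater's class marginals
    obs_counts = Counter(observed)
    pred_counts = Counter(predicted)
    pr_e = sum(obs_counts[c] * pred_counts[c] for c in obs_counts) / (n * n)
    return (pr_a - pr_e) / (1 - pr_e)
```

For example, complete agreement over two balanced classes gives κ = 1, and complete disagreement gives κ = −1, matching the range described above.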
For continuous models R2 (the square of the sample correlation coefficient between the outcome and the values being used for prediction) and root mean squared error (RMSE) were evaluated on test datasets as a quality measurement. R2 provides information on the goodness of the fit, i.e., how well the regression line approximates the real data points (eq. 2). The value is between 0 and 1, 0 being no correlation and 1 being a perfect correlation; in other words, R2 = 1 indicates that the regression line perfectly fits the data: R2 = 1 − SSerr/SStot (eq. 2), where SSerr is the residual sum of squares and SStot is the total sum of squares.
RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values actually observed from the object being modeled or estimated. RMSE is a good measure of precision.
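Both measures can be computed directly from these definitions (a minimal sketch using the SSerr/SStot form of eq. 2):

```python
import math

def r_squared(actual, predicted):
    """Coefficient of determination: R^2 = 1 - SSerr/SStot (eq. 2)."""
    mean = sum(actual) / len(actual)
    ss_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_err / ss_tot

def rmse(actual, predicted):
    """Root mean squared error between observed and predicted values."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
```

For the continuous HLM model described later, `actual` and `predicted` would be the observed and predicted log10 Clint values of the test set.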
Chemical Space Analysis.
A visualization of the chemical space covered by a test and training set can be created using principal component analysis (PCA) (Zientek et al., 2010). Such a visual chemical space map was generated for the HLM dataset by converting the CDK and SMARTS descriptors (total ∼579 descriptors) into principal components (PCs). The PCA component in Pipeline Pilot (Accelrys) was used for this calculation. First, we calculated the PCs for the training set, i.e., a matrix of ∼193,000 compounds and 579 descriptors and then we calculated the PCs for the test set (2300 compounds and 579 descriptors). These PCs for the training and test sets were used to produce a scatterplot with Spotfire (TIBCO, Somerville, MA).
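The key point, fitting the principal components on the training descriptor matrix and then projecting the test matrix onto those same components, can be sketched as follows (a minimal NumPy illustration, not the Pipeline Pilot implementation):

```python
import numpy as np

def fit_pca(train, n_components=2):
    """Fit PCA on the training descriptor matrix only: center by the
    training mean and take the leading right singular vectors."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:n_components]

def project(data, mean, components):
    """Project any descriptor matrix (training or test) onto the
    components fitted on the training set."""
    return (data - mean) @ components.T
```

The essential design choice is that the test set is projected with the training-set mean and loadings rather than refit, so the two sets can be overlaid fairly in the same chemical space map.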
Results
Model Testing and Evaluation: Human Liver Microsomal Stability.
Initially, using approximately 50,000 molecules with human microsomal metabolic stability data (Tables 2 and 3) and the C5.0 decision tree building software, we clearly demonstrated that a combination of open CDK descriptors and SMARTS keys is equivalent, on the statistics presented, to models built with a combination of MOE2D descriptors and SMARTS keys. We then extended this model with more than 193,000 molecules, testing on approximately 2300 molecules (Table 4). We were able to build both a regression model and a classification model. To make sure that each model was predictive, we split the data into a training set (80% of the total data) and a test set (20% of the total data) using the venetian blind splitting method (Davis et al., 2006), which retains an identical data distribution in the test set and the training set. As an example, to perform an 80-20 split, every fifth compound in an activity-sorted list can be added to the test set, and the remaining data become the training set. In addition, because HLM is a high-throughput assay, approximately 1500 to 2000 compounds are screened every 2 weeks, and hence the newly screened compound list was used as a blind test set. As shown in Table 4, the R2 on the 20% test set is ∼0.7, which is a relatively high correlation coefficient for a dataset of this magnitude. In addition, the RMSE was 0.291, indicating that prediction errors were generally small. When we observe the performance of this model on the blind dataset (which usually has new chemotypes not represented in the training set or in the 20% set-aside test set), we see a Pearson correlation of 0.53, which is also a reasonable value for such a dataset because it probably lies partially outside the applicability domain of the current training set.
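The venetian blind split described above can be sketched as follows (a minimal illustration; which compound within each group of five goes to the test set is arbitrary, and our choice of the fifth is an assumption):

```python
def venetian_blind_split(compounds, activities, fold=5):
    """Sort by activity, then send every `fold`-th compound to the test
    set so that the test set mirrors the training-set activity
    distribution (an 80-20 split when fold=5)."""
    order = sorted(range(len(compounds)), key=lambda i: activities[i])
    test_idx = set(order[fold - 1 :: fold])  # every fifth compound in sorted order
    train = [compounds[i] for i in order if i not in test_idx]
    test = [compounds[i] for i in order if i in test_idx]
    return train, test
```

Because the selection strides through the activity-sorted list, both sets span the full dynamic range of the endpoint, unlike a purely random split, which can under-sample the extremes.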
Comparing these results between the two different sets of descriptors, we can clearly observe that there is no difference in the quality of the two models. The advantage of using CDK descriptors with SMARTS keys is that we have reduced our total descriptors from 818 (MOE2D and SMARTS) to 550, which reduces the dimensionality and hence significantly improves the speed of the calculations (prediction time scales linearly with the number of descriptors) (Table 4). Another test was done to check the performance of the regression model against the categorical (or binned) model that was previously built on the same dataset. We binned the actual and predicted values and then compared the statistics for models built using the two sets of descriptors, and, as shown, the results were very similar, as we achieved a κ statistic of 0.4 or higher (Table 4). This finding suggests that open descriptors can generate results comparable to those of commercial descriptors for HLM.
Model Testing and Evaluation: RRCK Passive Permeability.
For RRCK passive permeability (Table 5), we built a categorical model that allowed a compound to be predicted as high risk, moderate risk, or low risk based on the criteria provided to us by PDM domain experts. For in-house applications, we only use a continuous model, but for this study we built only categorical models to compare the model performance for different combinations of modeling methods and descriptors (Table 5). The results are very promising for the C5.0 modeling method with MOE2D and SMARTS keys as descriptors. This was our baseline because a very similar continuous version of this model is what has been implemented in-house for the research community within Pfizer. The aim at this point was to provide an in silico model that is either equivalent to or better than the baseline model. Going through various combinations of modeling methods and descriptors, we determined that the C5.0 modeling method combined with CDK and SMARTS keys as descriptors performed approximately the same as our baseline model. Other model and descriptor combinations either did not complete because of the memory intensiveness of the calculations or were not comparable to the baseline.
Model Testing and Evaluation: P-gp Efflux Data.
Another test case was chosen using P-gp efflux data in which the data were segregated into two bins, i.e., low risk and high risk as recommended by colleagues in the PDM group. As shown in Table 6, an exercise similar to the one described above was performed for RRCK passive permeability, in which a baseline model and descriptor combination was chosen, and then we tested a variety of modeling methods and descriptor combinations to find a better or equivalent combination. Once again the results were encouraging as we observed that predictions for the C5.0 models built with either the MOE2D and SMARTS keys or CDK and SMARTS keys were about the same as the baseline model (Table 6).
There may be concerns that the κ values are not identical across all the datasets, but given the dataset sizes and the variation in descriptor types, the κ for the baseline model and that for the comparable open-descriptor model are similar within each dataset. Moreover, with a reduced number of descriptors we always have the advantage of less computationally intensive calculations.
Model Testing and Evaluation: Aqueous Solubility Dataset.
For further proof of this modeling approach, we used smaller public domain datasets for benchmarking, such as the aqueous solubility dataset for regression modeling of Huuskonen (2000) (using more than 1000 molecules for training and more than 200 for testing), which gave results comparable to the published R2 = 0.92 (data not shown). This result illustrates that we can use open descriptors and model building algorithms to build models with predictions equivalent to those with commercial descriptors for both large and small classification and continuous datasets.
Discussion
Computational QSAR models are primarily based on proprietary software, training data, and descriptors and are stored in proprietary file formats. These models are locked into a particular set of prerequisites and intellectual property restrictions that cannot be replicated anywhere else. The organizations best able to leverage QSAR modeling today are large pharmaceutical organizations, which have the resources to generate their own extremely large (hundreds of thousands of compounds), high-quality, diverse training sets and are able to standardize use of expensive proprietary modeling software across their organizations while then deploying their models on the intranet. One area of focus for at least a decade has been ADME/Tox modeling and high-throughput screening, which has now resulted in very large numbers of compounds and data available for modeling using machine learning methods, such as those described here.
There is also considerable discussion about how to evaluate computational models in other areas (Organisation for Economic Co-operation and Development, 2004; Dearden et al., 2009), yet there are no clear standard methods for evaluating model robustness for ADME/Tox properties. Widely accepted approaches for validating and testing models on such large datasets include leaving out groups at random many times (e.g., leaving out 20% or 50% of the data, repeated 100 times, sometimes described as X-fold cross-validation) or using external test sets (Tetko et al., 2008; Zheng et al., 2009). We could define acceptable models (depending on the endpoint) as those that predict correct classes >70% of the time or that give correlations for an external test set that are statistically significant using a Pearson or Spearman coefficient. We have described several metrics of model quality that could also be used, including the κ value.
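The leave-groups-out-at-random procedure can be sketched as follows (a minimal illustration; the fraction, repeat count, and seed shown are arbitrary parameters, not values from the study):

```python
import random

def leave_group_out(n, fraction=0.2, repeats=100, seed=0):
    """Yield (train_indices, test_indices) splits that leave out a
    random `fraction` of the n data points on each repeat
    (e.g., leave out 20%, 100 times)."""
    rng = random.Random(seed)  # seeded for reproducibility
    k = max(1, int(n * fraction))
    for _ in range(repeats):
        test = set(rng.sample(range(n), k))
        train = [i for i in range(n) if i not in test]
        yield train, sorted(test)
```

A model would be refit on each training index set and scored on the corresponding held-out set, and the distribution of scores across repeats indicates robustness.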
In this study, we have focused on models for some key ADME/Tox endpoints (namely, metabolic stability in HLM, passive permeability, and P-gp efflux), for which Pfizer has very large proprietary datasets. There have been several computational models of metabolic stability in the literature. For example, a recursive partitioning model containing 875 molecules with HLM metabolic stability was used to predict and rank the clearance of 41 drugs (Ekins, 2003). A k-nearest neighbor model of metabolic stability data using human S9 homogenate for 631 diverse molecules was able to adequately classify metabolism of a further set of more than 100 molecules (Shen et al., 2003). Partial least-squares regression QSAR models developed with molecular structure descriptors from QikProp (Jorgensen, 2004) and Diverse Solutions software were used with a set of 130 calcitriol analogs (Jensen et al., 2003). The latter model was used to select 20 molecules for in vitro testing with an 85% success rate (Jensen et al., 2003). To our knowledge, the current study with ∼193,000 compounds is probably the largest validated metabolic stability model to date. Although a vast number of models have been generated for ADME/Tox and physicochemical properties such as solubility (Cheng and Merz, 2003; Lind and Maltseva, 2003; Yamashita et al., 2006), even these models have not used such large numbers of compounds. Passive permeability has also been the focus of extensive modeling for many years, initially based on Caco-2 or Madin-Darby canine kidney cell data (Segarra et al., 1999; Ekins et al., 2000b, 2001b; Ren and Lien, 2000; Stenberg et al., 2000). These models generally do not take into account the role of efflux transporters, so there has been a parallel effort to build various computational models for the major transporters in the intestine such as P-gp (Ekins et al., 2002, 2007; Xue et al., 2004; Pleban et al., 2005; Chang et al., 2006).
This study also probably uses the largest training sets available in the industry, to our knowledge, for these two endpoints. Our model predictions could be combined to provide a more complete picture of absorption. Ultimately, other efflux and uptake transporters (Ekins et al., 2007) should also be modeled to account for outliers.
The results of this study may provide a starting point for a validated universal framework for enabling the sharing of ADME/Tox models and facilitating their use for making predictions by third parties, without the requirement of sharing sensitive molecule structure data. It remains to be seen how well the descriptors used mask the structure identity, and further studies will be required to assess this factor, such as those performed with other descriptors (Masek et al., 2008). We have not described the actual sharing of models or the format for doing so, because this will require testing to ensure compatibility and reproducibility among laboratories. These open models could be integrated into other software that would allow their selective sharing with selected users. One could readily integrate open molecular descriptors from the CDK, an LGPL Java cheminformatics library that is used in a wide variety of academic and commercial tools (Steinbeck et al., 2006), with modeling algorithms from the GPLed statistical software package, R (http://www.r-project.org). The CDK supports integration with R (Guha et al., 2006; Guha, 2008), so these two tools provide a promising starting point. An alternative open algorithm source is Weka (Frank et al., 2004), which is also widely used.
To be of further value, such models will require measures of prediction confidence and applicability domain, which will assist the user in interpreting model predictions. A combination of Tanimoto similarity, PCA, clustering, and Mahalanobis distance has been used to determine prediction confidence (Sheridan et al., 2004; Dimitrov et al., 2005; Ekins et al., 2006a; Tetko et al., 2006, 2008; Chekmarev et al., 2008, 2009; Kortagere et al., 2008, 2009). A prediction should not be provided if the test molecule is too far away from a training set member, as defined by the user based on a combination of distance and a similarity metric of choice. A prototype measure of prediction confidence is already in place for continuous models developed in this study (Gupta and Gifford, 2009). This confidence metric has a sound statistical foundation, capturing both error in prediction and distance (similarity) to the neighbors in the chemical space as defined by the descriptors used in the model. We had previously used this confidence metric to establish an in silico screening strategy, which also leads to an active learning implementation (data not shown). The idea behind this work is to use in silico models to select the compounds that must get in vitro data and to skip screening compounds whose values an in silico model can already predict with very high confidence (Fig. 1). This process would allow significant cost savings by not screening every compound. A recent study by us suggested an approximately 30% saving in in vitro testing by implementing computational models (Zientek et al., 2010). Clearly, we should also be cognizant of the chemical space coverage of the model. For example, we have visualized the large dataset of >193,000 HLM compounds using a PCA analysis (Fig. 2); the majority of the >2000 test compounds overlap the training set, but there are compounds that could be considered outside of the training set, and their removal may improve predictions.
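The screening triage just described reduces to a simple decision rule (a minimal sketch; the confidence scale and the threshold value are hypothetical, since the actual confidence metric of Gupta and Gifford combines prediction error and neighbor distance):

```python
def triage(prediction, confidence, threshold=0.9):
    """Active learning decision rule: trust the in silico value when the
    model's confidence is high; otherwise send the compound for in vitro
    screening. `confidence` on [0, 1] and `threshold` are assumptions."""
    if confidence >= threshold:
        return ("use_prediction", prediction)
    return ("send_to_assay", None)
```

Applied across a screening queue, such a rule is what yields the in vitro cost savings described above, because only the low-confidence compounds consume assay capacity.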
Part of the problem with some QSAR approaches is that a model output is often not inherently interpretable by other researchers. When the models are black boxes, the outputs will not be widely embraced. There have also been efforts made for ADME data visualization (Stoner et al., 2004a,b; Ekins et al., 2006b; Maniyar et al., 2006; Yamashita et al., 2006, 2008). Expanding these approaches to show outputs from multiple computational models in a color-coded or symbolic manner will require significant innovation to balance information complexity with intuitive graphical representations. Development of truly novel, simple, and interpretable visualization methods is not trivial, yet long overdue, and this could be based on open source ADME/Tox models developed in a manner similar to those described in the current study. Some thought should also be given to how the ADME/Tox models are used, e.g., early stages of compound library development to minimize the number of compounds synthesized and improve their ADME/Tox properties versus later in discovery to try to solve problems with these same properties.
This work could be greatly expanded in the future to build a community of models with data from a consortium of leading industry and academic partners by validating their quality and predictability. These data would represent precompetitive data, and we would need to ensure that molecular structures could not be distinguished or reverse-engineered from the training sets and descriptors upon which the models were built (Masek et al., 2008). ADME/Tox programs have traditionally been limited to using the same very small datasets from the literature or combining datasets from different groups. In addition, these datasets only cover a small region of chemical space focused on drug-like molecules, which tend to be compliant with the rule of 5 (Lipinski et al., 1997). This scenario may not be ideal because it could be too restrictive, considering there are many examples of U.S. Food and Drug Administration-approved drugs that fail these rules and others. Thus, there is a need for building models using data from various pharmaceutical and biotechnology companies and then securely sharing the models with collaborators or groups designated by the user. The advantage of using such data from pharmaceutical and biotech companies is that they have generally screened orders of magnitude more data (e.g., tens to hundreds of thousands of compounds under standardized conditions) than are in the public domain and thus have far better coverage of chemistry space. There is, of course, a tradeoff here between small local models useful for lead optimization (and limited chemical coverage) and large global models that may be useful for library screening and filtering (greater chemical coverage but less likelihood of distinguishing differences between similar structures).
Finding a quality dataset is still an issue because most of the experimental studies on large datasets to derive ADME/Tox properties are still performed by pharmaceutical companies and the data are inaccessible (Ekins and Williams, 2010). Some forums such as http://www.cheminformatics.org, QSAR World (http://www.qsarworld.com/), http://www.opentox.org, and http://www.openqsar.com are making an effort to collect these datasets as an open repository for chemoinformatics data as well as toolkits for models and descriptors, e.g., CDK (Steinbeck et al., 2006) and Mold2 (Hong et al., 2008).
The beneficiaries of open ADME/Tox models would be those in academia, foundations (e.g., working on neglected diseases such as tuberculosis and malaria), and pharmaceutical companies, which could avoid duplicative testing and cover more chemical space. Use of these models could result in improved predictions and greater applicability of such models for use by groups with compounds of interest, but with no idea of their ADME properties, and ultimately predict likely issues before they become major hurdles to a project. This study suggests a new approach to sharing ADME/Tox models built using widely available open descriptors and algorithms.
Footnotes
M.H., B.A.B., and S.E. are employees or consultants of Collaborative Drug Discovery, Inc.
This work was supported by Collaborative Drug Discovery, Inc. funding from the Bill and Melinda Gates Foundation [Grant 49852] (“Collaborative Drug Discovery for TB through a Novel Database of SAR Data Optimized to Promote Data Archiving and Sharing”).
Article, publication date, and citation information can be found at http://dmd.aspetjournals.org.
doi:10.1124/dmd.110.034918.
ABBREVIATIONS:
- ADME/Tox
- absorption, distribution, metabolism, excretion, and toxicity
- QSAR
- quantitative structure-activity relationship
- PDM
- Pharmacokinetics, Dynamics and Metabolism
- HLM
- human liver microsomes
- RRCK
- Russ Ralph canine kidney
- P-gp
- P-glycoprotein
- MDR
- multidrug resistance
- Clint
- intrinsic clearance
- Papp
- passive apparent permeability
- A
- apical
- B
- basolateral
- MOE2D
- Molecular Operating Environment 2D
- CDK
- chemistry development kit
- SVM
- support vector machine
- RP
- Recursive Partitioning
- RMSE
- root mean squared error
- PCA
- principal component analysis
- PC
- principal component
- SMARTS
- Smiles Arbitrary Target Specification.
- Received June 9, 2010.
- Accepted August 3, 2010.
- Copyright © 2010 by The American Society for Pharmacology and Experimental Therapeutics