Drug Metabolism and Disposition Fast Forward
First published on December 1, 2006; DOI: 10.1124/dmd.106.013185
0090-9556/07/3503-325-327$20.00
DMD 35:325-327, 2007
SHORT COMMUNICATION
Classification of Metabolites with Kernel-Partial Least Squares (K-PLS)
Mark J. Embrechts and Sean Ekins
Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, New York (M.J.E.); and GeneGo, Inc., St. Joseph, Michigan, and Department of Pharmaceutical Sciences, University of Maryland, Baltimore, Maryland (S.E.)
(Received October 2, 2006; accepted November 29, 2006)
 |
Abstract
|
|---|
Numerous experimental and computational approaches have been developed to predict human drug metabolism. Because databases of human drug metabolism information are widely available, they can be used to train computational algorithms and generate predictive models. In turn, these models may assist in identifying possible metabolites for a large number of molecules in drug discovery based on molecular structure alone. In the current study we used a commercially available database (MetaDrug) and extracted a fraction of the human drug metabolism data. These data were used along with augmented atom descriptors in a predictive machine learning model, kernel-partial least squares (K-PLS). A total of 317 molecules, including parent drugs and their primary and secondary (sequential) metabolites, were used to build models corresponding to individual metabolism rules, each representing the formation of a discrete metabolite, e.g., by N-dealkylation. Each model was internally validated to assess its ability to classify molecules that were left out. Using receiver operator curve statistics, models for N-dealkylation, O-dealkylation, aromatic hydroxylation, aliphatic hydroxylation, O-glucuronidation, and O-sulfation gave area under the curve values from 0.75 to 0.84 and correctly predicted between 61 and 79% of active molecules upon leave-one-out testing. This preliminary study indicates that K-PLS and possibly other similar machine learning methods (such as support vector machines) can be applied to predicting human drug metabolite formation in a classification manner. Improvements may be achieved by using considerably larger datasets that contain more positive examples for the less frequently occurring metabolite rules, as well as by external evaluation with novel molecules.
With the emphasis now on increasing the efficiency of drug discovery, there is interest in using predictive computational approaches to complement in vitro and in vivo studies. In the area of metabolism prediction, these techniques encompass pharmacophores (Ekins et al., 2001), quantitative structure-activity relationships (QSARs) (Shen et al., 2003; Balakin et al., 2004), electronic models (Korzekwa et al., 2004), and commercial drug metabolism databases (Borodina et al., 2004), as well as other methods that have been comprehensively reviewed elsewhere (de Graaf et al., 2005; Ekins et al., 2005a; de Groot, 2006). Some approaches have combined metabolite data and rules for suggesting metabolic pathways across multiple species (Erhardt, 2003). Such databases may also be useful for calculating the probability for a given metabolic reaction (Boyer and Zamora, 2002) to then indicate potential metabolites and the sites of metabolism using statistical or algorithmic approaches (Borodina et al., 2004). Although these types of comprehensive databases generally enable numerous search options to retrieve molecule structures and published information, the predictive capabilities seem limited at present (Wishart et al., 2006). A major limitation is that they are unlikely to have a complete dataset of reactions and molecular structures to extrapolate for a new molecule. In turn, the user is reliant on the quality of the published in vitro or in vivo data which, in many cases, may predate modern analytical methods, such that older published metabolic pathways may be incomplete. In reality, such database approaches provide knowledge of most published data and are perhaps limited to interpolation.
The combination of different approaches to drug metabolite prediction may balance the strengths and weaknesses of each approach, and several commercial methods are now pursuing this direction. MetaDrug represents one such method, combining a manually annotated database of human drug metabolism information including xenobiotic reactions, enzyme substrates, and enzyme inhibitors with kinetic data (Ekins et al., 2005b, 2006). This database has enabled the generation of rules for predicting likely metabolic reactions. The parent molecule and metabolites may then be scored through integrated QSAR models and rules for molecule reactivity before visualizing molecules as nodes on a network diagram (Ekins et al., 2005b, 2006).
Such rule-based metabolite predictions indicate that it is possible to generate many more metabolites than have been identified in the literature, which may make the methods less useful (Ekins et al., 2006). We are therefore investigating approaches to limit the metabolites to those that are most likely. Recently, a number of machine learning approaches including support vector machines and kernel-partial least squares (K-PLS) (Rosipal and Trejo, 2001) have been implemented in a single software package (Analyze/StripMiner), and this package was used with several benchmark datasets (Bennett and Embrechts, 2003) including protein binding and other physicochemical properties. The results with K-PLS indicated that it could be favorably applied to other datasets to enable QSAR model construction and aid drug discovery research. In the current proof of concept study, we have used K-PLS to generate preliminary classification models to identify whether a metabolite is likely to be produced for a particular parent molecule.
 |
Materials and Methods
|
|---|
Literature Data. Three hundred seventeen molecules were randomly extracted from the MetaDrug database (GeneGo Inc., St. Joseph, MI) (Ekins et al., 2006), and this represents a small fraction of the human drug metabolism content. These molecules were prepared as an sdf file containing data for the 65 metabolic pathways of interest (Ekins et al., 2005a) with binary data for the presence or absence of a metabolite.
Descriptor Calculation. ChemTree software (GoldenHelix, Bozeman, MT) running on a Pentium 4 processor was used to generate augmented atom molecular descriptors (Young et al., 2002) representing the presence or absence of a particular heavy atom with its immediately bonded neighbors. In total, 61 descriptors were generated for the set of molecules.
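The idea of an augmented atom descriptor can be illustrated with a minimal sketch. This is not the ChemTree implementation: the molecule encoding (a list of heavy-atom symbols plus index pairs for bonds), the key format such as `N(C,C,C)`, and the function names are all assumptions made for illustration, and bond orders are ignored.

```python
def augmented_atoms(atoms, bonds):
    """Return the set of augmented-atom keys for one molecule.

    atoms: heavy-atom element symbols, e.g. ["N", "C", "C", "C"]
    bonds: (i, j) index pairs between heavy atoms (bond order ignored here)
    """
    neighbours = [[] for _ in atoms]
    for i, j in bonds:
        neighbours[i].append(atoms[j])
        neighbours[j].append(atoms[i])
    # One key per atom: central element plus its sorted bonded neighbours.
    return {
        f"{atom}({','.join(sorted(nbrs))})"
        for atom, nbrs in zip(atoms, neighbours)
    }


def descriptor_matrix(molecules):
    """Build a binary presence/absence matrix over all keys seen."""
    keys = [augmented_atoms(atoms, bonds) for atoms, bonds in molecules]
    vocabulary = sorted(set().union(*keys))
    rows = [[1 if v in k else 0 for v in vocabulary] for k in keys]
    return vocabulary, rows


# Hypothetical toy inputs: trimethylamine and methanol (hydrogens omitted).
trimethylamine = (["N", "C", "C", "C"], [(0, 1), (0, 2), (0, 3)])
methanol = (["C", "O"], [(0, 1)])
vocabulary, rows = descriptor_matrix([trimethylamine, methanol])
```

Each row of `rows` is one molecule and each column records whether a given heavy-atom environment occurs in it, which is the binary form of descriptor described above.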
Data Preprocessing. Metabolic reactions with more than two examples of the metabolite rule were then used for modeling; this narrowed down the dataset considerably. The matrix of molecular descriptors and biological activity data was then scaled (normalized), and variables with unchanging values were removed using feature selection with the Analyze/StripMiner software (software available from M.J.E. at http://www.rpi.edu/locker/82/001182/) (Embrechts et al., 2001). From the descriptors with more than 95% correlation between each other (i.e., "cousin descriptors"), only the descriptors most correlated with the response were retained. In addition, four sigma outliers were brought within 2.5 sigma.
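The preprocessing steps can be sketched as follows. This is one possible reading, not the Analyze/StripMiner code: the function names are hypothetical, scaling is assumed to be a z-score, "brought within 2.5 sigma" is assumed to mean clipping values beyond 4 standard deviations to ±2.5, and cousin descriptors are assumed to be resolved greedily in order of correlation with the response.

```python
import math


def scale_and_clip(column, outlier=4.0, clip=2.5):
    """Z-score a column; return None if its values never change."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if sd < 1e-12:
        return None  # unchanging variable, to be removed
    z = [(v - mean) / sd for v in column]
    # Assumed rule: values beyond 4 sigma are pulled back to +/-2.5 sigma.
    return [math.copysign(clip, v) if abs(v) > outlier else v for v in z]


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def preprocess(descriptors, response, max_corr=0.95):
    """descriptors: dict name -> values; response: non-constant 0/1 list."""
    scaled = {}
    for name, column in descriptors.items():
        z = scale_and_clip(column)
        if z is not None:
            scaled[name] = z
    # Visit descriptors from most to least correlated with the response and
    # keep one only if it is not a "cousin" of a descriptor already kept.
    ranked = sorted(scaled, key=lambda k: -abs(pearson(scaled[k], response)))
    kept = []
    for name in ranked:
        if all(abs(pearson(scaled[name], scaled[k])) <= max_corr for k in kept):
            kept.append(name)
    return {name: scaled[name] for name in kept}


# Hypothetical toy data: one constant column and one pair of cousins.
toy = {
    "const": [1.0] * 6,
    "a": [0, 0, 1, 1, 0, 1],
    "a_cousin": [0, 0.1, 1, 0.9, 0, 1],
    "b": [1, 0, 1, 0, 1, 0],
}
kept = preprocess(toy, response=[0, 0, 1, 1, 0, 1])
```

On this toy input the constant column and the near-duplicate of `a` are dropped, leaving `a` and `b`.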
K-PLS Modeling Method and Testing. The Analyze software uses the K-PLS method (Rosipal and Trejo, 2001) with two key parameters, the number of latent variables and the Parzen window or Gaussian kernel sigma. In this study, the number of latent variables was held fixed at 5, and the Gaussian kernel sigmas were tuned using a second-order Newton method in which the performance criterion is the error minimization on the validation data using 5-fold cross-validation. The sigmas were tuned just once, using the metabolite with the most positive instances. Sigma tuning on a single metabolite is a conservative approach that prevents over-tuning. Furthermore, the fact that the model still has good predictive power on the other metabolites is another indication that over-tuning did not occur in this case.
FIG. 1. Representative receiver operator curves to demonstrate the leave-one-out validation of K-PLS classification model for N-dealkylation (non-straight line). Diagonal = random rate.
K-PLS uses kernels and can therefore be seen as a nonlinear extension of the PLS method. The commonly used radial basis function kernel or Gaussian kernel was applied, where the kernel is expressed as follows (Christianini and Shawe-Taylor, 2000):
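For reference, the Gaussian kernel is conventionally written in the form below. The notation (descriptor vectors $\mathbf{x}_i$ and $\mathbf{x}_j$, kernel width $\sigma$) is assumed here, and the authors' exact expression may differ, for example in the factor of 2. A multiple-sigma variant would assign one width $\sigma_k$ to each descriptor $k$.

```latex
k(\mathbf{x}_i, \mathbf{x}_j)
  = \exp\left( -\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2}{2\sigma^2} \right),
\qquad
k_{\text{multi}}(\mathbf{x}_i, \mathbf{x}_j)
  = \exp\left( -\sum_{k} \frac{(x_{ik} - x_{jk})^2}{2\sigma_k^2} \right)
```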
The K-PLS method can be reformulated to resemble support vector machines, but it can also be interpreted as a kernel with centering transformation of the descriptor data followed by a regular PLS method (Bennett and Embrechts, 2003).
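The "kernel, centering, then PLS" reading can be sketched for a single response. This is a simplified illustration, not the Analyze/StripMiner code: it uses one shared sigma with a factor of 2 in the denominator (both assumptions), a NIPALS-style deflation of the kind described for kernel PLS, and it returns fitted values for the training molecules only. Function names and the toy data are hypothetical.

```python
import math


def gaussian_kernel(X, sigma):
    """Gaussian kernel matrix for the rows of X with one shared sigma."""
    return [
        [math.exp(-sum((a - b) ** 2 for a, b in zip(xi, xj)) / (2 * sigma ** 2))
         for xj in X]
        for xi in X
    ]


def center_kernel(K):
    """Center a symmetric kernel matrix in feature space."""
    n = len(K)
    row_mean = [sum(r) / n for r in K]
    grand = sum(row_mean) / n
    return [[K[i][j] - row_mean[i] - row_mean[j] + grand for j in range(n)]
            for i in range(n)]


def kpls_fitted_values(K, y, n_latent):
    """Training-set fitted values of single-response K-PLS."""
    n = len(y)
    mean_y = sum(y) / n
    resid = [v - mean_y for v in y]          # centered response
    K = center_kernel(K)
    fitted = [mean_y] * n
    for _ in range(n_latent):
        # Score vector t = K * residual, scaled to unit length.
        t = [sum(K[i][j] * resid[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in t))
        if norm < 1e-12:
            break
        t = [v / norm for v in t]
        c = sum(ti * ri for ti, ri in zip(t, resid))
        fitted = [f + c * ti for f, ti in zip(fitted, t)]
        resid = [ri - c * ti for ri, ti in zip(resid, t)]
        # Deflate: K <- (I - t t^T) K (I - t t^T).
        Kt = [sum(K[i][j] * t[j] for j in range(n)) for i in range(n)]
        tKt = sum(ti * v for ti, v in zip(t, Kt))
        K = [[K[i][j] - t[i] * Kt[j] - Kt[i] * t[j] + t[i] * t[j] * tKt
              for j in range(n)] for i in range(n)]
    return fitted


def sum_sq_error(y, fitted):
    return sum((a - b) ** 2 for a, b in zip(y, fitted))


# Hypothetical toy problem: one descriptor, binary response.
X = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
y = [0, 0, 0, 1, 1, 1]
K = gaussian_kernel(X, sigma=1.5)
fit_1 = kpls_fitted_values(K, y, n_latent=1)
fit_3 = kpls_fitted_values(K, y, n_latent=3)
```

Each latent variable removes a further component of the residual, so the training error cannot increase as latent variables are added; choosing how many to keep (5 in this study) is a separate validation question.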
For the predictive modeling on the other metabolites, the same sigmas were used. Sigma tuning also provides a way to identify the most relevant attributes, because attributes with larger sigma values are less relevant. After sigma tuning, the individual metabolites were predicted using K-PLS with a Gaussian kernel with multiple sigmas, using a leave-one-out procedure. Because the number of negative instances of a metabolite generally exceeded the number of positive examples, the discrimination between positive and negative cases was made using a bias with a threshold of 0.5 for choosing the operating point on the receiver operator curve. The area under the curve (AUC) values were also calculated, with higher values indicating better classification. Because of the imbalance in the number of positive and negative examples, the balanced error rate was calculated as the average of the percentage of positive cases classified correctly and the percentage of negative cases classified correctly. In this case, higher numbers are preferable.
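The evaluation statistics can be sketched as follows. The function names, the toy scores, and the use of `>=` at the 0.5 threshold are assumptions for illustration; the scores stand in for leave-one-out predictions. Note that the quantity the text calls the balanced error rate is an average of two percent-correct values, so higher is better.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def classification_rates(scores, labels, threshold=0.5):
    """Return (fraction of positives correct, fraction of negatives correct,
    their average) at the given decision threshold."""
    n_pos = sum(1 for l in labels if l == 1)
    n_neg = len(labels) - n_pos
    true_pos = sum(1 for s, l in zip(scores, labels)
                   if l == 1 and s >= threshold)
    true_neg = sum(1 for s, l in zip(scores, labels)
                   if l == 0 and s < threshold)
    pos_correct = true_pos / n_pos
    neg_correct = true_neg / n_neg
    return pos_correct, neg_correct, (pos_correct + neg_correct) / 2


# Hypothetical held-out predictions for 3 positives and 5 negatives.
scores = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1, 0.3, 0.55]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
auc = roc_auc(scores, labels)
pos_correct, neg_correct, balanced = classification_rates(scores, labels)
```

Averaging the two rates keeps a model from scoring well simply by predicting the majority (negative) class for every molecule.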
 |
Results and Discussion
|
|---|
Tools for predicting potential metabolites of small molecule substrates in early drug discovery are important in guiding lead optimization to produce drug candidates with desirable metabolic and toxicological properties. We have recently developed and tested a computational tool that comprises a rule-based method for metabolite prediction, integrated QSAR models, and a database of human metabolic and signaling information (Ekins et al., 2006). In silico metabolite prediction typically generates many more potential metabolites than are actually observed. Machine learning tools combined with databases of human metabolism information offer a way to produce more reliable predictions of metabolites from an input structure alone. In the current study, two-dimensional molecular descriptors were calculated for each of the more than 300 molecules with metabolism information selected from the MetaDrug database. Twenty-three of the 65 reactions had sufficient binary data for modeling, and a K-PLS model was produced for each using the Analyze/StripMiner software (Bennett and Embrechts, 2003). We evaluated the resulting classification models for predicting metabolic reactions after leave-one-out testing (Table 1). In general, we found that the reactions that are well populated with literature data (e.g., N-dealkylation, aromatic and aliphatic hydroxylation, and O-glucuronidation) produced K-PLS models that perform well when assessed using the AUC value and the receiver operator curve plots (Table 1; Fig. 1). As expected, models for reactions with few positive instances of an observed metabolite (generally non-cytochrome P450-related reactions) are of poorer quality, indicating that no reliable classification can be made. Exceptions include N-hydroxylation and double bond peroxidation, in which there are remarkably few positive examples, but results are favorable for predictions, indicating that the examples provided generate useful rules based on path length descriptors. This preliminary work with both phase I and II reactions indicates that such an approach generally requires much larger databases than were used here, which will be available in later versions of MetaDrug. Despite this, K-PLS models for N-dealkylation, O-dealkylation, aromatic hydroxylation, aliphatic hydroxylation, O-glucuronidation, and O-sulfation reactions had AUC values between 0.75 and 0.84 and correctly predicted between 61 and 79% of active molecules upon leave-one-out testing while, more importantly, the balanced error predictions were between 70 and 82%. Therefore, this represents a useful method to classify the potential for an unknown molecule to undergo these particular metabolic reactions. However, this approach requires further testing using considerably more data for the many sparsely populated metabolic reactions. In addition, external validation of all models with large test sets of molecules will be required alongside measures to ensure that a prediction is reliable, such as those based on molecule similarity.
This work represents the first occasion, to our knowledge, that K-PLS has been used for metabolite prediction, and the results obtained are promising with unbalanced datasets. The integration of this K-PLS approach with rule-based and other QSAR methods could result in a more effective method for metabolite prediction that would be useful in numerous drug discovery applications where reliable metabolite identification is important.
TABLE 1. Results of applying K-PLS models to human drug metabolism data for different reactions using 317 molecules
Percentage correct represents the prediction for positive instances for a metabolite. Balanced error represents the average of the correct positive and correct negative predictions.
 |
Footnotes
|
|---|
The development of MetaDrug was supported by National Institutes of Health Grants 1-R43-GM069124-01 and 2-R44-GM069124-02 "In silico Assessment of Drug Metabolism and Toxicity".
Competing Financial Interest: MetaDrug is a proprietary tool developed and licensed by GeneGo, Inc.
Article, publication date, and citation information can be found at http://dmd.aspetjournals.org.
doi:10.1124/dmd.106.013185.
ABBREVIATIONS: QSAR, quantitative structure-activity relationship; K-PLS, kernel-partial least squares; AUC, area under the curve.
Address correspondence to: Dr. Sean Ekins, ACT LLC, 601 Runnymede Ave., Jenkintown, PA 19046. E-mail ekinssean{at}yahoo.com, sekins{at}arnoldllc.com
 |
References
|
|---|
Balakin KV, Ekins S, Bugrim A, Ivanenkov YA, Korolev D, Nikolsky Y, Skorenko SA, Ivashchenko AA, Savchuk NP, and Nikolskaya T (2004) Kohonen maps for prediction of binding to human cytochrome P450 3A4. Drug Metab Dispos 32: 1183-1189.
Bennett KP and Embrechts MJ (2003) An optimization perspective on kernel partial least squares regression, in Advances in Learning Theory; Methods, Models and Applications (Suykens JAK, Horvath G, Basu S, Micchelli J, and Vandewalle J eds), pp 227-250. IOS Press, Amsterdam.
Borodina Y, Rudik A, Filimonov D, Kharchevnikova N, Dmitriev A, Blinova V, and Poroikov V (2004) A new statistical approach to predicting aromatic hydroxylation sites. Comparison with model-based approaches. J Chem Inf Comput Sci 44: 1998-2009.
Boyer S and Zamora I (2002) New methods in predictive metabolism. J Comput-Aided Mol Des 16: 403-413.
Christianini N and Shawe-Taylor J (2000) Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, MA.
de Graaf C, Vermeulen NP, and Feenstra KA (2005) Cytochrome P450 in silico: an integrative modeling approach. J Med Chem 48: 2725-2755.
de Groot MJ (2006) Designing better drugs: predicting cytochrome P450 metabolism. Drug Discov Today 11: 601-606.
Ekins S, Andreyev S, Ryabov A, Kirilov E, Rakhmatulin EA, Bugrim A, and Nikolskaya T (2005a) Computational prediction of human drug metabolism. Exp Opin Drug Metab Toxicol 1: 303-324.
Ekins S, Andreyev S, Ryabov A, Kirillov E, Rakhmatulin EA, Sorokina S, Bugrim A, and Nikolskaya T (2006) A combined approach to drug metabolism and toxicity assessment. Drug Metab Dispos 34: 495-503.
Ekins S, de Groot M, and Jones JP (2001) Pharmacophore and three-dimensional quantitative structure activity relationship methods for modeling cytochrome P450 active sites. Drug Metab Dispos 29: 936-944.
Ekins S, Nikolsky Y, and Nikolskaya T (2005b) Techniques: application of systems biology to absorption, distribution, metabolism, excretion, and toxicity. Trends Pharmacol Sci 26: 202-209.
Embrechts M, Arciniegas F, Ozdemir M, and Momma M (2001) Scientific data mining with StripMiner, in IEEE Mountain Workshop on Soft Computing in Industrial Applications; 2001 June 25-27; Virginia Tech, Blacksburg, VA.
Erhardt PW (2003) A human drug metabolism database: potential roles in the quantitative predictions of drug metabolism and metabolism-related drug-drug interactions. Curr Drug Metab 4: 411-422.
Korzekwa K, Ewing TJ, Kocher JP, and Carlson TJ (2004) Models for cytochrome P450-mediated metabolism, in Pharmaceutical Profiling in Drug Discovery for Lead Selection (Borchardt RT, Kerns EH, Lipinski CA, Thakker DR, and Wang B eds), pp 69-80. AAPS Press, Arlington, VA.
Rosipal R and Trejo LJ (2001) Kernel Partial Least Squares regression in reproducing Kernel Hilbert Space. J Machine Learning Res 2: 97-123.
Shen M, Xiao Y, Golbraikh A, Gombar VK, and Tropsha A (2003) Development and validation of k-nearest neighbour QSPR models of metabolic stability of drug candidates. J Med Chem 46: 3013-3020.
Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, and Woolsey J (2006) DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res 34: D668-672.
Young SS, Gombar VK, Emptage MR, Cariello NF, and Lambert C (2002) Mixture deconvolution and analysis of Ames mutagenicity data. Chemom Intell Lab Syst 60: 5-11.