Introduction

The human genome includes at least 57 genes coding for cytochrome P450 proteins, and 29 pseudo-genes.1 Among them, CYP2D6 (OMIM 124030) has a crucial role in the metabolism of over 40 drugs, including β-adrenergic blocking agents, antiarrhythmics, antipsychotics, antidepressants, and narcotic analgesics. Genetic variation at P450-CYP2D6 causes differences in the catalytic activity of the enzyme. In studies based on the metabolism of probe drugs, several phenotypic classes are defined. Individuals defined as poor metabolizers (PM) generally carry two nonfunctional CYP2D6 alleles, while the presence of one functional allele is generally sufficient to determine the extensive metabolizer (EM) phenotype. In addition, individuals carrying chromosomal rearrangements with two or more active copies of the CYP2D6 gene tend to be classified as ultrarapid metabolizers (UM). However, the limits among phenotypic classes are uncertain (e.g. several authors recognize an additional class of intermediate metabolizers, IM,2 and the continuous distribution of phenotypes suggests interaction with other genes and/or more complex interactions among alleles than simple dominance (see Figure 1 of Bertilsson et al3). Therefore, the relationship between metabolic ratios (ie the phenotypes) and the genotypes of CYP2D6 is far from clarified.

Figure 1
figure 1

Structure of the CYP2D6 gene (a), and schematic representation of a region of chromosome 22 with a duplicated CYP2D6 (b). Black boxes are exons.

The CYP2D6 gene is located on 22q13.1, at the 3′-end of the CYP2D cluster, downstream of the CYP2D8P and CYP2D7P pseudogenes.4, 5 To date, more than 70 different variants have been described (www.imm.ki.se/CYPalleles/cyp2d6), differing for single-base changes, short insertions and deletions, or for major rearrangements such as deletion6 and duplications7 of the whole gene. For the sake of clarity, in what follows we shall refer to these variants as haplotypes, whereas we shall use the term alleles only to refer to the alternative nucleotides at polymorphic DNA sites. Therefore, in this study we shall call haplotypes the genetic variants of CYP2D6 that are called alleles in most previous studies. In other words, here haplotype means a set of alleles transmitted together. The information about their association on a chromosome is called their phase.

Genetic variation at CYP2D6 is high, both among populations and among individuals of the same population.8 Two common haplotypes, *1 and *2, represent each between 7 and 40% of the gene pool in most world populations, and haplotype *5, the whole gene deletion, between 1 and 5%. Other defective haplotypes seem to show continental specificities, although no continent so far has been studied in sufficient detail to allow robust conclusions. Haplotype *4 represents 15–20% of the European gene pool,2, 9 haplotype *10 has a high frequency, up to 65% in Asia,10 haplotype *17 is present at frequencies around 20–30% in Africa,11 and all are rare or absent elsewhere in the world. Haplotype *4 is a loss-of-function haplotype, whereas haplotypes *10 and *17 confer a decreased enzyme activity.12, 13 The high frequency of these haplotypes in certain populations is hard to account for under a simple mutation-selection model. Haplotypes containing multiple gene copies were found at high frequencies only in specific regions like Ethiopia (29%)14 and Saudi Arabia (21%),15 while their frequency in the rest of the world ranges between 1 and 3%.

In this study, we genotyped CYP2D6 in six populations of the Mediterranean region, defining the haplotype phases from genotype data by computational methods. We further analysed these data, looking for levels of linkage disequilibrium (LD), trying to establish the evolutionary relationships among haplotypes, and testing for population differentiation among six European and Near-Eastern population samples. The shape of the evolutionary tree, and the similarities observed among populations living at large geographic distances, suggest that selective pressures have been important in determining variation at this locus.

Materials and methods

Sampling

We genotyped nine SNPs in the coding region of the CYP2D6 locus (Figure 1a) in 247 individuals from six populations sampled in the course of a project aimed at describing genetic diversity around the Mediterranean sea. We had available DNA from 51 Syrians,16 28 Ladin speakers from the Italian Alps,17 51 Southern Spaniards, 38 Basques, 48 Sardinians, and 31 Central Italians. All of them gave their informed consensus to the authors of the original studies for which the blood samples were collected. Basques, Ladin speakers, and Sardinians are known to be among the most genetically divergent European populations for several loci.18, 19 For each individual, we also tested for the presence of gene duplication or deletion of the whole coding region.

Genotyping

We used a combination of long polymerase chain reactions (PCRs) and a multiplex single-base extension system to detect gene deletion and duplication, and to characterize nine variable positions (Figure 1a) in the coding region of CYP2D6. In this way, we could identify the mutations which are generally supposed to define20, 21 more than 90% of the total known variants among Europeans. Mutation 1023 C>T is rare in Europe, but it was also considered in the screening because it is known to be common in African populations.

The genotyping method we summarize here will be described in greater detail in a forthcoming paper (Sistonen et al., submitted). Three parallel long PCRs were run for each sample. We looked for CYP2D6 gene duplication using the primer pair pCYP-207-F and pCYP-32-R.22 A control band was obtained adding the pCYP-13-F forward primer6 to the reaction mix. To detect the presence of the entire gene deletion, having at the same time an internal control band, we used again a set of three different primers: pCYP-13-F/pCYP-207-F/ pCYP-24-R6, 22 (Figure 2).

Figure 2
figure 2

A schematic description of CYP2D6 duplication-specific (a) and deletion-specific (b) genotyping reactions. Arrows correspond to the positions of primers, which are described in the text.

We obtained a 5.1 kb fragment containing all nine CYP2D6 exons with primers CYP2D6-F and CYP2D6-R.23 This product was used as a template to type 9 positions in one reaction, based on single-base primer extension with fluorescent labelled ddNTPs (ABI Prism SNaPshot Multiplex Kit, Applied Biosystems). The SNaPshot results were double-checked in the case of one (Basque) individual carrying the new haplotype *4Bas, using a combination of nested PCR and restriction fragment length polymorphism (RFLP) as described in Sachse et al.20

An additional PCR-SNaPshot assay was used to distinguish between different types of allele duplications. We amplified a 9.3 kb fragment spanning the genomic region between exon 9 and intron 2 of two subsequent CYP2D6 genes (Figure 1b), using primers P2x2F and P2x2R.24 The 9.3 kb fragment was used as a template in a specific SNaPshot reaction for exon 9, 1 and 2 polymorphic positions.

Haplotype determination

As a preliminary step, we inferred the haplotypes from SNP data using a PHASE reconstruction method.25 PHASE is a Bayesian statistical method for haplotype reconstruction, which incorporates the prior assumption that unresolved haplotypes will tend to be similar to known haplotypes, observed in unambiguous genotypes. PHASE does not rely on other assumptions, such as Hardy–Weinberg equilibrium. Computer simulations studies25, 26 clearly suggest that PHASE performs better than other algorithms, such as EM, in a wide range of conditions.

Hardy–Weinberg equilibrium

Departures from Hardy–Weinberg equilibrium were assessed in each sample, using the exact Fisher test implemented in the Arlequin package.27 The test was performed for each SNP separately, as well as for the haplotypes defined by PHASE.

Linkage disequilibrium

Once the haplotypes had been defined by PHASE, we wanted to know whether recombination has played a significant role in shaping CYP2D6 variation in the sampled populations. For that purpose, we tested for LD between each pair of SNPs in each sample. The significance of allelic association was evaluated by means of the extension of Fisher's exact test on contingency tables implemented in the Arlequin software,27 and using Bonferroni correction for multiple comparisons.28 Further, the D′ statistic,29 measuring the degree of association between SNP loci, was computed in each sample.

An evolutionary tree of haplotypes

The relationships among the haplotypes defined by the PHASE algorithm were summarized by a statistical parsimony tree or cladogram30 using the software TCS.31 With this method, a tree is constructed by connecting haplotypes when the probability that there is no intermediate haplotype in the sample is greater than 0.95. The tree obtained in this way does not have a root, that is, it is actually a network. One of its advantages, over strictly bifurcating trees, is that it allows one to identify, through the presence of loops in the reconstructed phylogeny, possible episodes of recombination and homoplasies.

Intra-population diversity indices

We calculated in our samples three summary indices of diversity, by means of the Arlequin software27: (1) gene diversity, that is, the probability that two randomly chosen haplotypes are different in the sample; (2) average gene diversity over the polymorphic sites, that is, the probability that two randomly chosen nucleotides in the sample at the same site are different, and (3) the mean number of pairwise differences between all pairs of haplotypes, π.

For comparison, we estimated these statistics in two additional, African, samples, one from Ghana (sample size: 193)11 and another from Tanzania (sample size: 106).32 The haplotypes were reconstructed in these samples from SNP data, according to the Human Cytochrome P450 (CYP) Allele Nomenclature (www.imm.ki.se/CYPalleles/cyp2d6).

Population differentiation

An analysis of molecular variance (AMOVA)33 was used to estimate how genetic variation is partitioned among population and individuals. Genetic distances (Φst) were also estimated between populations using the Arlequin software.27 The statistical significance of the variance components and of the genetic distances was evaluated by permuting haplotypes among populations.

Results

Haplotype determination

We carried out our survey using a SNaPshot test designed to determine the individual genotype at the nine most common polymorphic positions. Like all other SNP-based approaches, this genotyping procedure does not provide the haplotype phase, which can be directly determined only by using other labour-intensive approaches. Therefore, we inferred the haplotype phases from genotype data by a computational method. Table 1 shows the association of the SNPs into haplotypes. The haplotypes inferred in this way correspond exactly to those empirically proposed by previous studies (www.imm.ki.se/CYPalleles/cyp2d6). In addition, among the Basques, we identified for the first time a haplotype that could be considered a new variant of *4 (1661G>C, 1846G>A, and 4180G>C), bearing the splice site mutation responsible for loss of activity, but without the substitution 100 C>T, usually associated with a transition at 1846. We further confirmed this result by nested PCR followed by RFLP analysis for the four aforementioned SNPs.20

Table 1 SNP positions and frequencies of haplotypes defined by the PHASE algorithm in the whole set of samples

Haplotype and genotype frequencies, and Hardy–Weinberg equilibrium

The frequencies of 12 haplotypes are given in Table 2. Genotype frequencies are available upon request. Four genotypes, *1/*1, *1/*2, *2/*2, and *1/*4, account for more than half of the total in each population; none of the populations of this study showed significant departures from the Hardy–Weinberg equilibrium expectations. Also present were all the haplotypes reported to be associated with lower or null metabolic activity that could be identified by our SNaPshot test (*3, *6, *9, *10, and *17). Their frequencies (Table 2) are in good agreement with previous studies on Europe.2, 9, 20 One individual from Andalusia was found to be a carrier of the typically African haplotype *17. The high frequency of haplotypes with duplications in Syria (*1 × N+*2 × N=7.8%) agrees with previous observations of high frequencies of duplications in the Near East and in Africa.14, 15

Table 2 Frequencies of observed CYP2D6 haplotypes in six European populations

Linkage disequilibrium

We measured LD between pairs of SNPs distributed across the 4.1 kb genomic region we screened. However, sites with minor allele frequency <0.1 were excluded, because for them the power of the methods to detect LD is severely reduced.34 Gene duplications and deletions were excluded from this analysis. Both in the analysis of each single population, and in their joint analysis, the measure of LD was the highest observable (D=1) and remained significant after the Bonferroni correction (P<0.001) between all pairs of alleles.

Simple calculations on the number of haplotypes (Table 3) confirm a strong association among mutations. If the only evolutionary force generating new variants from one founder haplotype were the process of mutation, one would expect to observe up to n+1 haplotypes, where n is the number of SNPs considered. On the contrary, if mutations were reassorted in all possible manners by recombination, one would expect to observe 2n different haplotypes. Table 3 shows that the total number of haplotypes is less than n+1 in each of the six sampled populations. Therefore, both tests of LD, and counts of the numbers of different haplotypes, suggest that recurrent recombination has not played a significant role in shaping CYP2D6 variation at SNP sites. This result is in agreement with the identification of an LD block across a 390 kb region spanning CYP2D6,35 and allows us to treat the DNA region of interest as a single non-recombining genomic block36 in the subsequent tests.

Table 3 Observed and expected numbers of different haplotypes by population

Evolutionary trees of haplotypes

The network illustrating the relationships among the haplotypes found in our six populations is shown in Figure 3. As there is no information yet on CYP2D6 in the closest evolutionary relatives of humans, the great apes, it was impossible to root this network, and hence we have no objective basis to define the historical sequence of mutations that led to the current diversity. Three clusters are apparent in the network. The first cluster corresponds to haplotype *1 and its presumably derived (because rare) variants, namely the two nonfunctional haplotypes *3 and *6 and the reduced-function haplotype *9. The second cluster is mainly represented by haplotype *2. The 1023 C>T mutation-step leads to *17, known to be responsible for the expression of an enzyme with altered substrate affinity.13 The branch topology of the *10–*4 cluster is not totally resolved, as shown by the presence of a closed loop.

Figure 3
figure 3

Tree of haplotypes inferred from the analysis of nine SNPs. Figures on each branch indicate the mutated site for which two haplotypes differ. The size of the circles is proportional to the haplotype frequency in the samples of this study. Full-function, decreased-function and null-function haplotypes are, respectively, in white, grey and black.

Population diversity

Measures of genetic diversity within the six populations of this study were compared with those estimated from two African populations (Table 4). No Mann–Whitney nonparametric test on the six statistics estimated reached significance. Therefore, contrary to what is observed for the generality of nuclear markers,37 we found no evidence of excess diversity in Africa, regardless of whether or not major gene rearrangements (deletions and duplications) were considered. Actually, when deletions and duplications were excluded from the analyses, the African samples tended to show lower internal diversity than the samples of this study. In principle, this result might mean that CYP2D6 is exceptionally variable in Europe with respect to most other genome markers, but probably it reflects an inadequate characterization of the African samples. In other words, variation in Africa may have been underestimated, because the sites typed in these populations, with the exception of the 1023 C>T transition, are sites that were found to be polymorphic in European subjects.9 This interpretation is also supported by the poor correlation observed among Africans between CYP2D6 haplotypes and phenotypes, suggesting that the nucleotide substitutions affecting gene expression have not been thoroughly described in those samples.11, 38

Table 4 Measures of internal genetic diversity in two African populations, and in the six populations of this study

The AMOVA analysis in the six samples of this study revealed that within-population variation, that is, genetic differences between individuals within populations, account for 100% of total genetic variation (data not given). Therefore, the ΦST distances between the sampled populations are essentially 0. Nevertheless, the Syrian sample shows some increased genetic divergence from the other populations. Aklillu et al39 proposed that selective pressures associated with diet may have favoured the spread of duplicated or anyway very active alleles in East Africa. The high frequency of duplicated haplotypes in Syria (as well as in Arabia15) may be due to the same phenomenon.

Discussion

In this study, we typed nine SNPs at the CYP2D6 locus in nearly 500 chromosomes from six populations of the Mediterranean region. The haplotypes correspond to those commonly found using more complex genotyping techniques (http://www.imm.ki.se/CYPalleles/cyp2d6.htm), confirming the association on the chromosomes of alleles in haplotypes that can be recognized on the basis of a small number of key mutations. In addition, we described among the Basques a previously unknown haplotype, which we provisionally label *4Bas, because of the presence of the substitution in position 1846, which characterises all *4 haplotypes.

LD statistics reached the highest possible values among all suitable mutations, an indication, although not proof, that CYP2D6 may be transmitted as a single block of DNA, not disrupted by recombination. That finding has significant practical implications. In general, the human genome is believed to be largely composed of blocks (around 22 kb in size in European populations), within which recombination is scarce or absent, separated by hotspots of meiotic recombination36 (for a partly different view, see Anderson and Slatkin40). In agreement with previous work suggesting the existence of a large LD block covering the CYP2D cluster,35 our results show that, for most practical purposes, the CYP2D6 gene may be treated as a nonrecombining unit in the genome. That finding, if confirmed, would provide what seems a simple way to rapidly characterize a rather large genome region, thus increasing the power of association tests.41

Absence of inferred recombinant gametes also simplified the task to establish the evolutionary relationships among alleles, which we represented in tree form. The obvious next question to ask is whether this variation reflects the chance effects of mutation and drift, or whether sequence changes are reflecting some sort of selective pressures. To address this question, it will be necessary to sequence the entire gene in different populations, looking for the excess of intermediate-frequency haplotypes which typically reflect balancing selection.42 That balancing selection may be contributing to shaping diversity at CYP2D6 is suggested by the fact that subfunctional, and even non-functional, haplotypes such as *4 occur at high frequencies, apparently inconsistent with a high selection coefficient against them. At this stage, one can only speculate that protein variants that might interact with a broader range of xenobiotics, or with classes of xenobiotics that are not commonly metabolized by the common haplotypes, would have an obvious evolutionary advantage.

The six populations of the Mediterranean shores we studied do not appear genetically differentiated, although Syria, a geographic outlier, also shows a higher (if insignificant) frequency of gene duplications. Garte et al43 observed little genetic heterogeneity among Europeans for many genes involved in drug metabolism, including NAT2, CYP1A1, and GSTM1. The GSTT1 locus seemed to be the only exception, whereas CYP2D6 variation was not analysed because of the inconsistency of data sets available in the literature, mainly due to the different laboratory typing methods. In this paper, we used consistently the same method for genotyping all populations, so that we could examine the hypothesis of a European genetic homogeneity at the CYP2D6 locus too. An unexpected finding was that genetic differences are minimal at CYP2D6, despite the inclusion in our study of three of the main genetic outliers of Europe, namely the Basques, the Sardinians, and the populations of the Eastern Italian Alps. These populations are known to differ substantially from their neighbours, both at autosomal and at uniparentally transmitted loci,17, 44, 45, 46 which is largely regarded as an effect of reproductive isolation, caused by both geographic and linguistic barriers to gene flow.47, 48 However, a thorough study of autosomal SNPs in these populations has not been carried out yet. Therefore, the possibility should be considered, at least in principle, that most autosomal SNP loci are poorly differentiated in Europe, and that CYP2D6 be no exception. Alternatively, should greater differences emerge among Europeans for other autosomal SNPs (possibly reflecting population divergence through independent random genetic drift18), one could speculate that European populations underwent the same selective pressures for loci of pharmacogenetic interest, so that their CYP2D6 haplotype frequencies, as well as those of NAT2, CYP1A1, and GSTM1, are similar.

To further progress in our understanding of the origins and maintenance of diversity at CYP2D6, a crucial step will be the sequencing of the gene in some primate species. Only in this way shall we be able to identify the direction of the evolutionary change, thus estimating the age of the observed alleles, and hence potentially correlating those ages with the dates at which the main human demographic transitions took place. In parallel, identifying the ancestral alleles is crucial to understanding whether current diversity evolved from a fully functional haplotype, or if more complicated phenomena occurred. However, the results of this study already indicate that: (1) substitutions altering the amino-acid sequence (100 C>T, 1707 T>G, 1846 G>A, 2549 delA, 2613-15 delAGA, 2850 C>T, 4180 G>C) have been maintained within coding regions of the gene; (2) that these mutations occurred early enough in evolution to be shared by most or all Mediterranean populations; and (3) that, despite the evolutionary disadvantage associated with nonfunctional (*3, *4, *5, *6) and subfunctional (*9, *10, *17) haplotypes, these haplotypes are maintained at subpolymorphic, and even polymorphic (haplotype *4, reaching a frequency >0.20 among the Basques) frequencies in the populations.

Only a study of the DNA sequences, and a sound understanding of the relationships between genotypes and phenotypes, will tell us what sort of evolutionary pressures underlies this pattern of diversity. Traditionally, defective or abnormal genotypes are identified in individuals showing an abnormal ability to metabolize one molecule, often sparteine or debrisoquine. However, it is known that variants of the P450 protein with reduced ability to metabolize a certain class of molecules may have an increased ability to metabolize other molecules,49 which may easily pass undetected if the starting point of the analysis is the phenotype. The only way to address this problem is to reverse the approach, namely to start from a genetic characterization of the subjects, and to associate to each genotype the distribution of phenotypic values, represented by data on the metabolism of several drugs. This approach, if successful, may pave the ground for an individual-specific pharmaceutical treatment.