Abstract
The human UDP-glucuronosyltransferases (UGTs) have crucial roles in metabolizing and clearing numerous small lipophilic compounds. The UGT1A locus generates nine UGT1A mRNAs, 65 spliced transcripts, and 34 circular RNAs. In this study, our analysis of published UGT–RNA capture sequencing (CaptureSeq) datasets identified novel splice junctions that predict 24 variant UGT1A transcripts derived from ligation of exon 2 to unique sequences within the UGT1A first-exon region using cryptic donor splice sites. Of these variants, seven (1A1_n1, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) are predicted to encode UGT1A proteins with truncated aglycone-binding domains. We assessed their expression profiles and deregulation in cancer using four RNA sequencing (RNA-Seq) datasets of paired normal and cancerous drug-metabolizing tissues from large patient cohorts. Variants were generally coexpressed with their canonical counterparts with a higher relative abundance in tumor than in normal tissues. Variants showed tissue-specific expression with high interindividual variability but overall low abundance. However, 1A8_n2 showed high abundance in normal and cancerous colorectal tissues, with levels that approached or surpassed canonical 1A8 mRNA levels in many samples. We cloned 1A8_n2 and showed expression of the predicted protein (1A8_i3) in human embryonic kidney (HEK)293T cells. Glucuronidation assays with 4-methylumbelliferone (4MU) showed that 1A8_i3 had no activity and was unable to inhibit the activity of 1A8_i1 protein. In summary, the activation of cryptic donor splice sites within the UGT1A first-exon region expands the UGT1A transcriptome and proteome. The 1A8_n2 cryptic donor splice site is highly active in colorectal tissues, representing an important cis-regulatory element that negatively regulates the function of the UGT1A8 gene through pre-mRNA splicing.
SIGNIFICANCE STATEMENT The UGT1A locus generates nine canonical mRNAs, 65 alternatively spliced transcripts, and 34 different circular RNAs. The present study reports a series of novel UDP-glucuronosyltransferase (UGT)1A variants resulting from use of cryptic donor splice sites in both normal and cancerous tissues, several of which are predicted to encode variant UGT1A proteins with truncated aglycone-binding domains. Of these, 1A8_n2 shows exceptionally high abundance in colorectal tissues, highlighting its potential role in first-pass metabolism in the gut through the glucuronidation pathway.
Introduction
The human UDP-glucuronosyltransferase (UGT) superfamily comprises 22 functional enzymes that are divided into four subfamilies: UGT1, UGT2, UGT3, and UGT8 (Meech et al., 2019). UGTs have critical roles in metabolizing and clearing numerous drugs, environmental/dietary toxins and carcinogens, endogenous signaling molecules, and metabolic byproducts (Mackenzie et al., 2005). This is achieved by conjugating the target molecule, referred to as the aglycone, with a sugar moiety derived from a UDP-sugar donor. Aglycone targets are generally lipophilic, and the resulting conjugates are more hydrophilic, facilitating their excretion from the body. UGT proteins comprise an amino (N)-terminal half [∼270–290 amino acids (aa)] that contains the signal peptide and aglycone-binding domain as well as a carboxyl (C)-terminal half (∼250 aa) that contains the UDP-sugar binding domain, transmembrane region (TM), and cytoplasmic tail (Mackenzie et al., 2005). The N-terminal aglycone binding region diverges in sequence between different UGT family members, enabling the metabolism of structurally diverse substrates, whereas the C-terminal sugar binding domain is comparatively conserved among all UGTs (Mackenzie et al., 2005).
The pre-mRNA splicing process is mediated by splice sites and splicing factors. Canonical mRNAs are generated using canonical splice sites, whereas variant transcripts are frequently produced using cryptic splice sites located within exonic and intronic sequences (Baralle and Baralle, 2005; Lee and Rio, 2015; Aldalaqan et al., 2022; Keegan et al., 2022). Alternative splicing represents a critical mechanism for diversifying the transcriptome and proteome of the UGT gene family (Tourancheau et al., 2016; Hu et al., 2019). There are over 200 alternatively spliced UGT transcripts, and nearly 90% of them are predicted to encode variant UGT proteins or UGT-related sequences (Tourancheau et al., 2016). For example, the UGT1A locus comprises nine different first exons (1A1, 1A3–1A10) and exons 2–5 and codes for nine functional UGT1A enzymes (1A1, 1A3–1A10) (Meech et al., 2019) (Fig. 1A). Transcription of the different first exons is initiated from their own promoters and generates nine different UGT1A pre-mRNAs containing all downstream first exons and exons 2–5. Each first exon has a canonical donor splice site but lacks an acceptor splice site (Fig. 1, A and B). During pre-mRNA splicing, this allows only the 5′-most first exon to be spliced to the downstream exons 2–5 to generate nine UGT1A mRNAs (UGT1A_v1s) that each contain a unique first exon and common exons 2–5 (Fig. 1A). For example, the UGT1A8 pre-mRNA contains all nine first exons and exons 2–5; however, only the 5′-most first exon, 1A8, is spliced to exons 2–5 to generate the canonical 1A8 mRNA (1A8_v1) (Fig. 1, B and C). In addition to nine canonical UGT1A_v1s, the UGT1A locus generates 65 variant transcripts and 34 circular RNAs (Girard et al., 2007; Lévesque et al., 2007; Bellemare et al., 2010; Tourancheau et al., 2016; Hu et al., 2021a).
For example, two sets of nine variant UGT1A transcripts (designated UGT1A_v2s and UGT1A_v3s) that are generated using an alternative exon (exon 5b) encode the same nine variant UGT1A proteins (designated UGT1A_i2s) with a novel 10-aa C-terminal peptide (RKKQQSGRQM) replacing the 99-aa C-terminal region of canonical UGT1A proteins (designated UGT1A_i1s) (Girard et al., 2007; Lévesque et al., 2007). UGT1A_i2 proteins lack the TM region and cytoplasmic tail, and thus they have no transferase activity; however, they can interact with UGT1A_i1s to inhibit their activities, indicative of a dominant-negative regulation of glucuronidation activity (Bellemare et al., 2010; Rouleau et al., 2016).
A recent UGT–RNA capture sequencing (CaptureSeq) study reported several UGT1A variant transcripts (i.e., 1A1_n1, 1A3_n2, 1A4_n1, and 1A8_n1) with partial intronization of a first exon that encodes the aglycone-binding domain; however, their coding potential has not yet been described (Tourancheau et al., 2016). We hypothesized herein that with a preserved open reading frame (ORF), these variant UGT1A transcripts could encode a novel type of variant UGT1A proteins that have truncated aglycone-binding domains. For a comprehensive analysis of this type of variant transcripts, we reassessed the UGT-CaptureSeq datasets and identified a series of novel splice junctions predicting 24 different variant UGT1A transcripts that were generated by ligation of exon 2 to unique sequences within the UGT1A first-exon region using cryptic donor splice sites. Seven of these variant transcripts (1A1_n1, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) are predicted to encode variant UGT1A proteins with truncated aglycone-binding domains. We further assessed the expression profiles of these variants and their potential deregulation in cancer using four RNA sequencing (RNA-Seq) datasets of paired normal and cancerous drug-metabolizing tissues (liver, kidney, colon) with large patient cohorts. Guided by this analysis, we selected a variant UGT1A8 transcript (1A8_n2) that is highly expressed in normal and cancerous colorectal tissues for cloning and functional characterization.
Materials and Methods
Discovery of Sequence Reads Containing Splice Junctions That Ligate Exon 2 to Any Sequences within the UGT1A First-Exon Region Using UGT-Enriched CaptureSeq Datasets
A recent study used UGT-specific capture probes to generate 15 UGT-enriched CaptureSeq datasets from three normal tissues (i.e., liver, kidney, intestine/colon) and two cancerous tissues (i.e., kidney, intestine/colon) (Rouleau et al., 2016; Tourancheau et al., 2018). Each CaptureSeq dataset was generated from RNA samples that contained a pool of three to five individual samples, and the UGT-enrichment factor was estimated to be approximately 1000-fold. An analysis of these CaptureSeq datasets identified about 60 variant UGT1A transcripts (Tourancheau et al., 2016). In the present study, we reanalyzed these CaptureSeq datasets, focusing on splice junctions that ligated exon 2, via the exon 2 acceptor splice site, to any sequence within the UGT1A first-exon region using either a canonical or a cryptic donor splice site. Briefly, the UGT-CaptureSeq data (GSE80463) were downloaded from NCBI GEO, and the 100-nucleotide (nt) paired-end reads were merged into a single 200-nt fragment using Illumina Paired-End reAd mergeR (PEAR) (Zhang et al., 2014). The merged reads were searched for sequences in which exon 2 (demarcated by the sequence 5′-GAATTTGAAGCCTACATTAATGCTTCTGGA-3′) was ligated to any sequence within the upstream UGT1A first-exon region. These reads represented splicing events between the exon 2 acceptor splice site and either the canonical exon 1 donor splice sites or cryptic donor splice sites located anywhere within the first-exon region. This analysis generated over 7000 splice junction sequences (Supplemental Table 1). We extracted a further 15 nt of downstream sequence from each of these reads and then aligned the sequences to definitively identify novel splice junctions.
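The junction search described above can be illustrated with a minimal sketch. It is not the pipeline used in this study: it assumes the reads have already been merged, it scans only the forward strand, and it takes the 15 nt immediately 5′ of the exon 2 anchor as the junction signature, which is only one possible reading of the flank-extraction step. The function name `find_junctions`, the `flank` parameter, and the toy reads are hypothetical.

```python
from collections import Counter

# Exon 2 demarcating sequence, as given in the text.
EXON2_ANCHOR = "GAATTTGAAGCCTACATTAATGCTTCTGGA"


def find_junctions(reads, flank=15):
    """Count the sequences found immediately 5' of the exon 2 anchor.

    Assumption: each distinct flank is treated as one candidate junction.
    Reads without the anchor, or with too little flanking sequence, are skipped.
    """
    counts = Counter()
    for read in reads:
        seq = read.upper()
        pos = seq.find(EXON2_ANCHOR)
        if pos >= flank:  # anchor present and a full-length flank is available
            counts[seq[pos - flank:pos]] += 1
    return counts


# Toy merged reads (hypothetical sequences, not real data).
toy_reads = [
    "TTT" + "ACGTACGTACGTACG" + EXON2_ANCHOR + "CC",
    "GG" + "ACGTACGTACGTACG" + EXON2_ANCHOR,
    "A" + "TTTTTCCCCCGGGGG" + EXON2_ANCHOR,
    "ACGT" + EXON2_ANCHOR,   # flank too short, skipped
    "ACGTACGTACGTACGTACGT",  # no exon 2 anchor, skipped
]
junction_counts = find_junctions(toy_reads)
```

In practice, each distinct flank would then be aligned to the first-exon region to decide whether it ends at a canonical or a cryptic donor splice site.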
Quantitation of Variant UGT1A Transcripts Using RNA-Seq Datasets from Paired Normal and Cancerous Drug-Metabolizing Tissues
The Sequence Read Archive (SRA) platform stores the largest publicly available repository of high-throughput sequencing data (https://www.ncbi.nlm.nih.gov/sra). To search for UGT1A transcripts in RNA-Seq datasets through the SRA platform, we designed transcript-specific probes that spanned either the canonical or cryptic splice junctions identified in the previous section (Supplemental Table 2). Together these probes identified all canonical UGT1A transcripts as well as 24 different novel variant transcripts. Using the SRA platform, we obtained sequence read data of canonical (_v1) and variant UGT1A transcripts from the 15 UGT-CaptureSeq datasets (SRP073607) and four other RNA-Seq datasets generated from paired normal and cancerous tissues. The four RNA-Seq datasets included: 1) 103 paired colorectal cancer and adjacent normal tissues (SRP107326) (Wu et al., 2020); 2) 22 paired urinary bladder cancer and adjacent normal tissues (SRP212702) (Chen et al., 2019); 3) 61 paired kidney cancer and adjacent normal tissues (SRP238334) (Wang et al., 2020); and 4) 65 paired liver cancer and adjacent normal tissues (SRP401130) (Long et al., 2022). These four RNA-Seq datasets were generated without UGT-enrichment using different platforms, including NextSeq 500 (Wu et al., 2020), HiSeq X Ten (Chen et al., 2019; Wang et al., 2020), and NovaSeq 6000 (Long et al., 2022). The numbers of sequence reads containing splice junctions corresponding to canonical (1A1, 1A3, 1A4, 1A5, 1A8, 1A9, 1A10) and variant (1A1_n1, 1A3_n3, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) UGT1A transcripts are provided in Supplemental Tables 3–6. The expression level of each transcript was normalized to the total number of sequence reads in the same dataset and is presented as normalized reads per 10⁹ total sequence reads.
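The normalization described above amounts to scaling each junction read count by the total number of reads in the same dataset. A minimal sketch follows; the function name `reads_per_billion` and the example counts are hypothetical and are not taken from the Supplemental Tables.

```python
def reads_per_billion(junction_reads, total_reads):
    """Scale a junction read count to reads per 10^9 total sequence reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return junction_reads * 10**9 / total_reads


# Hypothetical example: 25 junction-spanning reads in a dataset of 50 million reads.
example_level = reads_per_billion(25, 50_000_000)
```

Expressing levels per 10⁹ reads allows datasets of different sequencing depth to be compared on the same scale.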
Cloning of Variant Transcripts 1A8_n2 or 1A10_n7 from Colorectal Cancer HT-29 Cells
The colorectal cancer HT-29 cell line was purchased from American Type Culture Collection (ATCC, Manassas, VA), and cultured in Dulbecco’s modified Eagle’s medium (DMEM) supplemented with 10% fetal bovine serum (FBS) as previously reported (Hu et al., 2021a). Total RNA was extracted from HT-29 cells using TRIzol reagent (Thermo Fisher Scientific, Carlsbad, CA). Reverse transcription (RT) was conducted using Invitrogen reagents as previously reported (Hu et al., 2018). Briefly, 1 μg total RNA was treated with DNase I and then subjected to RT in a 20-μl reaction containing Superscript III (50 units) and random hexamer primers (50 ng) at 50°C for 50 minutes. The resultant cDNAs were diluted five times in RNase-free H2O prior to polymerase chain reaction (PCR) as described below.
To clone the UGT1A8 or UGT1A10 full coding sequence, we designed a common reverse primer (5′-TCTCTAGAGGTACCACGCGTTCAATGGGTCTTGGATTTG-3′) and a forward primer specific for 1A8 (5′-CACTATAGGCTAGCCTCGAGATGGCTCGCACAGGGTGGACCAGCCCCA-3′) or 1A10 (5′-CACTATAGGCTAGCCTCGAGATGGCTCGCGCAGGGTGGACCAGCCCCG-3′). PCRs were carried out using HT-29 cDNA and Phusion High-Fidelity DNA Polymerase (New England Biolabs Ltd.). After verification on agarose gels, PCR products were purified and cloned into the XhoI/MluI sites of the pEF_IRESpuro6 expression vector as previously reported (Hu et al., 2018, 2021a). The identities of the resultant constructs were confirmed by DNA sequencing.
Expression of UGT1A8_i3 and Western Blotting Assays
The human embryonic kidney (HEK)293T cell line was purchased from American Type Culture Collection (ATCC, Manassas, VA), and cultured in Dulbecco’s modified Eagle’s medium (DMEM) supplemented with 10% fetal bovine serum (FBS). HEK293T cells were transfected with the UGT1A8_v1 or variant UGT1A8_n2 expression plasmids, either individually or in combination, using Lipofectamine2000 as previously reported (Hu et al., 2018). To ensure that the same amount of DNA was transfected in all conditions, a control plasmid was added where required to bring the total plasmid DNA amount to 1 μg per well. Transfected cells were harvested after 48 hours and lysates prepared in TE buffer (10 mM Tris-HCl, 1 mM EDTA, pH 7.6). The resultant lysates were subjected to western blotting assays to verify the expression of canonical (1A8_i1) or variant (designated 1A8_i3) UGT1A8 proteins using a polyclonal antibody that recognizes the C-terminal region of human UGT1A proteins (catalog number 458410, 1:2000 dilution; Gentest, Woburn, MA). Briefly, equal amounts of HEK293T lysate proteins were separated by SDS-PAGE (10%) and transferred to Trans-Blot nitrocellulose membrane (Bio-Rad) for 1 hour at 100 V, 4°C. Precision Plus Protein WesternC blotting standards (1610385; Bio-Rad) were used as a molecular weight marker. Membranes were blocked in 4% (w/v) nonfat dry milk in Tris-buffered saline with Tween 20 (TBST) for 1 hour, followed by incubation with the primary antibody at 1:2000 dilution overnight at 4°C. After washing, secondary goat-anti-rabbit antibody (65-6120; Thermo Fisher Scientific) was applied at 1:2000 dilution for 2 hours. Membranes were washed, treated with enhanced SuperSignal West Pico chemiluminescent (ECL) HRP substrate (Thermo Fisher Scientific) and imaged using an ImageQuant LAS 4000 (GE Healthcare Life Sciences). Protein bands were quantified by densitometry using the ‘Gel Analysis’ tools in the ImageJ software package (NIH).
These analyses were performed for at least three independent transfection experiments.
Glucuronidation Assays
We further performed 4-methylumbelliferone (4MU) glucuronidation assays with lysates from HEK293T that were transfected with the canonical UGT1A8_v1 or variant UGT1A8_n2 expression construct, alone and in combination as described above. The assays were carried out in a 200-μl reaction consisting of equal amounts HEK293T lysate protein from each sample, 100 mM potassium phosphate buffer (pH 7.4), 4 mM MgCl2, 400 μM 4MU, and 5 mM UDPGA. The reaction was incubated at 37°C for 2 hours and then terminated by the addition of 2 μl 70% perchloric acid on ice. The reaction mixtures were centrifuged at 5000 g for 10 minutes at 4°C, and 150 μl supernatant was transferred to HPLC vials for analysis using high-performance liquid chromatography on an Agilent 1100 series instrument (Agilent Technologies, Sydney, NSW, Australia) as previously reported (Hu et al., 2018). Briefly, analytes were separated on a Waters Nova-Pak C18 column (60Å, 4 μm, 3.9 mm × 150 mm). The mobile phase, delivered at a flow rate of 1 ml/min, consisted of two solutions mixed according to a gradient timetable: phase A [10 mM triethylamine/perchloric acid buffer (pH 2.5), 10% acetonitrile] and phase B (100% acetonitrile). Initial conditions were 94% phase A – 6% phase B for 3 minutes, followed by 65% phase A – 35% phase B, which was held constant for 1 minute before returning to the starting conditions. The resulting 4-methylumbelliferone glucuronide (4MUG) compound was detected by UV at an excitation wavelength of 316 nm. The retention time for 4MUG was approximately 3.6 minutes. A standard curve (0.625 μM to 12.5 μM) generated using 4MUG standards (Merck Australia, Darmstadt, Germany) was used to allow absolute quantification of 4MUG in samples.
To normalize the 4MU glucuronidation activity in each sample, the values were divided by the band intensities of 1A8_i1 and 1A8_i3 proteins estimated by densitometric analysis of western blots as described above.
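The quantification and normalization steps above can be sketched as a standard-curve interpolation followed by division by the western blot band intensity. This is an illustrative sketch only: the linear least-squares fit, the function names `fit_line` and `normalized_activity`, the intermediate standard concentrations, and all peak-area and band-intensity values are assumptions; only the 0.625–12.5 μM range of the standard curve comes from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def normalized_activity(peak_area, slope, intercept, band_intensity):
    """Convert a 4MUG peak area to a concentration, then divide by band intensity."""
    concentration = (peak_area - intercept) / slope  # interpolate on the standard curve
    return concentration / band_intensity


# Hypothetical 4MUG standards (uM) and peak areas (arbitrary units).
standards_um = [0.625, 1.25, 2.5, 5.0, 12.5]
peak_areas = [6.25, 12.5, 25.0, 50.0, 125.0]
slope, intercept = fit_line(standards_um, peak_areas)

# Hypothetical sample: peak area 50.0 and relative band intensity 2.0.
activity = normalized_activity(50.0, slope, intercept, 2.0)
```

Dividing by band intensity expresses activity per unit of expressed protein, so that samples with different transfection efficiencies can be compared.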
Statistical Analysis
The potential correlation between the expression levels of canonical and variant UGT1A transcripts in normal or cancerous tissues was assessed by Spearman rank correlation analysis. The potential deregulation in the expression levels of canonical or variant UGT1A transcripts in cancerous tissues compared with those in normal tissues was assessed by the Wilcoxon matched-pairs signed-rank test. Both statistical analyses were conducted using GraphPad Prism (version 9.1.1) (GraphPad Software, San Diego, CA). P < 0.05 was considered statistically significant. According to recent guidelines for displaying data and reporting data analysis and statistical methods in experimental biology (Michel et al., 2020), all findings reported in this study are considered exploratory.
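For readers unfamiliar with rank correlation, the following sketch shows how a Spearman coefficient can be computed from paired expression values. It is a plain-Python illustration and not the GraphPad Prism implementation used in this study; it returns only the coefficient (no P value) and does not cover the paired signed-rank test. The function names and toy values are hypothetical.

```python
def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy canonical and variant levels (hypothetical, monotonically related).
canonical_levels = [1, 2, 3, 4, 5]
variant_levels = [2, 4, 8, 16, 32]
rho = spearman_rho(canonical_levels, variant_levels)
```

Because the coefficient depends only on ranks, it is insensitive to the large differences in absolute abundance between canonical and variant transcripts.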
Results
Discovery of Novel UGT1A Variants Generated by Cryptic Donor Splice Sites within the UGT1A First-Exon Region and Their Coding Potentials
A recent CaptureSeq study examined the transcriptome of the UGT1A locus in drug-metabolizing tissues (Rouleau et al., 2016). We reanalyzed these UGT-CaptureSeq datasets using a search strategy that identified splice junctions that ligated exon 2, via its acceptor splice site, to any sequence within the UGT1A first-exon region using canonical or cryptic donor splice sites. In total, 7296 splice junctional sequences were identified (Supplemental Table 1). Of these splice junctional sequences, 7195 (97%) corresponded to ligation of exon 2 to one of the nine canonical UGT1A first exons via their canonical donor splice sites, indicating that they were derived from canonical UGT1A_v1 transcripts. The remaining 201 sequences (3% of total junctional sequences) contained one of 24 different variant splice junctions in which exon 2 was ligated to a novel sequence within the UGT1A first-exon region using cryptic donor splice sites (Fig. 2A). Based on these splicing junctions and the currently accepted UGT1A nomenclature (Tourancheau et al., 2016), we predict 24 different variant transcripts that are named as 1) isoform-specific variants generated using cryptic donor splice sites within canonical (1A1_n1, 1A3_n3, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) or pseudo (1A2P_n10, 1A2P_n11) first exons and 2) other UGT1A variants without first-exon sequences that are generated using cryptic donor splice sites within sequences between first exons (designated 1A_n9, 1A_n10, 1A_n11, 1A_n12, 1A_n13, 1A_n14, 1A_n15, 1A_n16, 1A_n17, 1A_n18, 1A_n19, 1A_n20, 1A_n21, 1A_n22) (Fig. 2B; Supplemental Table 7).
The type of protein that might be encoded by each of these variant transcripts depends upon the location of their cryptic donor splice sites. In 16 of the variants, the cryptic donor splice sites were located in an intervening region between the nine canonical first exons (i.e., 1A_n9 through 1A_n22, 1A2P_n10, and 1A2P_n11) (Fig. 2B). Ligation of these donor splice sites to the exon 2 acceptor splice site results in exonization of these intervening sequences and generates transcripts that contain a novel first exon and exons 2–5. This likely leads to a frameshift at the exon 1/exon 2 junction and a premature stop codon. Hence, proteins encoded by these variants will only contain the region encoded by the novel exon 1 but lack regions encoded by the downstream exons 2–5. However, if the novel first exon codes for a peptide that is in-frame with the C-terminal half encoded by exons 2–5, these variants are predicted to encode a novel type of UGT1A proteins that contain the intact C-terminal half and a novel short N-terminal peptide. This possibility remains to be explored.
In contrast, eight of the novel cryptic donor splice sites were located within a canonical UGT1A first exon (i.e., 1A1_n1, 1A3_n3, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) (Fig. 2B); these sites all conformed to the GT dinucleotide splice signal convention (Supplemental Fig. 1). Ligation of these donor splice sites to exon 2 results in partial intronization of exon 1 and generates transcripts that have a 3′-truncated exon 1 and exons 2–5 (Supplemental Fig. 1). Proteins encoded by these variants lack regions encoded by the intronized region of exon 1; however, if the reading frame is maintained at the junction with exon 2, the predicted protein will contain the domains encoded by exons 2–5. In seven of these eight variants (1A1_n1, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7), the reading frame was in fact maintained in this way. These transcripts are predicted to encode variant UGT1A proteins with truncated aglycone-binding domains (Supplemental Fig. 2; Table 1). The cryptic donor splice sites used in 1A3_n4, 1A4_n4, and 1A5_n1 are conserved in relative position; thus, their encoded proteins are the same size (309 aa) and contain only the first 64 aa of the aglycone-binding domain (Fig. 3A; Supplemental Fig. 2). Similarly, the 1A8_n2 and 1A9_n2 transcripts were also generated using a conserved donor splice site, and their encoded proteins are thus the same size (346 aa) and contain only 101 aa of the aglycone-binding domain (Fig. 3B; Supplemental Fig. 2). We did not detect similar variant transcripts from 1A7 and 1A10, as they lack a cryptic donor splice site in this position (Fig. 3B). The 1A1_n1 variant encodes the longest protein (487 aa) containing the N-terminal 241 aa encoded by the canonical UGT1A1 first exon and the intact C-terminal half encoded by exons 2–5. This variant was previously reported and is produced by intronization of the 3′ 141 nucleotides of the 1A1 first exon using the 1A1_n1 donor splice site (Tourancheau et al., 2018).
Three other variants (1A3_n4, 1A4_n4, 1A8_n2) are superficially similar to those (1A3_n2, 1A4_n1, 1A8_n1, respectively) schematized by Tourancheau et al. (2016). However, as the splice junction sequences relating to these variants were not reported in that study, whether they are the same cannot be determined.
The ORF of 1A3_n3 was disrupted at the exon 1/exon 2 junction, and 1A3_n3 thus encodes a truncated protein (239 aa) with an intact N-terminal signal peptide, a truncated aglycone-binding domain, and a 77-aa novel C-terminal peptide in place of the native C-terminal domain (Supplemental Fig. 2; Table 1).
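The coding-potential reasoning in this section rests on two pieces of arithmetic, sketched below. Removing a contiguous segment from a coding sequence leaves the downstream reading frame unchanged only when the removed length is a multiple of 3 nt, and the predicted protein length is the retained exon 1-encoded portion plus the portion encoded by exons 2–5. The value of 245 aa for exons 2–5 is an assumption inferred from the sizes quoted above (309 − 64 and 346 − 101) and is not stated explicitly in the text; the function names are hypothetical.

```python
# Assumption: residues contributed by exons 2-5, inferred from 309 - 64 and 346 - 101.
C_TERMINAL_AA = 245


def frame_preserved(intronized_nt):
    """True if removing this many coding nucleotides keeps the downstream frame."""
    return intronized_nt % 3 == 0


def predicted_length(retained_exon1_aa, c_terminal_aa=C_TERMINAL_AA):
    """Predicted protein length (aa) for an in-frame, 3'-truncated first exon."""
    return retained_exon1_aa + c_terminal_aa


# Retained exon 1-encoded residues quoted in the text for two groups of variants.
predicted = {
    "1A3_n4/1A4_n4/1A5_n1": predicted_length(64),
    "1A8_n2/1A9_n2": predicted_length(101),
}

# The 141-nt intronized segment reported for 1A1_n1 is a multiple of 3.
in_frame_1a1_n1 = frame_preserved(141)
```

A variant such as 1A3_n3, whose ORF is disrupted at the exon 1/exon 2 junction, corresponds to the case in which this frame condition is not met.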
Expression Profiles of Variant UGT1A Transcripts in UGT-Enriched CaptureSeq Datasets
Using transcript-specific probes and the SRA platform, we initially quantified sequence reads containing each of the 24 different transcript-specific splice junctions using UGT-enriched CaptureSeq data from normal liver, kidney, and intestine/colon tissues and cancerous kidney and intestine/colon tissues (Tourancheau et al., 2016) (Supplemental Table 2). For example, all of the splice junctional reads identified for three variant transcripts (1A4_n4, 1A8_n2, 1A9_n2) are shown in Supplemental Fig. 3. In general, the variants showed tissue-specific expression consistent with their canonical counterparts but with high variability across samples (Supplemental Fig. 4). Seven variant transcripts (1A_n9, 1A_n10, 1A_n14, 1A4_n4, 1A8_n2, 1A9_n2, 1A_n18) were found in more than half of the samples. All but four (1A10_n7, 1A_n17, 1A_n19, 1A_n20) transcripts were found in normal liver tissue. This is consistent with the fact that the cryptic donor splice sites used to generate these variants are located close to UGT1A first exons that are poorly expressed in liver (e.g., 1A10, 1A11P, 1A12P) (Hu et al., 2019). 1A9_n2 was found in 11 samples, with the highest expression in normal kidney tissues. The variant with the widest distribution was 1A8_n2, which was found in all 15 samples, with the highest expression in normal intestine/colon tissues (Supplemental Fig. 4A). The canonical UGT1A8_v1 transcript is also considered extrahepatic and shows highest expression in gut (Hu et al., 2019). 1A8_n2 showed abundant expression, with levels that were similar to or higher than those of 1A8_v1 in several samples (Supplemental Fig. 4B). In contrast, 1A9_n2 and all other variant transcripts had very low levels that were generally less than 1% of their canonical _v1 levels (Supplemental Fig. 4, C and D; Supplemental Table 2). 
Several variants showed tissue-selective expression, such as 1A_n10 (normal intestine/colon), 1A_n12 (normal liver), 1A_n14 (normal liver and intestine/colon), and 1A4_n4 (normal liver) (Supplemental Table 2).
Quantification of Variant UGT1A Transcripts in RNA-Seq Datasets without UGT-Enrichment from Paired Normal and Cancerous Drug-Metabolizing Tissues
The above-mentioned CaptureSeq samples were enriched using UGT-specific tiling probe arrays; hence, it is possible that they might not fully recapitulate the relative abundance of canonical and variant transcripts in bulk RNA. Furthermore, the CaptureSeq samples were pooled from three to five individual tissues that were not suitable for assessing the interindividual expression variability of variant transcripts. To more accurately define the expression profiles of UGT1A transcripts and their potential deregulation in cancer, we assessed four other RNA-Seq datasets without UGT-enrichment that were generated from paired normal and cancerous drug-metabolizing tissues. Using transcript-specific probes and the SRA platform, we focused on analysis of the eight variant transcripts (1A1_n1, 1A3_n3, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) that are predicted to encode variant UGT1A proteins (Table 1) as described in detail below.
Four canonical UGT1A mRNAs (i.e., 1A1_v1, 1A3_v1, 1A4_v1, 1A9_v1) are known to be highly expressed in liver (Nakamura et al., 2008; Izukawa et al., 2009; Court et al., 2012; Hu et al., 2019, 2021b). We assessed the expression of these canonical transcripts together with their newly identified variants (1A1_n1, 1A3_n3, 1A3_n4, 1A4_n4, 1A5_n1, 1A9_n2) using RNA-Seq data from 65 pairs of hepatocellular carcinoma (HCC) tumors and adjacent normal tissues (Long et al., 2022). The four canonical transcripts were found in all samples, with three of the four (1A1_v1, 1A3_v1, 1A4_v1) significantly downregulated in tumor tissues compared with adjacent normal tissues (Fig. 4A, a–d; Supplemental Table 3). Variants 1A1_n1 and 1A3_n4 were found in very few samples and showed low abundance (Table 2). Variants 1A3_n3 and 1A5_n1 were not found in any tissue. Variant 1A4_n4 was found in 37 normal tissues and 24 tumor tissues, with highly variable levels across normal and cancer samples, but statistical analysis did not show significantly differential expression between paired tumor and normal samples (Fig. 4Ae). The 1A4_n4/1A4_v1 expression ratios were <5% in most samples, with a ratio of >10% in two normal tissues and a ratio of >15% in four tumor samples (Fig. 4B). Variant 1A9_n2 was found in eight normal and seven tumor tissues, but the 1A9_n2/1A9_v1 expression ratio was generally <5% in these samples (Supplemental Table 3; Table 2). Overall, 1A4_n4 was the only variant widely expressed in normal and cancerous liver tissues, with a low abundance relative to its canonical 1A4_v1.
In addition to liver, canonical UGT1A9_v1 is also highly expressed in kidney (Court et al., 2012; Hu et al., 2019, 2021b). Expression of canonical (1A9_v1) and variant (1A9_n2) transcripts was assessed using RNA-Seq data from 61 pairs of clear cell renal cell carcinoma (RCC) tumors and adjacent normal tissues (Wang et al., 2020). 1A9_v1 was found in 60 normal and 59 tumor tissues and was significantly downregulated in tumor compared with adjacent normal tissues (Fig. 5Aa; Supplemental Table 4). 1A9_n2 was expressed in 18 normal samples and nine tumor samples (Table 2) and was positively correlated with 1A9_v1 expression in normal tissues (Fig. 5Ab). However, the number of tissues with 1A9_n2 expression was insufficient to assess its potential differential expression between normal and tumor tissues. The 1A9_n2/1A9_v1 expression ratios were low (<4%) across normal and tumor tissues (Fig. 5B).
Canonical 1A8_v1 and 1A10_v1 are abundantly expressed in extrahepatic tissues such as colon and rectum (Nakamura et al., 2008; Izukawa et al., 2009; Court et al., 2012; Hu et al., 2021b). Expression of 1A8_v1 and 1A10_v1 and their variants (1A8_n2, 1A10_n7) was assessed using RNA-Seq datasets from 103 paired normal and cancerous tissues from colorectal cancer patients (Wu et al., 2020) (Supplemental Table 5). 1A8_v1 was found in all 103 normal tissues and 78 cancerous tissues; 1A8_n2 was found in 100 normal tissues and 63 cancerous tissues (Supplemental Table 5; Table 2). Both 1A8_v1 and 1A8_n2 were significantly downregulated in tumors compared with matched normal tissues (Fig. 6A, a and b). Moreover, the 1A8_v1 and 1A8_n2 levels were positively correlated in both normal and tumor tissue (Fig. 6A, c and d; Supplemental Table 5). Of the 100 normal tissues with coexpression of 1A8_v1 and 1A8_n2, over 50% had a 1A8_n2/1A8_v1 ratio between 0.2 and 0.5, with only a few showing ratios >1.0 (Fig. 6B; Supplemental Table 5). By contrast, of the 59 tumor tissues with coexpression of 1A8_v1 and 1A8_n2, about 30% showed a 1A8_n2/1A8_v1 ratio ≥1.0. To better assess whether a 1A8_n2/1A8_v1 ratio is correlated with the cancerous state, we examined the set of patients with 1A8_v1 and 1A8_n2 coexpression in both tumor and normal samples. Of these 59 patients, 42 patients had a higher 1A8_n2/1A8_v1 ratio in the tumor than the matched normal tissue (Fig. 6B). Overall, we found that 1A8_n2 was abundantly expressed in colorectal cancer (CRC) tumor and adjacent normal tissues and that tumor tissues showed higher 1A8_n2/1A8_v1 ratios.
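The paired comparison of 1A8_n2/1A8_v1 ratios described above can be sketched as follows. The sketch restricts the comparison to patients with coexpression of both transcripts in both tissues, as in the text; the function names, dictionary keys, and toy expression values are hypothetical and do not reproduce the data in Supplemental Table 5.

```python
def variant_ratio(variant, canonical):
    """Return the variant/canonical ratio, or None unless both are expressed."""
    if variant > 0 and canonical > 0:
        return variant / canonical
    return None


def count_tumor_higher(patients):
    """Count patients whose tumor ratio exceeds their matched normal ratio.

    Returns (number with a higher tumor ratio, number of patients compared).
    """
    higher = compared = 0
    for p in patients:
        normal = variant_ratio(p["normal_n2"], p["normal_v1"])
        tumor = variant_ratio(p["tumor_n2"], p["tumor_v1"])
        if normal is None or tumor is None:
            continue  # no coexpression in one of the two tissues
        compared += 1
        if tumor > normal:
            higher += 1
    return higher, compared


# Toy normalized read counts for three patients (hypothetical values).
toy_patients = [
    {"normal_n2": 30, "normal_v1": 100, "tumor_n2": 20, "tumor_v1": 10},
    {"normal_n2": 50, "normal_v1": 100, "tumor_n2": 10, "tumor_v1": 40},
    {"normal_n2": 40, "normal_v1": 100, "tumor_n2": 0, "tumor_v1": 25},
]
result = count_tumor_higher(toy_patients)
```

Here the first patient shows a higher ratio in tumor (2.0 versus 0.3), the second a lower ratio (0.25 versus 0.5), and the third is excluded because 1A8_n2 is not detected in the tumor.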
As expected, we found canonical 1A10_v1 in all 103 normal colon/rectum tissues and 95 CRC tumors (Supplemental Table 5). However, the variant 1A10_n7 was only found in seven CRC tumors, with a 1A10_n7/1A10_v1 ratio <0.01. Thus, unlike 1A8_n2, 1A10_n7 appears to be a rare variant transcript in CRC.
In addition to gastrointestinal tissues, 1A8_v1 is also highly expressed in urinary bladder tissues (Court et al., 2012; Hu et al., 2019, 2021b). Analysis of RNA-Seq data from 22 pairs of urinary bladder cancer tissue and adjacent normal bladder tissues showed highly variable expression of both 1A8_v1 and 1A8_n2 in normal and cancerous tissues (Chen et al., 2019) (Supplemental Table 6; Table 2). 1A8_v1 was found in 22 normal and 12 tumor tissues and showed significantly decreased expression in tumor compared with matched normal tissues (Fig. 7Aa); 1A8_n2 was found in 18 normal and nine tumor tissues; however, it did not show consistent downregulation in tumors compared with matched normal tissues (Fig. 7Ab). The 1A8_v1 and 1A8_n2 expression levels were positively correlated in both normal and cancerous tissues (Fig. 7A, c and d). The expression ratio of 1A8_n2/1A8_v1 was highly variable, ranging from 0.17 to 3.0 in normal tissues and from 0.21 to 2 in tumor tissues (Fig. 7B). Overall, 1A8_n2 is abundantly expressed in normal and tumor urinary bladder tissues; however, its potential deregulation in bladder cancer requires further analysis with a larger patient cohort.
Cloning and Functional Analysis of Variant Transcripts UGT1A8_n2 and UGT1A10_n7
Our analysis of the RNA-Seq datasets (SRR15145268, SRR15145269, SRR15145270) generated from the colorectal cancer HT-29 cell line showed expression of both canonical (1A8_v1, 1A10_v1) and variant transcripts (1A8_n2, 1A10_n7) in this cell line (data not shown). The variant exon1-exon2 splice junctions identified in short-read RNA-Seq datasets (100–150 nt in length) were presumed to represent longer transcripts that also include exons 3–5. To confirm this, we attempted to amplify complete open reading frames (ORFs) for variants 1A8_n2 and 1A10_n7 using cDNA from HT-29 cells. Using primers that span the initiation and termination codons of UGT1A8 or UGT1A10, we amplified both the expected canonical products (1593 bp) and shorter products (∼1000 bp) (Supplemental Fig. 5A). We cloned these PCR products into expression vectors, and sequencing confirmed that the shorter products contained complete ORFs corresponding to the novel variants 1A8_n2 and 1A10_n7, whereas the longer products contained the complete ORFs for canonical 1A8_v1 and 1A10_v1. The sequencing chromatograms for 1A8_n2 and 1A10_n7 covering the 3′-truncated first exon and exons 2–5 are given in Supplemental Figs. 6 and 7, respectively. The exon structures of the pre-mRNA, mRNA (1A8_v1, 1A10_v1), and variant transcript (1A8_n2, 1A10_n7) of UGT1A8 and UGT1A10 are given in Supplemental Fig. 8. These observations provide evidence for the in vivo synthesis of the predicted full-length variant transcripts identified from this study.
The resultant vectors carrying the 1A8_n2 ORF and the 1A8_v1 ORF were transfected alone or in combination into HEK293T cells. Immunoblotting with a pan-UGT1A antibody identified proteins of the expected size in each condition, providing evidence for the stable expression of both the canonical protein (1A8_i1, ∼50 kDa) and the variant protein encoded by the UGT1A8_n2 transcript, which we designated 1A8_i3 (∼37 kDa) in accordance with current UGT1A nomenclature (Fig. 5B).
We further performed glucuronidation assays with 4MU as a substrate. Our results showed that 1A8_i3 had no activity with 4MU and was unable to inhibit the activity of canonical 1A8_i1 protein with this substrate (Fig. 5C). Previous studies have shown that the inhibitory activity of the C-terminally truncated UGT1A_i2 proteins was substrate specific (Benoit-Biancamano et al., 2009; Bellemare et al., 2010). It remains to be determined whether 1A8_i3 can inhibit the activity of 1A8_i1 with substrates other than 4MU.
Discussion
The landscape of differently spliced UGT transcripts has expanded rapidly in recent years, with multiple research groups contributing to the identification, and in some cases characterization, of novel variants (Girard et al., 2007; Lévesque et al., 2007; Bushey and Lazarus, 2012; Rouleau et al., 2016; Tourancheau et al., 2016; Hu et al., 2018, 2021a). However, as deep sequencing has revealed ever more splicing heterogeneity, the question arises of whether this is biologically meaningful or simply noise resulting from stochastic fluctuations in spliceosome activity (Wan and Larson, 2018). Relative transcript abundance and differential expression between tissues or disease states are potential clues to biologic function; however, to date such analysis of UGT variant transcripts has been limited due to the lack of analysis of large cohorts of healthy and diseased human tissues (Jones et al., 2012; Tourancheau et al., 2018). In this study, we analyzed RNA-Seq data from large patient cohorts to characterize the heterogeneity of UGT variant production in vivo, to determine relative transcript abundance, and to assess whether UGT splicing is dysregulated in neoplastic contexts.
The extensively characterized C-terminally truncated proteins (UGT1A_i2) have no transferase activity, but they are reported to inhibit the activity of canonical UGT1A_i1 proteins through protein-protein interaction (Girard et al., 2007; Lévesque et al., 2007; Bellemare et al., 2010; Rouleau et al., 2016). To date, variant UGT1A proteins containing internal deletions of the aglycone-binding domain have not been studied. In the present study, we identified 24 different variant transcripts that were derived from splicing of the exon 2 acceptor splice site to novel cryptic donor splice sites within the UGT1A first-exon region. Seven (i.e., 1A1_n1, 1A3_n4, 1A4_n4, 1A5_n1, 1A8_n2, 1A9_n2, 1A10_n7) of these variants are predicted to encode a novel type of variant UGT1A protein with internal deletions (47–225 aa) within the aglycone-binding domain. Our glucuronidation assays with recombinant 1A8_i3 show that this aglycone-binding domain-truncated UGT1A protein had no activity with 4MU and was not able to inhibit the activity of its canonical counterpart. The lack of inhibitory activity may suggest the necessity of the aglycone-binding domain for physical interaction between canonical and variant UGT1A proteins, which is essential for the inhibitory activity of the C-terminally truncated UGT1A_i2. This hypothesis remains to be investigated.
Our analysis of the expression profiles of UGT1A variant transcripts using large patient cohorts revealed that variants showed broadly the same tissue distribution as their canonical counterparts. Most UGT1As are expressed in liver, some (e.g., 1A9) are additionally expressed in kidney and gastrointestinal tract (GIT), and three (1A7, 1A8 and 1A10) are considered extrahepatic showing expression in GIT but not liver (Hu et al., 2019). Consistent with these expression patterns, 1A8_n2 and 1A10_n7 transcripts were found in colorectal tissues and 1A9_n2 was found in liver and kidney. These findings are consistent with previous observations that canonical and variant UGT1A transcripts of the same UGT1As are coexpressed in drug-metabolizing tissues (Tourancheau et al., 2018). Our analysis of paired normal and tumor samples suggested that some variant transcripts may be dysregulated in cancers. For example, both 1A8_v1 and 1A8_n2 were downregulated in CRC; however, the average 1A8_n2/1A8_v1 ratio was almost 2-fold higher in CRC than in normal tissue, suggesting differential regulation of the canonical and variant transcripts. Higher UGT1A_v2/v3 versus _v1 expression ratios were also observed in kidney and intestine/colon tumor compared with normal tissues, highlighting similar differential regulation (Tourancheau et al., 2018). Collectively, these data suggest the potential dysregulation of UGT splicing in cancers.
The canonical view is that each of the nine functional UGT1A first exons has one canonical donor splice site and lacks an acceptor splice site (Fig. 1A). This is essential for generating nine functional UGT1A mRNAs (UGT1A_v1s), each with a unique first exon and common exons 2–5, and for preventing the generation of UGT1A transcripts that have more than one first exon (Fig. 1A). Results from the present study and a previous study indicate that seven canonical UGT1A first exons (i.e., 1A1, 1A3, 1A4, 1A5, 1A8, 1A9, 1A10) harbor cryptic donor splice sites (Fig. 2B). Two UGT1A first exons (i.e., 1A3, 1A6) also contain cryptic acceptor splice sites that are used to generate variant transcripts (i.e., 1A3_n1, 1A6_n1, 1A6_n2) (Tourancheau et al., 2016). These observations demonstrate that the activation of cryptic acceptor and donor splice sites within the nine canonical UGT1A first exons constitutes an important mechanism that expands the UGT1A transcriptome and proteome. However, the low abundance of the transcripts generated using these cryptic splice sites suggests that they are much weaker sites compared with their canonical splice sites. The one exception is the 1A8_n2 donor splice site, which showed activity that approached or even surpassed that of the 1A8 canonical donor splice site in many normal and cancer colorectal tissues. We used splice site prediction algorithms to assess the strength of all potential donor splice sites within the 1A8 first exon (Reese et al., 1997). As expected, the 1A8_n2 cryptic donor splice site scored lower than the canonical donor splice site; however, it also scored lower than other predicted donor splice sites for which we found no derived variants (data not shown). It is unclear why the 1A8_n2 cryptic donor splice site is used at high frequency when stronger predicted splice donor sites are not used in colorectal tissues. We cannot rule out the possibility that the predicted stronger donor splice sites may be active in noncolorectal tissues.
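As a simplified illustration of how candidate donor sites within a first exon might be enumerated, the following Python sketch lists every GT dinucleotide in a sequence and flags whether splicing at that position would remove a multiple of three nucleotides, and so keep the downstream reading frame. The sequence and function name are hypothetical. The sketch does not score splice site strength and is not the prediction algorithm of Reese et al. (1997), which evaluates the sequence context around each site; it also ignores stop codons that may be created at the new junction.

```python
from typing import List, Tuple


def candidate_donor_sites(exon: str) -> List[Tuple[int, bool]]:
    """List every GT dinucleotide in an exon sequence as a candidate
    cryptic donor site, with a flag for reading-frame preservation.

    If splicing occurs immediately upstream of the GT at index i, the
    exon is truncated to exon[:i] and len(exon) - i nucleotides are lost.
    The downstream frame is kept only when that loss is a multiple of 3.
    """
    exon = exon.upper()
    sites = []
    for i in range(len(exon) - 1):
        if exon[i:i + 2] == "GT":
            deleted = len(exon) - i
            sites.append((i, deleted % 3 == 0))
    return sites


toy_exon = "ATGGCTGTACGGTCAAGTTCA"  # made-up 21-nt sequence, not UGT1A8
print(candidate_donor_sites(toy_exon))
# [(6, True), (11, False), (16, False)]
```

In this toy example only the first candidate removes a multiple of three nucleotides (15 nt), the situation expected for variants that encode proteins with internal deletions rather than frameshifted products.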
Both canonical and variant transcripts of the same gene are generated using the same pre-mRNA; hence, the generation of variant transcripts can reduce the amount of the pre-mRNA available for processing into the canonical transcript. The UGT1A_v2/3 levels were estimated to be less than 10% of UGT1A_v1 levels in liver (Jones et al., 2012; Tourancheau et al., 2018). Similarly, most of the variants reported in the present study were expressed in liver (1A1_n1, 1A3_n3, 1A3_n4, 1A5_n1), kidney (1A9_n2), or colorectal tissues (1A10_n7) at levels that were generally less than 5% of the levels of their canonical counterparts. Variants with such low abundance may only have a minor impact on the output of their canonical v1 transcripts. In contrast, 1A8_n2 was expressed in almost all normal colorectal tissues and more than half of CRC tumors, and the average 1A8_n2/1A8_v1 ratio was close to 50%. Moreover, in many CRC samples, 1A8_n2 showed equal or higher abundance than 1A8_v1. Thus, variant splicing of the 1A8 pre-mRNA can significantly reduce the amount of functional 1A8 mRNA (1A8_v1) and protein (1A8_i1), representing a novel mechanism to negatively regulate the function of the 1A8 gene at the splicing level. We propose that highly active cryptic splice sites represent important cis-regulatory elements that can negatively regulate the output of the canonical UGT1A_v1s at the splicing level. Cryptic splice sites that are used to generate other abundant UGT variant transcripts (e.g., UGT2A2Δexon3, UGT2B7_n4) in functionally relevant human tissues may have a similar regulatory role (Bushey et al., 2012; Rouleau et al., 2016).
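The competition argument above can be made concrete with a short Python sketch. It rests on simplifying assumptions that are not tested in this study: the canonical and variant transcripts are the only two products of the pre-mRNA, and their steady-state abundance reflects splice site choice alone (equal transcript stability). Under these assumptions, a variant/canonical ratio r leaves a fraction 1/(1 + r) of the spliced output as canonical transcript. The function name is hypothetical.

```python
def canonical_fraction(variant_to_canonical_ratio: float) -> float:
    """Fraction of spliced output that is canonical transcript, assuming
    only two products and equal transcript stability (a simplification)."""
    if variant_to_canonical_ratio < 0:
        raise ValueError("ratio must be non-negative")
    return 1.0 / (1.0 + variant_to_canonical_ratio)


# Ratios chosen to mirror the ranges discussed in the text.
for ratio in (0.05, 0.5, 1.0):
    print(f"ratio {ratio:.2f}: canonical fraction {canonical_fraction(ratio):.2f}")
# ratio 0.05: canonical fraction 0.95
# ratio 0.50: canonical fraction 0.67
# ratio 1.00: canonical fraction 0.50
```

Under this simple model, a variant present at 5% of the canonical level diverts only about 5% of the output, whereas a ratio near 0.5 diverts about one-third and a ratio of 1.0 diverts half.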
The abundance of 1A8_n2 transcripts in gut prompts further analysis of its biologic function(s) in future studies. Intestinal UGT1A8 is important for the first-pass metabolism of many diet-derived bioactive chemicals (e.g., flavonoids) as well as orally delivered drugs (Cheng et al., 1998; Hu et al., 2014). It was recently shown that gut-derived flavonoid-glucuronides are subject to enterohepatic recycling, increasing their half-life (Zeng et al., 2016). Moreover, glucuronidation may modulate the bioactivity of flavonoids both in the gut and in other tissues to which they are distributed via the systemic circulation (Docampo et al., 2017). There is also evidence that flavonoids such as apigenin and luteolin modulate splicing directly by binding to splicing factors and altering the recognition of weak splice sites (Kurata et al., 2019). We recently found that flavonoids can induce UGT1A8 transcription by increasing promoter activity, suggesting a feedback loop by which flavonoids may regulate their own exposure and activity (data not shown). An intriguing hypothesis is that flavonoids may also regulate the splicing of the UGT1A8 pre-mRNA to further fine-tune this feedback mechanism. Future studies could assess whether flavonoids can regulate the 1A8_n2/1A8_v1 expression ratio and whether this in turn can influence flavonoid metabolism via the glucuronidation pathway.
In conclusion, the activation of cryptic donor splice sites within the UGT1A first-exon region expands the UGT1A transcriptome and proteome. Almost all cryptic splice sites are weak sites that generate rare variant transcripts. However, the 1A8_n2 cryptic donor splice site is highly active in colorectal tissues and likely acts as an important cis-regulatory element that negatively regulates the function of the UGT1A8 gene at the splicing level. Finally, the variant protein 1A8_i3 was unable to inhibit the glucuronidation of 4MU by 1A8_i1; however, it remains to be determined whether the aglycone-binding domain-truncated UGT1A proteins can inhibit the activity of UGT1A_i1s with other substrates.
Data Availability
The RNA-Seq datasets that support the findings of this study are openly available in Sequence Read Archive (https://www.ncbi.nlm.nih.gov/sra). All other data presented are contained within the manuscript and Supplemental Material.
Authorship Contributions
Participated in research design: Hu, Mackenzie, Meech.
Conducted experiments: Hu, Hulin, Ansaar, Meech.
Performed data analysis: Hu, Marri, Hulin, Meech.
Wrote or contributed to the writing of the manuscript: Hu, Marri, Hulin, Mackenzie, McKinnon, Meech.
Footnotes
- Received October 14, 2023.
- Accepted March 18, 2024.
This study was supported by the National Health and Medical Research Council (NHMRC) of Australia [Grant 1143175] (to D.G.H., P.I.M., R.A.M., and R.M.) and the Australian Research Council [Grant DP210103065] (to R.M.).
No author has an actual or perceived conflict of interest with the contents of this article.
This article has supplemental material available at dmd.aspetjournals.org.
Abbreviations
- aa, amino acid
- BLAD, urinary bladder cancer
- CaptureSeq, RNA capture sequencing
- CRC, colorectal cancer
- HCC, hepatocellular carcinoma
- HEK, human embryonic kidney
- 4MU, 4-methylumbelliferone
- 4MUG, 4-methylumbelliferone glucuronide
- nt, nucleotide
- ORF, open reading frame
- PCR, polymerase chain reaction
- RCC, renal cell carcinoma
- RNA-Seq, RNA sequencing
- SRA, Sequence Read Archive
- UGT, UDP-glucuronosyltransferase

Copyright © 2024 by The American Society for Pharmacology and Experimental Therapeutics