Similar literature
Found 20 similar articles (search time: 31 ms)
1.
2.
3.
Untranslated gene regions (UTRs) play an important role in controlling gene expression. 3′-UTRs are primarily targeted by microRNA (miRNA) molecules that form complex gene regulatory networks. Cancer genomes are replete with non-coding mutations, many of which are connected to changes in tumor gene expression that accompany the development of cancer and are associated with resistance to therapy. Therefore, variants that occurred in 3′-UTR under cancer progression should be analysed to predict their phenotypic effect on gene expression, e.g., by evaluating their impact on miRNA target sites. Here, we analyze 3′-UTR variants in DICER1 and DROSHA genes in the context of myelodysplastic syndrome (MDS) development. The key features of this analysis include an assessment of both “canonical” and “non-canonical” types of mRNA-miRNA binding and tissue-specific profiling of miRNA interactions with wild-type and mutated genes. As a result, we obtained a list of DICER1 and DROSHA variants likely altering the miRNA sites and, therefore, potentially leading to the observed tissue-specific gene downregulation. All identified variants have low population frequency consistent with their potential association with pathology progression.

4.
MicroRNAs (miRNAs) are small RNAs that regulate the expression of target mRNAs by binding specifically to the mRNA 3′UTR and, in the majority of cases, promoting mRNA degradation. It is often of interest to know the specific targets of a miRNA in order to study them in a particular disease context. To that end, some databases have been designed to predict potential miRNA-mRNA interactions based on hybridization sequences. However, one of the main limitations is that these databases have too many false positives and do not take into account disease-specific interactions. We have developed an R package (miRComb) able to combine miRNA and mRNA expression data with hybridization information, in order to find potential miRNA-mRNA targets that are more likely to occur in a specific physiological or disease context. This article summarizes the pipeline and the main outputs of this package, using as an example TCGA data from five gastrointestinal cancers (colon cancer, rectal cancer, liver cancer, stomach cancer and esophageal cancer). The obtained results can be used by other authors to develop a large number of testable hypotheses. Overall, we show that the miRComb package is a useful tool for dealing with miRNA and mRNA expression data: it helps to filter the large number of miRNA-mRNA interactions obtained from pre-existing miRNA target prediction databases, and it presents the results in a standardised way (pdf report). Moreover, an integrative analysis of the miRComb miRNA-mRNA interactions from the five digestive cancers is presented. Therefore, miRComb is a very useful tool to start understanding miRNA gene regulation in a specific context. The package can be downloaded from http://mircomb.sourceforge.net.
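The central idea in this abstract, keeping only those predicted miRNA-mRNA pairs whose expression is also negatively correlated across samples, can be sketched in a few lines of Python. This is not the miRComb API (miRComb is an R package); the function names, the correlation cutoff `max_r`, and all expression values below are hypothetical.

```python
def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def supported_pairs(mirna_expr, mrna_expr, predicted_pairs, max_r=-0.5):
    """Keep predicted (miRNA, mRNA) pairs whose correlation is <= max_r."""
    kept = []
    for mirna, mrna in sorted(predicted_pairs):
        r = pearson(mirna_expr[mirna], mrna_expr[mrna])
        if r <= max_r:
            kept.append((mirna, mrna, r))
    return kept

# Hypothetical expression values across five samples.
mirna_expr = {"miR-A": [1.0, 2.0, 3.0, 4.0, 5.0]}
mrna_expr = {
    "GENE1": [9.0, 8.1, 6.9, 6.2, 5.0],  # falls as miR-A rises
    "GENE2": [2.0, 2.5, 3.1, 3.9, 5.2],  # rises with miR-A
}
# Hypothetical output of a sequence-based target prediction database.
predicted = {("miR-A", "GENE1"), ("miR-A", "GENE2")}
result = supported_pairs(mirna_expr, mrna_expr, predicted)
```

Only the pair with GENE1 survives the filter here, because GENE2 is predicted by sequence but rises together with the miRNA.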

5.
6.
7.
Animal microRNA (miRNA) target prediction is still a challenge, although many prediction programs have been exploited. MiRNAs exert their function through partially binding the messenger RNAs (mRNAs; likely at 3′ untranslated regions [3′UTRs]), which makes it possible to detect the miRNA-mRNA interactions in vitro by co-transfection of miRNA and a luciferase reporter gene containing the target mRNA fragment into mammalian cells under a dual-luciferase assay system. Here, we constructed a human miRNA expression library and used a dual-luciferase assay system to perform large-scale screens of interactions between miRNAs and the 3′UTRs of seven genes, which included more than 3,000 interactions with triplicate experiments for each interaction. The screening results showed that the 3′UTR of one gene can be targeted by multiple miRNAs. Among the prediction algorithms, a Bayesian phylogenetic miRNA target identification algorithm and a support vector machine (SVM) presented a relatively better performance (27% for EIMMo and 24.7% for miRDB) against the average precision (17.3%) of the nine prediction programs used here. Additionally, we noticed that a relatively high conservation level was shown at the miRNA 3′ end targeted regions, as well as the 5′ end (seed region) binding sites.

8.
9.
10.
11.
12.
13.
14.
Tomato Genomic Resources Database (TGRD) allows interactive browsing of tomato genes, micro RNAs, simple sequence repeats (SSRs), important quantitative trait loci and Tomato-EXPEN 2000 genetic map altogether or separately along twelve chromosomes of tomato in a single window. The database is created using sequence of the cultivar Heinz 1706. High quality single nucleotide polymorphic (SNP) sites between the genes of Heinz 1706 and the wild tomato S. pimpinellifolium LA1589 are also included. Genes are classified into different families. 5′-upstream sequences (5′-US) of all the genes and their tissue-specific expression profiles are provided. Sequences of the microRNA loci and their putative target genes are catalogued. Genes and 5′-US show presence of SSRs and SNPs. SSRs located in the genomic, genic and 5′-US can be analysed separately for the presence of any particular motif. Primer sequences for all the SSRs and flanking sequences for all the genic SNPs have been provided. TGRD is a user-friendly web-accessible relational database and uses CMAP viewer for graphical scanning of all the features. Integration and graphical presentation of important genomic information will facilitate better and easier use of tomato genome. TGRD can be accessed as an open source repository at http://59.163.192.91/tomato2/.

15.
16.
17.
18.
Fran Supek, Tomislav Šmuc. Genetics, 2010, 185(3): 1129-1134
A recent investigation concluded that codon bias did not affect expression of green fluorescent protein (GFP) variants in Escherichia coli, while stability of an mRNA secondary structure near the 5′ end played a dominant role. We demonstrate that combining the two variables using regression trees or support vector regression yields a biologically plausible model with better support in the GFP data set and in other experimental data: codon usage is relevant for protein levels if the 5′ mRNA structures are not strong. Natural E. coli genes had weaker 5′ mRNA structures than the examined set of GFP variants and did not exhibit a correlation between the folding free energy of 5′ mRNA structures and protein expression.

In genomes, natural selection may act on silent sites of codons to make translation of highly expressed genes more efficient, an effect linked primarily to abundances of tRNA isoacceptor molecules (Ikemura 1985; Bulmer 1987; Kanaya et al. 1999). Codon choice may also be linked to formation of secondary structures in mRNA that reduce protein levels, as has been shown with haplotypes of the human COMT gene (Nackley et al. 2006). Kudla et al. (2009) have recently reported an experiment that contributes toward understanding how synonymous codon usage shapes gene expression. They have constructed a library of 154 synthetic variants of a green fluorescent protein (GFP) gene that varied randomly at synonymous sites while retaining the original amino acid sequence. The authors concluded that codon usage (CU) bias did not correlate with protein levels measured as fluorescence of the GFP, but also that the minimum free energy of an mRNA secondary structure in a 42-nucleotide region at [−4,37] that overlaps the start codon (“hairpin stability”) bears a great significance.
CU bias was quantified by the widely used codon adaptation index (CAI) method (Sharp and Li 1987), essentially a measure of the distance of a gene’s codon usage to the codon usage of a predefined set of highly expressed genes. The CAI and some of its more recent alternatives, such as measure independent of length and composition (MILC) (Supek and Vlahovicek 2005), have been shown to be a viable surrogate for gene expression in various unicellular organisms. Additionally, in a multiple linear regression of rank fluorescence against a number of sequence-derived attributes, including CAI and the abovementioned hairpin stability, Kudla et al. (2009) did not find CAI to contribute significantly toward the prediction of protein levels, in contrast to the hairpin stability.
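As a rough illustration of the CAI, the index can be computed as the geometric mean of per-codon relative adaptiveness weights w, where each w is a codon’s count in the reference set of highly expressed genes divided by the count of the most used synonymous codon for the same amino acid (Sharp and Li 1987). The Python sketch below follows that definition; the reference counts and the two-amino-acid codon table are made-up placeholders, not E. coli data.

```python
import math

# Hypothetical toy data: codon counts in a reference set of highly
# expressed genes, grouped by amino acid (not real E. coli values).
REFERENCE_COUNTS = {
    "K": {"AAA": 75, "AAG": 25},
    "E": {"GAA": 60, "GAG": 40},
}

def relative_adaptiveness(reference_counts):
    """Return w[codon] = count / max count among its synonymous codons."""
    weights = {}
    for codon_counts in reference_counts.values():
        top = max(codon_counts.values())
        for codon, count in codon_counts.items():
            weights[codon] = count / top
    return weights

def cai(sequence, weights):
    """Geometric mean of w over the codons of `sequence` that have a weight."""
    codons = [sequence[i:i + 3] for i in range(0, len(sequence) - 2, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))

WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS)
optimal = cai("AAAGAA", WEIGHTS)     # only the most frequent codons
suboptimal = cai("AAGGAG", WEIGHTS)  # only the rarer synonymous codons
```

Both toy sequences encode Lys-Glu, so the difference between `optimal` and `suboptimal` comes from synonymous codon choice alone, which is the property the CAI is meant to capture.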

Both the codon adaptation index and the 5′ mRNA secondary structures influence protein levels in the Kudla et al. data:

The described statistical analyses, however, failed to address the case in which a nonlinear three-way dependency between hairpin stability, codon usage, and fluorescence might exist; data are visualized in Figure 1, A–C, and in figure 2B in Kudla et al. Such complex patterns in data are readily captured by the support vector machines (SVM) algorithm, reviewed in Noble (2006) and Ben-Hur et al. (2008). We have employed the SVM with a radial basis function kernel to regress fluorescence against both hairpin stability and CAI simultaneously (Figure 1B) and computed the Pearson’s correlation coefficient in cross-validation (here denoted as Q) between true and predicted values of fluorescence (see File S1). A linear model based solely on hairpin stability as employed by Kudla et al. (Figure 1A) can explain Q² = 38.6% of variance in protein levels, while the nonlinear SVM regression that takes CAI into account explains Q² = 52.2% of variance. The difference in Q is statistically significant at P = 10⁻¹⁹⁰ (paired t-test). Note that Kudla et al. utilize the Spearman rank correlation coefficient (ρ) in their article; the hairpin stability would explain ρ² = 44.6% of the variance in expression levels if the requirement for a linear relationship was abandoned in this manner.

Figure 1. Regression of protein levels against folding free energy of an mRNA hairpin at nucleotides −4 through 37 (A), against the hairpin free energy and the codon adaptation index (Sharp and Li 1987) (B and C), or against the hairpin free energy and the codon frequencies (D and E). The colors show the measured protein levels, while the background shading reflects the protein levels predicted by the specific model. (A) Predictions by linear regression. (B and E) Predictions by a support vector machine with a radial basis function kernel. (C) Predictions by an M5′ regression tree.
(D) A schematic of the M5′ model, where coefficients in the terminal nodes are derived from data where protein levels, all codon frequencies, and hairpin free energies were normalized to [0,1] to facilitate comparison between the influence of codons, the hairpin stability, and the constant in the regression equation. All coefficients ≥0.1 are in boldface type. In the plots (A–C and E), a slight amount of random “jitter” was introduced to the point positions (at most, 3% of the range of each axis) to better visualize overlapping points. In the plot in E, a single outlying point is not shown. See Figure S2 for the same plots without jitter and with the outlier in E included. R² is the squared Pearson’s correlation coefficient between actual and model-predicted protein levels; Q² is similar, but obtained in cross-validation (10-fold, 100 runs), and is a more conservative estimate of regression accuracy.

Figure 2. The distributions of RNA folding free energies of a 42-nucleotide window in the mRNA between positions −4 and +37, where the “A” in the “AUG” start codon has index zero. The distributions are shown separately for the 154 gene variants from Kudla et al. (2009) and for the genes from the E. coli K12 genome. The dotted line indicates the 5th percentile of the E. coli values at −10.9 kcal/mol.

Compared to the SVM, a more interpretable generalization of the data can be achieved by a different nonlinear regression approach, the M5′ tree (Wang and Witten 1997), which recursively divides the data to reduce the variance of the dependent variable within each partition and then builds separate linear models for the partitions. The resulting regression tree (Figure 1C; supporting information, Figure S1) better explains the correlation between protein levels on one side and hairpin stability and CAI on the other side when compared to a linear model employed by Kudla et al. that regresses protein levels against hairpin stability only [see figure 2B in Kudla et al. (2009) and Figure 1A]; 9.3% more variance is explained by the M5′, P = 10⁻⁹¹ (paired t-test). An interpretation that follows from the general structure of the M5′ tree (Figure S1) is that, at high mRNA hairpin stability, protein levels will generally be quite low and not dependent on CAI; in contrast, with less stable mRNA hairpins, both hairpin stability and CAI play a role in determining protein levels. In the interpretation of the M5′ tree structure, we would place less emphasis on the exact coefficients of the linear models in the leaves because the reliability of these fine-grained features of the M5′ model can strongly depend on good coverage of all parts of the mRNA–CAI space by data points.
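The partitioning step of a regression tree can be sketched briefly. The code below is not the M5′ algorithm itself, which additionally fits linear models in the leaves and prunes the tree; it only shows how a single threshold on one predictor might be chosen so that the size-weighted variance of the dependent variable within the two partitions is smallest. The function names and the data values are invented for illustration.

```python
def variance(values):
    """Population variance; 0.0 for an empty list."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Return (threshold, weighted_variance) for the best split x <= threshold."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, variance(y))  # baseline: no split at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for _, v in pairs[:i]]
        right = [v for _, v in pairs[i:]]
        score = (len(left) * variance(left) + len(right) * variance(right)) / n
        if score < best[1]:
            best = (threshold, score)
    return best

# Hypothetical hairpin free energies (kcal/mol) and normalized protein levels.
free_energy = [-14.0, -13.0, -12.0, -8.0, -7.0, -6.0]
protein = [0.1, 0.2, 0.1, 0.6, 0.9, 0.7]
threshold, score = best_split(free_energy, protein)
```

With these toy values the chosen threshold falls between the stable and the weak hairpins; a full tree would then repeat the search within each partition and over every predictor.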

The CAI may not be an optimal summary of codon usage for predicting expression of overexpressed genes:

Regarding use of CAI in the present context, it should be noted that CAI’s original purpose was to serve as a proxy for gene expression in conditions of abundance that result in fast growth in the organism’s environmental niche. The CAI or related approaches (Supek and Vlahovicek 2005) may not, however, be an ideal representation of codon usage when examining overexpression of a foreign protein at levels that exceed the natural abundances of the host’s most highly expressed proteins. This was indeed shown to be the case in a recent article by Welch et al. (2009) in which the authors reported an experiment with heterologous expression of variants of two proteins in E. coli: an antibody fragment and a phage DNA polymerase. Welch et al. found that codon frequencies in general, but not CAI specifically, correlated well with protein levels and postulated that for overexpressed proteins optimal codons would correspond to the codons translated efficiently under amino acid starvation (Elf et al. 2003; Dittmar et al. 2005). Analogously to Welch et al., we now apply our regression algorithms not to the CAI, but directly to the codon frequencies that CAI attempts to summarize in the Kudla et al. data (see File S2). An M5′ regression tree trained on the hairpin stability and codon frequencies (Figure 1D) explains 10.6% more variance (P = 10⁻⁸³, paired t-test) in protein levels than an M5′ tree trained on hairpin stability and CAI (Figure 1C, Figure S1). An SVM regression model trained on the hairpin stability and a simple linear combination of selected codon frequencies (Figure 1E) explains 8.8% more variance (P = 10⁻⁸², paired t-test) than the SVM that uses CAI (Figure 1B).
An SVM trained on the hairpin stability and the full set of codon frequencies (not shown in Figure 1) explains Q² = 65.0% of variance in the protein abundances, a sizable increase (P ≈ 10⁻²⁶⁰, paired t-test) compared to a linear regression on solely the [−4,37] hairpin stability (Q² = 38.6%) as originally employed by Kudla et al. and also as compared to a set of randomized controls (Q² = 20.1–30.7%; Table S1). Therefore, not relying on a predefined notion of codon optimality (as embodied in the CAI) further strengthens the argument that the correlation of CU and protein levels is far from negligible in this data set.

Additionally, we found some correlation between codon frequencies and 5′ mRNA hairpin stability in the Kudla et al. gene variants (Figure S4). The fact that the two factors were not completely independent adds weight to the relevance of CU to protein levels since one could not be certain that even the variance in protein levels explained by 5′ mRNA structures is wholly due to the structures themselves and not to the confounding variables (here, the codon frequencies).

The M5′ tree trained on codon frequencies (Figure 1D) follows the same general structure as the M5′ tree trained on the CAI (Figure S1) where the codon frequencies become relevant with mRNA hairpins weaker than −9.75 kcal/mol, while with stronger [−4,37] mRNA hairpins protein levels are generally low. Our interpretation is that the lack of a stable secondary structure that could obstruct translational initiation is a necessary but not a sufficient condition for high protein expression. When the initiation phase is unhindered, the bottleneck would shift to the elongation phase in which codon optimality plays an important role.
In the literature, theoretical models of translation may consider either the initiation (Bulmer 1991) or the elongation phase (Xia 1998) as the rate-limiting step of translation under physiological conditions; we are not aware of such analyses describing translation of artificially overexpressed genes.

The codons identified as relevant by our M5′ model of the Kudla et al. data are different from, but not inconsistent with, those proposed by Welch et al. (Table S2). We anticipate that the rules for codon optimality for overexpression in an Escherichia coli host will become better defined as more large-scale experiments, such as the two discussed here (Kudla et al. 2009; Welch et al. 2009), are carried out.
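Using codon frequencies directly, rather than a single summary such as the CAI, means giving the regression one feature per codon. The Python sketch below shows one way such a feature vector might be derived from a coding sequence. The function name and the choice to normalize by the total number of codons are assumptions for illustration; the normalization actually used in the analysis (for example, relative to synonymous codons) is not stated in this text.

```python
from collections import Counter
from itertools import product

# All 64 codons in a fixed order so every gene yields a same-length vector.
ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def codon_frequencies(cds):
    """Map each of the 64 codons to its fraction of all codons in `cds`."""
    cds = cds.upper().replace("U", "T")
    if not cds or len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a nonzero multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: counts.get(codon, 0) / total for codon in ALL_CODONS}

# Two synonymous toy sequences: both encode Met-Lys-Lys-Glu.
variant_a = codon_frequencies("ATGAAAAAAGAA")
variant_b = codon_frequencies("ATGAAGAAGGAG")
```

The two variants encode the same peptide yet differ in their AAA/AAG and GAA/GAG entries, which is the kind of variation the synonymous GFP library provides to the regression models.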

The “RNA structure + codon usage” model agrees with independent experimental data and is robust to removal of extreme values:

Our reanalysis of the Kudla et al. data should be viewed in light of the conclusions of Welch et al. (2009) who find that codon usage, but not the 5′ hairpin stability, correlates with protein levels in their data, while noting that their gene variants generally have considerably weaker 5′ mRNA hairpins than the sequences in Kudla et al. Welch et al. reconcile the different outcomes of the two experiments by noting that “inhibition of initiation by especially strong mRNA structure would obscure effects resulting from factors that influence elongation, such as codon usage” (page 9). Here we propose that precisely the same model can be derived solely from the Kudla et al. data. Furthermore, we find that the 154 gene variants from Kudla et al. indeed do have unusually stable 5′ mRNA hairpins (mean free energy = −9.68 kcal/mol) in comparison to natural E. coli genes (mean free energy = −6.15 kcal/mol) (P = 10⁻³⁸ by Mann–Whitney U-test; see Figure 2). The part of the distribution of Kudla et al. gene variants that overlaps with the bulk of the E. coli genes, with 5′ mRNA hairpin free energies lower than ∼ −10 kcal/mol, corresponds to the range where our M5′ model indicates a stronger influence of CU on protein levels (Figure S1, Figure 1D).

We investigate to what extent the presence of a group of sequences extreme in their 5′ mRNA hairpin stabilities in the Kudla et al. data set (left peak in Figure 2) influenced the authors’ conclusion that the hairpin stabilities have an overarching influence on protein levels. After removing the sequences below the 5th percentile of the E. coli natural hairpin stabilities (−10.9 kcal/mol), we were left with 109 of the original 154 Kudla et al. sequences. The accuracy of regressing protein levels against mRNA hairpin stability deteriorates greatly (Q² = 18.5%) after removing the 45 sequences, but less so with SVM and M5′ regression that take into account both CU and the hairpin stability (Table 1). [...] Kudla et al. basically captured the difference between these extreme cases, in which very strong 5′ mRNA secondary structures blocked expression, and all other sequences. However, to explain the variation in protein levels within the nonextreme set, hairpin stabilities by themselves are not sufficient and need to be complemented with CU.

TABLE 1

Accuracy of the regression of protein levels against 5′ mRNA hairpin stability or against 5′ mRNA hairpin stability and codon frequencies
Data set                     | Linear regression, hairpin stability only (%) | SVM, hairpin stability + codon frequencies (%) | M5′, hairpin stability + codon frequencies (%)
Full (n = 154)               | 38.6 | 65.0 | 56.7
No strong hairpins (n = 109) | 18.5 | 53.0 | 40.4
The cross-validation correlation coefficient squared (Q²) is compared between the full Kudla et al. data set (154 proteins) and the reduced data set (109 proteins) where mRNA hairpin folding energies are ≥ −10.9 kcal/mol, the 5th percentile of natural E. coli genes.

In addition to measuring protein levels in the 154-sequence data set, Kudla et al. performed an additional experiment where an unstructured 28-codon tag was fused to 5′ ends of 72 (of 154) GFP sequence variants. Adding the tag was found to enhance protein levels, supporting the conclusion of Kudla et al. that 5′ structure of mRNA had a strong influence on protein production. After an analysis of the data, we found (see File S3) that data from this specific experiment are not well suited to serve as a direct verification of our existing M5′ and SVM regression models. Still, we can compare the protein level predictions of our existing SVM model on the same set of sequences before and after adding the unstructured tag. We found that the predicted expression levels increased for 67 of 72 sequences (Table S3) after adding the tag that fixes 5′ mRNA folding energy at a weak −6.1 kcal/mol, a result consistent with the Kudla et al. experiment. Additionally, we have trained a new SVM regression model only on the tagged 72-sequence set (see File S2) and found that, within this set, SVM regression can again predict GFP levels solely from codon usage (5′ mRNA structure is invariant among these sequences) at Q² = 37.7%. This amount of variance is similar to, or even somewhat larger than, the difference in the variance explained by mRNA vs. mRNA+codons (38.6% vs. 65.0%) in the original data. Therefore, codon usage is of similar importance in shaping the protein levels within the tagged 72-sequence set, as it was in the original 154-sequence set.
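Q² as reported in Table 1 is the squared Pearson correlation between measured values and predictions made for held-out data. The sketch below shows the bookkeeping for a k-fold estimate in Python, with an ordinary one-variable least-squares line standing in for the SVM and M5′ learners. The helper names, fold assignment, random seeds, and synthetic data are illustrative assumptions, and the 10-fold, 100-run protocol of the analysis is not reproduced.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def fit_line(x, y):
    """Least-squares slope and intercept for y ~ x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

def q_squared(x, y, folds=10, seed=0):
    """Squared correlation between y and out-of-fold predictions."""
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    predicted = [0.0] * len(x)
    for f in range(folds):
        test = order[f::folds]  # every folds-th shuffled index
        held_out = set(test)
        train = [i for i in order if i not in held_out]
        slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
        for i in test:
            predicted[i] = slope * x[i] + intercept
    return pearson(y, predicted) ** 2

# Synthetic example: a noisy linear relationship (not the GFP data).
rng = random.Random(1)
x = [rng.uniform(-15, -3) for _ in range(60)]
y = [0.05 * v + 1.0 + rng.gauss(0, 0.1) for v in x]
r2 = pearson(x, y) ** 2  # in-sample R² of a straight-line fit
q2 = q_squared(x, y)     # cross-validated Q²
```

Because every prediction entering `q2` comes from a model that never saw that data point, Q² guards against overfitting in a way that the in-sample R² does not, which matters most for flexible learners such as the SVM.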

mRNA 5′ end secondary structure stabilities do not correlate with protein levels for natural E. coli genes:

To further verify our proposed model, we analyzed the relative contributions of mRNA hairpin stabilities and CU on expression levels of natural E. coli genes (see File S2). If the hairpin stabilities were limiting for expression in the range of folding free energies spanned by the E. coli mRNAs, one would expect to see a correlation between the free energy of mRNA 5′ end folding and the abundance of the corresponding protein. We found no such correlation using the folding free energies of the [−4,37] mRNA region (Figure 3) or equal-sized regions centered around the start codon at [−20,21] or on the expected location of a Shine–Dalgarno sequence (Shultzaberger et al. 2001) at [−30,11] (see Figure S3). Unsurprisingly, CAI correlated well with protein levels (Figure 3) in all examined experimental data sets (Lopez-Campistrous et al. 2005; Lu et al. 2007; Ishihama et al. 2008). Therefore, within the boundaries of the mRNA folding free energies spanned by E. coli genes, the CU plays a dominant role in shaping gene expression (or the CU may possibly be shaped by the expression; see Concluding remarks). As for the stronger mRNA hairpins with free energies < −11 kcal/mol, they are present in the Kudla et al. data, but are very rare in the E. coli genome, which could be explained by one or both of two scenarios: (i) above a certain threshold, the mRNA hairpin stability may become so detrimental to expression that all the mutants having such hairpins are subject to very strong negative selection and therefore are absent from the genome; and/or (ii) the Kudla et al. data set may not be representative of the genes in the E. coli genome or the mutational processes they undergo; for example, the amino acid sequence of the GFP’s beginning might be unusually conducive to forming RNA hairpins. Unless further analyses prove differently, it seems reasonable to surmise that in natural E. coli genes mRNA secondary structures would shape expression if they were highly stable, consistent with the finding of a universal (albeit not particularly strong) trend toward avoidance of 5′ mRNA structures in genomes (Gu et al. 2010). However, it can also be concluded at this point, and with more confidence, that at lower secondary structure stabilities the CU has an overarching influence on expression. Such a model of expression-related gene sequence determinants in E. coli is fully consistent with our interpretation of the M5′ regression tree that we have derived from the Kudla et al. data.

Figure 3. Correlations between the E. coli absolute protein abundances measured in three independent experiments (Lopez-Campistrous et al. 2005; Lu et al. 2007; Ishihama et al. 2008) and the codon adaptation index (CAI) or the free energy of folding of a secondary structure in the mRNA [−4,37] region (in kcal/mol; more negative values denote a more stable RNA secondary structure). “ρ” is the Spearman’s rank correlation coefficient.
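The Spearman rank correlation (ρ) used for Figure 3 is the Pearson correlation computed on ranks, with tied values sharing their average rank. A minimal Python sketch follows; the function names and the CAI and abundance values are invented for illustration and are not taken from the cited data sets.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    va = sum((x - ma) ** 2 for x in ra)
    vb = sum((y - mb) ** 2 for y in rb)
    return cov / (va * vb) ** 0.5

# Hypothetical values: abundance rises monotonically but not linearly with CAI.
cai_values = [0.30, 0.45, 0.50, 0.70, 0.80]
abundance = [10, 40, 45, 900, 20000]
rho = spearman(cai_values, abundance)
```

Working on ranks makes ρ insensitive to the very wide dynamic range of protein abundances, so a monotonic but strongly nonlinear relationship such as the toy one above still scores as a perfect correlation.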

Concluding remarks:

We argue that Kudla et al. worked with a set of gene sequences in which strong mRNA secondary structures (that effectively abolished expression) were frequent enough to mask the relevance of codon frequencies on protein levels when examined only with linear regression methods. While mRNA secondary structures can certainly occur when designing synthetic genes, it is highly questionable to what extent the conclusion of Kudla et al. that CU is of little importance for expression would be generally valid for biotechnological applications, especially since we have shown that the influence of CU is nevertheless present even in the Kudla et al. data. What is beyond doubt, however, is that a strong 5′ mRNA secondary structure can be a roadblock in heterologous expression, and therefore the synthetic gene variants harboring such structures should be avoided. The more specific rules regarding the exact location of the hairpin on the gene sequence, the hairpin’s length, or the tolerable levels of folding free energy will have to be established by further experimentation.

A recent algorithm for estimating the efficiency of ribosomal binding sites from the mRNA sequence (Salis et al. 2009) explicitly takes into account the folding free energy of RNA secondary structures, along with other factors. When protein overexpression is desired, the conclusions of Welch et al. and (by our reanalysis) the Kudla et al. data indicate that CU should be optimized in addition to the ribosome binding site sequence to ensure that both initiation and elongation phases of translation are free of impediments.

On the basis of their results, Kudla et al. also discuss the evolutionary link between the CU of natural genes and the expression levels of proteins for which they code.
They propose that selection for translational efficiency acts at a global level in cells; the codons that accelerate elongation would be preferred in a highly expressed gene not because they facilitate production of that particular protein, but to free up ribosomes for the rate-determining initiation phase of translation of the total cellular mRNA pool. Effectively, the flow of causality between CU and expression would be reversed in comparison to the established view. This hypothesis should be critically reevaluated because it depends on the assertion that manipulating a gene’s CU cannot cause protein levels to increase, an assertion poorly supported by the Kudla et al. data.

19.
20.