Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Liu H  Han H  Li J  Wong L 《In silico biology》2004,4(3):255-269
The translation initiation site (TIS) prediction problem concerns how to correctly identify TISs in mRNA, cDNA, or other types of genomic sequences. Accurate prediction is an important step in genomic analysis, because it helps determine protein coding from nucleotide sequences. In this paper, we present an in silico method to predict translation initiation sites in vertebrate cDNA or mRNA sequences. The method consists of three sequential steps. In the first step, candidate features are generated using k-gram amino acid patterns. In the second step, a small number of top-ranked features are selected by an entropy-based algorithm. In the third step, a classification model is built to recognize true TISs by applying support vector machines or ensembles of decision trees to the selected features. We have tested our method on several independent data sets, including two public ones and our own extracted sequences. The experimental results are better than those reported previously on the same data sets. This high accuracy not only demonstrates the feasibility of our method, but also indicates that there might be "amino acid" patterns around TISs in cDNA and mRNA sequences.
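
The first step of the method above (k-gram amino acid features around a candidate ATG) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 30-nt window, and the choice to translate in the frame of the candidate ATG are assumptions made for illustration only.

```python
from collections import Counter

# Standard genetic code, written compactly with bases in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(seq):
    """Translate from position 0; codons with non-ACGT letters become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def kgram_features(seq, atg_pos, k=2, window=30):
    """Count amino acid k-grams upstream and downstream of a candidate ATG.

    Hypothetical helper: both flanks are translated in the frame of the
    candidate ATG, and `window` (in nucleotides) is an assumed value.
    """
    seq = seq.upper().replace("U", "T")
    start = atg_pos - (min(window, atg_pos) // 3) * 3  # stay in frame
    up = translate(seq[start:atg_pos])
    down = translate(seq[atg_pos + 3:atg_pos + 3 + window])
    feats = Counter()
    for label, pep in (("UP", up), ("DOWN", down)):
        for i in range(len(pep) - k + 1):
            feats[f"{label}_{pep[i:i + k]}"] += 1
    return feats


features = kgram_features("GCCGCCATGGCTGATGCT", atg_pos=6, k=2)
```

Feature vectors of this kind would then be ranked (the abstract names an entropy-based algorithm) and passed to a classifier; neither of those steps is sketched here.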

2.
Feature selection for the prediction of translation initiation sites   Total citations: 3, self-citations: 0, citations by others: 3
Translation initiation sites (TISs) are important signals in cDNA sequences. In many previous attempts to predict TISs in cDNA sequences, three major factors affect the prediction performance: the nature of the cDNA sequence sets, the relevant features selected, and the classification methods used. In this paper, we examine different approaches to select and integrate relevant features for TIS prediction. The top selected significant features include the features from the position weight matrix and the propensity matrix, the number of nucleotide C in the sequence downstream of the ATG, the number of downstream stop codons, the number of upstream ATGs, and the number of some amino acids, such as amino acids A and D. With the numerical data generated from these features, different classification methods, including decision tree, naive Bayes, and support vector machine, were applied to three independent sequence sets. The identified significant features were found to be biologically meaningful, and the experiments showed promising results.
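
Three of the count-based features named above (upstream ATGs, downstream stop codons, and downstream C content) can be computed with a short sketch. The function name, the 99-nt window, and the restriction to in-frame stop codons are assumptions; the abstract does not state how the counts were defined.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def tis_count_features(seq, atg_pos, window=99):
    """Count-based features around a candidate ATG at index `atg_pos`.

    Hypothetical helper: the window size and the in-frame restriction on
    stop codons are illustrative choices, not taken from the paper.
    """
    seq = seq.upper().replace("U", "T")
    upstream = seq[max(0, atg_pos - window):atg_pos]
    downstream = seq[atg_pos + 3:atg_pos + 3 + window]
    in_frame = [downstream[i:i + 3] for i in range(0, len(downstream) - 2, 3)]
    return {
        # ATG triplets in any frame upstream of the candidate
        "upstream_atg": sum(upstream.startswith("ATG", i)
                            for i in range(len(upstream) - 2)),
        # stop codons in the reading frame set by the candidate ATG
        "downstream_stop": sum(codon in STOP_CODONS for codon in in_frame),
        # number of C nucleotides downstream of the candidate ATG
        "downstream_c": downstream.count("C"),
    }


print(tis_count_features("ATGCCATGGCCTAACC", atg_pos=5))
# {'upstream_atg': 1, 'downstream_stop': 1, 'downstream_c': 4}
```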

3.
4.
MOTIVATION: Prediction of the coding potential for stretches of DNA is crucial in gene calling and genome annotation, where it is used to identify potential exons and to position their boundaries in conjunction with functional sites, such as splice sites and translation initiation sites. The ability to discriminate between coding and non-coding sequences relates to the structure of coding sequences, which are organized in codons, and by their biased usage. For statistical reasons, the longer the sequences, the easier it is to detect this codon bias. However, in many eukaryotic genomes, where genes harbour many introns, both introns and exons might be small and hard to distinguish based on coding potential. RESULTS: Here, we present novel approaches that specifically aim at a better detection of coding potential in short sequences. The methods use complementary sequence features, combined with identification of which features are relevant in discriminating between coding and non-coding sequences. These newly developed methods are evaluated on different species, representative of four major eukaryotic kingdoms, and extensively compared to state-of-the-art Markov models, which are often used for predicting coding potential. The main conclusions drawn from our analyses are that (1) combining complementary sequence features clearly outperforms current Markov models for coding potential prediction in short sequence fragments, (2) coding potential prediction benefits from length-specific models, and these models are not necessarily the same for different sequence lengths and (3) comparing the results across several species indicates that, although our combined method consistently performs extremely well, there are important differences across genomes. SUPPLEMENTARY DATA: http://bioinformatics.psb.ugent.be/.

5.
Recent technological advances have enabled the generation of large amounts of data consisting of RNA sequences and their functional activity. Here, we propose a method for extracting secondary structure features that affect the functional activity of RNA from sequence–activity data. Given pairs of RNA sequences and their corresponding bioactivity values, our method calculates position-specific structural features of the input RNA sequences, considering every possible secondary structure of each RNA. A Ridge regression model is trained using the structural features as feature vectors and the bioactivity values as response variables. Optimized model parameters indicate how secondary structure features affect bioactivity. We used our method to extract intramolecular structural features of bacterial translation initiation sites and self-cleaving ribozymes, and the intermolecular features between rRNAs and Shine–Dalgarno sequences and between U1 RNAs and splicing sites. We not only identified known structural features but also revealed more detailed insights into structure–activity relationships than previously reported. Importantly, the datasets we analyzed here were obtained from different experimental systems and differed in size, sequence length and similarity, and number of RNA molecules involved, demonstrating that our method is applicable to various types of data consisting of RNA sequences and bioactivity values.

6.
A protein-gene linkage map of the cyanobacterium Anabaena sp. strain PCC7120 was successfully constructed for 123 relatively abundant proteins. The total proteins extracted from the cell were resolved by two-dimensional electrophoresis, and the amino-terminal sequences of the protein spots were determined. By comparing the determined amino-terminal sequences with the entire genome sequence, the putative translation initiation sites of 87 genes were successfully assigned on the genome. The elucidated sequence features surrounding the translation initiation sites were as follows: (1) GTG and TTG in addition to the ATG were used as rare initiation codons; (2) the core sequences (GAGG, GGAG and AGGA) of the Shine-Dalgarno sequence were identified in the appropriate position preceding the 51 initiation sites (58.6%); (3) the nucleotides at the two regions, from -35 to -33, and from -19 to -17 (relative to the first nucleotide in the initiation codon) were preferentially adenines or thymines; (4) the nucleotides at the region from -14 to -8 were preferentially purines; (5) the nucleotide at position -1 was biased towards non-guanine (96.6%); (6) the nucleotide at the position +5 was preferentially cytosine (63.2%). It was evident that removal of the translation initiator methionine was dependent on the side-chain bulkiness of the penultimate amino acid residue. The predicted putative signal peptide sequences were also indicated. Besides confirming the existence of many predicted proteins, the data will serve as a starting point for the study of signals important in post-translational processing and nucleotide sequences important in the initiation of translation.
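
Feature (2) above, a Shine-Dalgarno core (GAGG, GGAG or AGGA) upstream of the initiation codon, lends itself to a simple scan. The sketch below is illustrative only: the abstract says "appropriate position" without giving numbers, so the spacing limits `min_gap` and `max_gap` are assumed values, and the function name is hypothetical.

```python
SD_CORES = ("GAGG", "GGAG", "AGGA")  # core sequences listed in the abstract


def find_sd_core(seq, start_pos, min_gap=4, max_gap=15):
    """Return (core, gap) pairs for SD cores upstream of a start codon.

    `gap` is the number of nucleotides between the 3' end of the core and
    the first base of the initiation codon at index `start_pos`. The
    accepted gap range is an assumption made for illustration.
    """
    seq = seq.upper().replace("U", "T")
    hits = []
    for core in SD_CORES:
        i = seq.find(core)
        while i != -1:
            gap = start_pos - (i + len(core))
            if min_gap <= gap <= max_gap:
                hits.append((core, gap))
            i = seq.find(core, i + 1)  # continue, allowing overlapping hits
    return hits


print(find_sd_core("TTAGGAGGTTTTTTATGAAA", start_pos=14))
# [('GAGG', 6), ('GGAG', 7), ('AGGA', 8)]
```

Because the three cores overlap within a longer AGGAGG stretch, one upstream element can produce several hits; a caller that only needs presence or absence can test whether the returned list is non-empty.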

7.
We developed a computer program, GeneHackerTL, which predicts the most probable translation initiation site for a given nucleotide sequence. The program requires that information be extracted from the nucleotide sequence data surrounding the translation initiation sites according to the framework of the Hidden Markov Model. Since the translation initiation sites of 72 highly abundant proteins have already been assigned on the genome of Synechocystis sp. strain PCC6803 by amino-terminal analysis, we extracted necessary information for GeneHackerTL from the nucleotide sequence data. The prediction rate of GeneHackerTL for these proteins was estimated to be 86.1%. We then used GeneHackerTL for prediction of the translation initiation sites of 24 other proteins, of which the initiation sites were not assigned experimentally, because of the lack of a potential initiation codon at the amino-terminal position. For 20 out of the 24 proteins, the initiation sites were predicted upstream of their amino-terminal positions. According to this assignment, the processed regions represent a typical feature of signal peptides. We could also predict multiple translation initiation sites for a particular gene for which at least two initiation sites were experimentally detected. This program would be effective for the prediction of translation initiation sites of other proteins, not only in this species but also in other prokaryotes.

8.
9.
We constructed 34 types of human "full-length enriched" and "5'-end enriched" cDNA libraries based on the "Oligo-Capping" method. We randomly picked and sequenced 10,000 clones from these libraries. BLAST analysis showed that about 50% of the cDNAs were identical to known genes. Among them, we selected 954 species of cDNA that should represent the entire sequence from the mRNA start sites. Compared with previously reported sequences, they were on average 45 bp longer in the 5'-end. Using these cDNA data, we statistically analyzed the sequence features of the 5'UTR. The average length of the 5'UTR was 125 bp, and there was little correlation with the corresponding mRNA length (correlation coefficient = 0.26). Of the 954 species of 5'UTR, 459 contained no in-frame terminator codon, contrary to common belief. Two hundred seventy-eight species contained at least one ATG codon upstream of the initiator ATG codon. We identified 569 upstream ATGs in total, 63% of which adequately satisfied Kozak's criteria. These findings are contrary to the typical translation initiation model, which states that translation is initiated from the "first" ATG codon.

10.
Vagner S  Galy B  Pyronnet S 《EMBO reports》2001,2(10):893-898
Studies on the control of eukaryotic translation initiation by a cap-independent recruitment of the 40S ribosomal subunit to internal messenger RNA sequences called internal ribosome entry sites (IRESs) have shown that these sequence elements are present in a growing list of viral and cellular RNAs. Here we discuss their prevalence, mechanisms whereby they may function and their uses in regulating gene expression.

11.
5' untranslated leaders (5' UTLs) are suggested to play a crucial role in the selective translation of their eukaryotic mRNAs encoding heat shock proteins (HSP) during heat stress conditions. However, the structural features of the HSP mRNAs which cause this effect are mostly unknown. We have compiled the 5' UTLs from about 140 eukaryotic HSP mRNAs including vertebrates, invertebrates, higher and lower plants. A detailed analysis of these sequences according to length, A+T content, context of functional ATGs and presence of upstream non-functional ATGs was made. We observed that all these features were similar to the earlier studies in the literature based on data from HSP as well as non-HSP mRNAs. These observations were reconfirmed by intra-specific comparison of 5' UTLs from HSP and non-HSP genes. Similar to the translation element involved in the selective translation of mRNAs in polioviruses, a search for a short sequence motif complementary to highly conserved 18S rRNA was performed using a HSP mRNA database. The majority of the HSP mRNA sequences (77%) contained one or more small sequence motifs suggesting that they may function as internal ribosome entry sites for selective initiation of translation during heat stress.

12.
To characterize the sequence features surrounding the translation initiation sites on the genome of Synechocystis sp. strain 6803, the total proteins extracted from the cell were resolved by two-dimensional electrophoresis, and the amino-terminal sequences of the relatively abundant protein spots were determined. By comparison of the determined amino-terminal sequences with the nucleotide sequence of the entire genome, the translation initiation sites of a total of 72 proteins were successfully assigned on the genome. The sequence features that emerged from the nucleotide sequences at and surrounding the translation initiation sites were as follows: (1) in addition to the three initiation codons, ATG, GTG, and TTG, evidence was obtained that ATT was also used as a rare initiation codon; (2) the core sequences (GAGG, GGAG and AGGA) of the Shine-Dalgarno sequence were identified in the appropriate position preceding 35 of the initiation sites (48.6%); and (3) the preferential sequence surrounding the initiation codons was formulated as 5'-YY[· · ·]R-3', where Y and R denote pyrimidine and purine nucleotides, respectively, and the three dots represent the initiation codon. The results obtained provide valuable information for improving gene-finding software, and the approach used in this study should be applicable to comprehensive analysis of the expression profiles of cellular proteins.
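
The preferred context 5'-YY[start codon]R-3' reported above can be written as a regular expression. This is a sketch under assumptions: the bracketed dots are taken to stand for any of the four initiation codons named in the abstract (ATG, GTG, TTG, ATT), and the function name is hypothetical.

```python
import re

# Two pyrimidines, one of the four reported initiation codons, one purine.
CONTEXT = re.compile(r"[CT][CT](ATG|GTG|TTG|ATT)[AG]")


def matches_preferred_context(seq, start_pos):
    """True if the codon at index `start_pos` sits in a YY[codon]R context."""
    seq = seq.upper().replace("U", "T")
    if start_pos < 2:
        return False  # not enough upstream sequence to check
    window = seq[start_pos - 2:start_pos + 4]
    return CONTEXT.fullmatch(window) is not None


print(matches_preferred_context("CCATGA", 2))  # True
print(matches_preferred_context("GGATGA", 2))  # False
```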

13.
Integrating information in the molecular biosciences involves more than the cross-referencing of sequences or structures. Experimental protocols, results of computational analyses, annotations and links to relevant literature form integral parts of this information, and impart meaning to sequence or structure. In this review, we examine some existing approaches to integrating information in the molecular biosciences. We consider not only technical issues concerning the integration of heterogeneous data sources and the corresponding semantic implications, but also the integration of analytical results. Within the broad range of strategies for integration of data and information, we distinguish between platforms and developments. We discuss two current platforms and six current developments, and identify what we believe to be their strengths and limitations. We identify key unsolved problems in integrating information in the molecular biosciences, and discuss possible strategies for addressing them including semantic integration using ontologies, XML as a data model, and graphical user interfaces as integrative environments.

14.
Recombinant protein production is a key process in generating proteins of interest in the pharmaceutical industry and biomedical research. However, about 50% of recombinant proteins fail to be expressed in a variety of host cells. Here we show that the accessibility of translation initiation sites modelled using the mRNA base-unpairing across the Boltzmann’s ensemble significantly outperforms alternative features. This approach accurately predicts the successes or failures of expression experiments, which utilised Escherichia coli cells to express 11,430 recombinant proteins from over 189 diverse species. On this basis, we develop TIsigner that uses simulated annealing to modify up to the first nine codons of mRNAs with synonymous substitutions. We show that accessibility captures the key propensity beyond the target region (initiation sites in this case), as a modest number of synonymous changes is sufficient to tune the recombinant protein expression levels. We build a stochastic simulation model and show that higher accessibility leads to higher protein production and slower cell growth, supporting the idea of protein cost, where cell growth is constrained by protein circuits during overexpression.
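
The general idea of simulated annealing over synonymous substitutions in the first nine codons can be sketched as follows. This is not TIsigner's code: `toy_cost` is a hypothetical placeholder (GC count in the edited region) standing in for an accessibility score, and the cooling schedule, step count, and the decision to keep the start codon fixed are all assumptions.

```python
import math
import random

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}
SYNONYMS = {}
for codon, aa in CODON_TABLE.items():
    SYNONYMS.setdefault(aa, []).append(codon)


def translate(cds):
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def toy_cost(cds, n_codons=9):
    """Placeholder objective (an assumption): GC count in the first codons."""
    head = cds[:3 * n_codons]
    return head.count("G") + head.count("C")


def anneal_synonymous(cds, cost=toy_cost, n_codons=9, steps=500, t0=2.0, seed=0):
    """Minimise `cost` by synonymous changes in codons 2..n_codons (ACGT input)."""
    rng = random.Random(seed)
    cds = cds.upper()
    limit = min(n_codons, len(cds) // 3)
    if limit < 2:
        return cds
    codons = [cds[i:i + 3] for i in range(0, 3 * limit, 3)]
    rest = cds[3 * limit:]  # everything past the edited region is untouched
    current = best = cds
    for step in range(steps):
        temp = t0 * (1 - step / steps) + 1e-9  # linear cooling (assumed)
        i = rng.randrange(1, limit)            # index 0, the start codon, stays fixed
        trial = codons.copy()
        trial[i] = rng.choice(SYNONYMS[CODON_TABLE[codons[i]]])
        candidate = "".join(trial) + rest
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            codons, current = trial, candidate
            if cost(current) < cost(best):
                best = current
    return best


original = "ATGGCCGGCCGCGGGCCCGCGGGCGCCCTGCTGTAA"
optimised = anneal_synonymous(original)
assert translate(optimised) == translate(original)  # protein is unchanged
```

To approximate the published approach, `cost` would need to be replaced by a function that scores accessibility of the initiation region from predicted base-pairing; that function is outside the scope of this sketch.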

15.
With the rapid growth of DNA databases for human and other eukaryotic model organisms, a great number of genes need to be identified in these databases. Exact recognition of translation initiation sites (TISs) of eukaryotic genes is very important for understanding the translation initiation process, predicting the detailed structure of eukaryotic genes, and annotating uncharacterized sequences. The problem has not been solved satisfactorily, especially for recognizing TISs of eukaryotic genes with shorter first exons. Extracting new features and finding powerful new algorithms for recognizing TISs of eukaryotic genes is therefore an important task. In this paper, the important characteristics of shorter flanking fragments around TISs are extracted and an expectation-maximization (EM) algorithm based on incomplete data is used to recognize TISs of eukaryotic genes. The accuracy is up to 87.8% in a six-fold cross-validation test. The result shows that the identification variables are effectively extracted and that the EM algorithm is a powerful tool for predicting the TISs of eukaryotic genes. The algorithm can also be applied to other classification or clustering tasks in bioinformatics.

16.
17.
18.
The initiation of translation is a fundamental and highly regulated process in gene expression. Translation initiation in prokaryotic systems usually requires interaction between the ribosome and an mRNA sequence upstream of the initiation codon, the so-called ribosome-binding site (Shine-Dalgarno sequence). However, a large number of genes do not possess Shine-Dalgarno sequences, and it is unknown how start codon recognition occurs in these mRNAs. We have performed genome-wide searches in various groups of prokaryotes in order to identify sequence elements and/or RNA secondary structural motifs that could mediate translation initiation in mRNAs lacking Shine-Dalgarno sequences. We find that mRNAs without a Shine-Dalgarno sequence are generally less structured in their translation initiation region and show a minimum of mRNA folding at the start codon. Using reporter gene constructs in bacteria, we also provide experimental support for local RNA unfoldedness determining start codon recognition in Shine-Dalgarno-independent translation. Consistent with this, we show that AUG start codons reside in single-stranded regions, whereas internal AUG codons are usually in structured regions of the mRNA. Taken together, our bioinformatics analyses and experimental data suggest that local absence of RNA secondary structure is necessary and sufficient to initiate Shine-Dalgarno-independent translation. Thus, our results provide a plausible mechanism for how the correct translation initiation site is recognized in the absence of a ribosome-binding site.

19.
The translation of poliovirus RNA in rabbit reticulocyte lysate was examined. Translation of poliovirus RNA in this cell-free system resulted in an electrophoretic profile of poliovirus-specific proteins distinct from that observed in vivo or after translation in poliovirus-infected HeLa cell extract. A group of proteins derived from the P3 region of the polyprotein was identified by immunoprecipitation, time course, and N-formyl-[35S]methionine labeling studies to be the product of the initiation of protein synthesis at an internal site(s) located within the 3'-proximal RNA sequences. Utilization of this internal initiation site(s) on poliovirus RNA was abolished when reticulocyte lysate was supplemented with poliovirus-infected HeLa cell extract. Authentic P1-1a was also synthesized in reticulocyte lysate, indicating that correct 5'-proximal initiation of translation occurs in that system. We conclude that the deficiency of a component(s) of the reticulocyte lysate necessary for 5'-proximal initiation of poliovirus protein synthesis resulted in the ability of ribosomes to initiate translation on internal sequences. This aberrant initiation could be corrected by factors present in the HeLa cell extract. Apparently, under certain conditions, ribosomes are capable of recognizing internal sequences as authentic initiation sites.

20.
The previously presented consensus sequence for eukaryotic translation initiation sites by Kozak was derived substantially from vertebrate mRNA sequences. Drosophila nuclear genes exhibit a significantly different translation start consensus sequence. These differences probably do not represent mechanistic differences in translation initiation inasmuch as both taxa exhibit identical preferences and restrictions at the crucial -3 position. Using more conservative criteria for the assignment of consensus the following consensus sequences were derived: vertebrate, CANCAUG; Drosophila, CAAAACAUG.
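
As a small illustration, the two consensus sequences above can be compared against the bases ending at a candidate AUG. This is a sketch only: exact matching with `N` as a wildcard is an assumed simplification, since a consensus summarises position-wise preferences rather than a strict motif, and the function name is hypothetical.

```python
CONSENSUS = {"vertebrate": "CANCAUG", "drosophila": "CAAAACAUG"}


def matches_consensus(mrna, aug_pos, taxon):
    """True if the bases ending at the AUG equal the consensus ('N' = any base)."""
    pattern = CONSENSUS[taxon]
    start = aug_pos + 3 - len(pattern)
    if start < 0:
        return False  # not enough upstream sequence
    window = mrna.upper().replace("T", "U")[start:aug_pos + 3]
    return len(window) == len(pattern) and all(
        p in ("N", b) for p, b in zip(pattern, window))


print(matches_consensus("GGCAGCAUGG", 6, "vertebrate"))  # True
print(matches_consensus("GGCAGCAUGG", 6, "drosophila"))  # False
```

In both consensus strings the base at position -3 (three bases before the A of AUG) is A, which matches the abstract's remark that the two taxa share the same preference at that position.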
