Similar literature
 20 similar articles found (search time: 31 ms)
1.
The performance of computational tools that can predict human splice sites is reviewed using a test set of EST-confirmed splice sites. The programs (namely HMMgene, NetGene2, HSPL, NNSPLICE, SpliceView and GeneID-3) differ from one another in the degree of discriminatory information used for prediction. The results indicate that, as expected, HMMgene and NetGene2 (which use global as well as local coding information and splice signals) followed by HSPL (which uses local coding information and splice signals) performed better than the other three programs (which use only splice signals). For the former three programs, one in every three false positive splice sites was predicted in the vicinity of true splice sites while only one in every 12 was expected to occur in such a region by chance. The persistence of this observation for programs (namely FEXH, GRAIL2, MZEF, GeneID-3, HMMgene and GENSCAN) that can predict all the potential exons (including optimal and sub-optimal) was assessed. In a high proportion (>50%) of the partially correct predicted exons, the incorrect exon ends were located in the vicinity of the real splice sites. Analysis of the distribution of proximal false positives indicated that the splice signals used by the algorithms are not strong enough to discriminate, in particular, those false predictions that occur within ±25 nt of the real sites. It is therefore suggested that specialised statistics that can discriminate real splice sites from proximal false positives be incorporated in gene prediction programs.

2.
We have developed a computer program which predicts internal exons from naive genomic sequence data and which will run on any IBM-compatible 80286 (or higher) computer. The algorithm searches a sequence for 'spliceable open reading frames' (SORFs), which are open reading frames bracketed by suitable splice-recognition sequences, and then analyzes the region for codon usage. Potential exons are stratified according to the reliability of their prediction, from confidence levels 1 to 5. The program is designed to predict internal exons of length greater than 60 nucleotides. In an analysis of 116 genes of a training set, 384 out of 441 such exons (87.1%) are identified, with 280 (63.5%) of predictions matching the true exon exactly (at both 5' and 3' splice junctions and in the correct reading frame), and with 104 (23.6%) exons matching partially. In a similar analysis of 14 genes in a test set unrelated to the genes used to generate the parameters of the program, 70 out of 80 internal exons greater than 60 bp in length are identified (87.5%), with 47 completely and 23 partially matched. SORFs that partially match true internal exons share at least one splice junction with the exon, or share both splice junctions but are interpreted in an incorrect reading frame. Specificity (the percentage of SORFs that correspond to true exons) varies from 91% at confidence level 1 to 16% at confidence level 5, with an overall specificity of 35-40%. The output displays nucleotide position, confidence level, reading frame phase at the 5' and 3' ends, acceptor and donor sequences and scoring statistics and also gives an amino acid translation of the potential exon. SORFIND compares favourably with other programs currently used to predict protein-coding regions.
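
The SORF idea described above can be illustrated with a minimal sketch. This is not the SORFIND implementation: the function names are hypothetical, the splice-recognition sequences are reduced to the bare AG (acceptor) and GT (donor) dinucleotides as a simplifying assumption, and the codon-usage analysis and confidence levels are omitted.

```python
# Hypothetical sketch of a "spliceable open reading frame" (SORF) scan.
# Assumption: an acceptor is any "AG" immediately upstream of the candidate
# exon and a donor is any "GT" immediately downstream of it.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_open_frame(segment: str) -> bool:
    """Return True if at least one of the three reading frames has no stop codon."""
    for frame in range(3):
        codons = (segment[i:i + 3] for i in range(frame, len(segment) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return True
    return False


def find_sorfs(seq: str, min_len: int = 61):
    """Return (start, end) half-open intervals of candidate internal exons.

    min_len=61 mirrors the abstract's "length greater than 60 nucleotides".
    """
    seq = seq.upper()
    # Candidate exon starts: the position right after an "AG".
    acceptors = [i + 2 for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    # Candidate exon ends: the position of a "GT".
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    hits = []
    for start in acceptors:
        for end in donors:
            if end - start >= min_len and has_open_frame(seq[start:end]):
                hits.append((start, end))
    return hits
```

The exhaustive pairing of acceptors and donors is quadratic and is chosen only for clarity; a real exon finder would score the splice signals and filter candidates before pairing them.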

3.
Piva F, Principato G. Gene, 2007, 393(1-2): 81-86
There is ample evidence that prediction of human splice sites can be refined by analyzing the nucleotides surrounding splice sites. This could mean that exon nucleotides over splice sites harbour information for the splicing process in addition to the coding information that specifies amino acids. We analyzed the correlations among the nucleotides lying at the end and at the beginning of all the consecutive human exons to seek relationships among the nucleotides. We have divided the sequences taking into account the phase of interruption. Even though exon sequences are involved in the coding function, we found phase-dependent, specific correlations in the area of exon junctions. These regularities do not give rise to specific motifs, but rather to a phase-specific nucleotide context that could contribute to defining the splice site or aid the splicing machinery in joining the exon ends. The results provide further evidence that accurate selection of human splice sites likely requires the contribution of exon regulatory sequences.

4.
Prediction of human mRNA donor and acceptor sites from the DNA sequence   (total citations: 40; self-citations: 0; citations by others: 40)
Artificial neural networks have been applied to the prediction of splice site location in human pre-mRNA. A joint prediction scheme where prediction of transition regions between introns and exons regulates a cutoff level for splice site assignment was able to predict splice site locations with confidence levels far better than previously reported in the literature. The problem of predicting donor and acceptor sites in human genes is hampered by the presence of large numbers of false positives: here, the distribution of these false splice sites is examined and linked to a possible scenario for the splicing mechanism in vivo. When the presented method detects 95% of the true donor and acceptor sites, it makes less than 0.1% false donor site assignments and less than 0.4% false acceptor site assignments. For the large data set used in this study, this means that on average there are one and a half false donor sites per true donor site and six false acceptor sites per true acceptor site. With the joint assignment method, more than a fifth of the true donor sites and around one fourth of the true acceptor sites could be detected without accompaniment of any false positive predictions. Highly confident splice sites could not be isolated with a widely used weight matrix method or by separate splice site networks. A complementary relation between the confidence levels of the coding/non-coding and the separate splice site networks was observed, with many weak splice sites having sharp transitions in the coding/non-coding signal and many stronger splice sites having more ill-defined transitions between coding and non-coding.

5.
Artificial neural networks have been combined with a rule based system to predict intron splice sites in the dicot plant Arabidopsis thaliana. A two step prediction scheme, where a global prediction of the coding potential regulates a cutoff level for a local prediction of splice sites, is refined by rules based on splice site confidence values, prediction scores, coding context and distances between potential splice sites. In this approach, the predictions of splice sites mutually affect each other in a non-local manner. The combined approach drastically reduces the large number of false positive splice sites normally haunting splice site prediction. An analysis of the errors made by the networks in the first step of the method revealed a previously unknown feature, a frequent T-tract prolongation containing cryptic acceptor sites in the 5' end of exons. The method presented here has been compared with three other approaches, GeneFinder, Gene-Mark and Grail. Overall the method presented here is an order of magnitude better. We show that the new method is able to find a donor site in the coding sequence for the jellyfish Green Fluorescent Protein, exactly at the position that was experimentally observed in A. thaliana transformants. Predictions for alternatively spliced genes are also presented, together with examples of genes from other dicots, monocots and algae. The method has been made available through electronic mail (NetPlantGene@cbs.dtu.dk), or the WWW at http://www.cbs.dtu.dk/NetPlantGene.html

6.
An approach of encoding for prediction of splice sites using SVM   (total citations: 1; self-citations: 0; citations by others: 1)
Huang J, Li T, Chen K, Wu J. Biochimie, 2006, 88(7): 923-929
In splice site prediction, accuracy remains below 90% even though the sequences adjacent to splice sites are highly conserved. To improve prediction accuracy, much attention has been paid to improving the algorithms used, and little to the more fundamental issue of nucleotide encoding. In this paper, a predictor based on support vector machines (SVM) is constructed to distinguish true from false splice sites in higher eukaryotes. Four types of encoding were applied to generate the input for the SVM: mono-nucleotide (MN) encoding, MN with frequency difference between the true sites and false sites (FDTF) encoding, pair-wise nucleotides (PN) encoding, and PN with FDTF encoding. The results showed that PN with FDTF encoding as input to the SVM led to the most reliable recognition of splice sites: the accuracies for predicting true donor sites and false sites were 96.3% and 93.7%, respectively, and the accuracies for predicting true acceptor sites and false sites were 94.0% and 93.2%, respectively.
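
To make the encoding question concrete, the sketch below shows one plausible reading of the two base encodings. It is an illustration under assumptions, not the paper's exact scheme: MN is taken to be a one-hot vector of 4 values per position, PN a one-hot vector of 16 values per pair of adjacent nucleotides, and the FDTF weighting is not implemented because the abstract does not define it. The function names are hypothetical.

```python
BASES = "ACGT"
# All 16 ordered dinucleotides, in a fixed order: AA, AC, AG, AT, CA, ...
PAIRS = [a + b for a in BASES for b in BASES]


def mono_nucleotide_encoding(seq: str) -> list:
    """One-hot encode each nucleotide (assumed MN encoding): 4 values per position."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def pairwise_nucleotide_encoding(seq: str) -> list:
    """One-hot encode each adjacent pair (assumed PN encoding): 16 values per pair."""
    seq = seq.upper()
    vec = []
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        vec.extend(1.0 if pair == p else 0.0 for p in PAIRS)
    return vec
```

Under these assumptions a window of n nucleotides yields 4n features with MN and 16(n - 1) features with PN, so the pair-wise form lets a linear classifier see dependencies between neighbouring positions directly.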

7.
Wu Y, Zhang Y, Zhang J. Genomics, 2005, 86(3): 329-336
Ab initio prediction of functional exon splicing enhancer (ESE) elements based on RNA sequences presents a challenge in the evaluation of the functional impacts of human genetic polymorphisms on splicing. To better understand the behavior of ESEs, we studied their distribution in human exons and introns for four known SR protein-binding motifs: SF2/ASF, SC35, SRp40, and SRp55. ESEs are enriched in regions in exons that are close to the splice sites, especially in the region 80 to 120 bases away from the ends of splice acceptor sites. Significant enrichment of ESEs is associated with weak splice acceptor sites but not weak donor sites. ESE density decreases at the 3' ends of long exons. ESEs are also enriched in introns with weak donor or acceptor sites. These characteristics of ESEs may help to predict functional ESE sites in RNA sequences.

8.
Locating protein coding regions in genomic DNA is a critical step in accessing the information generated by large scale sequencing projects. Current methods for gene detection depend on statistical measures of content differences between coding and noncoding DNA in addition to the recognition of promoters, splice sites, and other regulatory sites. Here we explore the potential value of recurrent amino acid sequence patterns 3-19 amino acids in length as a content statistic for use in gene finding approaches. A finite mixture model incorporating these patterns can partially discriminate protein sequences which have no (detectable) known homologs from randomized versions of these sequences, and from short (≤ 50 amino acids) non-coding segments extracted from the S. cerevisiae genome. The mixture model derived scores for a collection of human exons were not correlated with the GENSCAN scores, suggesting that the addition of our protein pattern recognition module to current gene recognition programs may improve their performance.

9.
Automatic annotation of eukaryotic genes, pseudogenes and promoters   (total citations: 1; self-citations: 0; citations by others: 1)

10.
Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5' untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to 'pure' UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by 'coding' noise, thus enhancing significantly the prediction of 5' UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2-3-fold better compared with NetGene2 and GenScan in 5' UTRs. We also tested the 5' UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR.

11.
We introduce a new system, called shortHMM, which predicts individual exons using two related genomes. In this system, we build a hidden semi-Markov model to identify exons. In the hidden Markov model, we propose joint probability models of nucleotides in introns, splice sites, 5'UTR, 3'UTR, and intergenic regions by exploiting the homology between related genomes. In order to reduce the false positive rate of the hidden Markov model, we develop a screening process which is able to identify intergenic regions. We then build a classifier by combining the statistics from the hidden Markov model and the screening process. We implement shortHMM on human-mouse sequence alignments. The source code is available at < www.stat.purdue.edu/ jingwu/hmm >. Compared to TWINSCAN and SLAM, shortHMM is substantially more powerful in identifying AT-rich RefSeq exons (8% more AT-rich RefSeq exons were predicted), as well as slightly more powerful in identifying RefSeq exons (3-10% more RefSeq exons were predicted), at a similar or lower false positive rate, with less computing time and with less memory usage. Finally, shortHMM is also capable of finding new potential exons.

12.
Zhang L, Luo L. Nucleic Acids Research, 2003, 31(21): 6214-6220
Based on the conservation of nucleotides at splicing sites and the features of base composition and base correlation around these sites, we use the method of increment of diversity combined with quadratic discriminant analysis (IDQD) to study the dependence structure of splicing sites and predict the exons/introns and their boundaries for four model genomes: Caenorhabditis elegans, Arabidopsis thaliana, Drosophila melanogaster and human. The comparison of compositional features between two sequences and the comparison of base dependencies at adjacent or non-adjacent positions of two sequences can be integrated automatically in the increment of diversity (ID). Eight feature variables around a potential splice site are defined in terms of ID. They are integrated in a single formal framework given by IDQD. In our calculations, regions of 7 (8) bases around the donor (acceptor) sites have been considered in studying the conservation of nucleotides, and sequences of 48 bp on either side of splice sites have been used in studying the compositional and base-correlating features. The windows are enlarged to 16 (donor), 29 (acceptor) and 80 bp (either side) to improve the prediction for human splice sites. The prediction capability of the present method is comparable with that of the leading splice site detector, GeneSplicer.
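
The abstract does not state the formula for the increment of diversity, so the sketch below assumes the commonly used definition: for a source X with category counts n_i and total N, the diversity is D(X) = N ln N - Σ n_i ln n_i, and ID(X, Y) = D(X + Y) - D(X) - D(Y), where X + Y adds the counts category by category. The function names are hypothetical, and the quadratic discriminant step of IDQD is not shown.

```python
import math
from collections import Counter


def diversity(counts) -> float:
    """Diversity measure D(X) = N ln N - sum(n_i ln n_i) (assumed definition)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(
        n * math.log(n) for n in counts.values() if n > 0
    )


def increment_of_diversity(x, y) -> float:
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); small when the two count profiles are similar."""
    merged = Counter(x) + Counter(y)
    return diversity(merged) - diversity(x) - diversity(y)


# Usage: compare the base composition of a candidate window with a reference profile.
reference = Counter("GTAAGTGTGAGT")
candidate = Counter("GTAAGT")
similarity_gap = increment_of_diversity(reference, candidate)
```

Under this definition ID is zero when the two count profiles are proportional and grows as they diverge, which is why it can serve as a feature variable for a discriminant analysis.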

13.
A systematic analysis of the RNA splice junction sequences of eukaryotic protein coding genes was carried out using the GENBANK databank. Nucleotide frequencies obtained for the highly conserved regions around the splice sites for different categories of organisms closely agree with each other. A striking similarity among the rare splice junctions which do not contain AG at the 3' splice site or GT at the 5' splice site indicates the existence of special mechanisms to recognize them, and that these unique signals may be involved in crucial gene-regulation events and in differentiation. A method was developed to predict potential exons in a bare sequence, using a scoring and ranking scheme based on nucleotide weight tables. This method was used to find a majority of the exons in selected known genes, and also predicted potential new exons which may be used in alternative splicing situations.
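
A scoring and ranking scheme based on nucleotide weight tables can be sketched as follows. This is a generic log-odds weight matrix, not the tables from the paper: the pseudocount, the uniform background frequency of 0.25, the toy training sites and all function names are assumptions made for illustration, and the input is assumed to contain only uppercase A, C, G and T.

```python
import math
from collections import Counter


def build_weight_table(sites, pseudocount: float = 1.0):
    """Build per-position log-odds weights from aligned, equal-length site sequences."""
    total = len(sites) + 4 * pseudocount
    table = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        # Log-odds of the observed frequency against a uniform 0.25 background.
        table.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return table


def score(table, window: str) -> float:
    """Sum the position-specific weights for one candidate window."""
    return sum(column[base] for column, base in zip(table, window))


def rank_candidates(table, seq: str):
    """Score every window in seq and return (score, position) pairs, best first."""
    width = len(table)
    scored = [
        (score(table, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    ]
    return sorted(scored, reverse=True)


# Usage with toy donor-like training sites (made up for this example).
donor_table = build_weight_table(["AGGTAAGT", "AGGTGAGT", "CGGTAAGT"])
ranking = rank_candidates(donor_table, "CCCCAGGTAAGTCCCC")
```

Ranking all windows rather than applying a fixed cutoff matches the abstract's description of a scoring and ranking scheme; pairing high-ranked acceptor and donor candidates into exons would be a further step that is not shown here.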

14.
Branch point selection in alternative splicing of tropomyosin pre-mRNAs.   (total citations: 21; self-citations: 7; citations by others: 14)
The rat tropomyosin 1 gene gives rise to two mRNAs encoding rat fibroblast TM-1 and skeletal muscle beta-tropomyosin via an alternative splicing mechanism. The gene comprises 11 exons. Exons 1 through 5 and exons 8 and 9 are common to all mRNAs expressed from this gene. Exons 6 and 11 are used in fibroblasts as well as smooth muscle whereas exons 7 and 10 are used exclusively in skeletal muscle. In the present studies we have focused on the mutually exclusive internal alternative splice choice involving exon 6 (fibroblast-type splice) and exon 7 (skeletal muscle-type splice). To study the mechanism and regulation of alternative splice site selection we have characterized the branch points used in processing of the tropomyosin pre-mRNAs in vitro using nuclear extracts obtained from HeLa cells. Splicing of exon 5 to exon 6 (fibroblast-type splice) involves the use of three branch points located 25, 29, and 36 nucleotides upstream of the 3' splice site of exon 6. Splicing of exon 6 (fibroblast-type splice) or exon 7 (skeletal muscle-type splice) to exon 8 involves the use of the same branch point located 24 nucleotides upstream of this shared 3' splice site. In contrast, the splicing of exon 5 to exon 7 (skeletal muscle-type splice) involves the use of three branch sites located 144, 147 and 153 nucleotides upstream of the 3' splice site of exon 7. In addition, the pyrimidine content of the region between these unusual branch points and the 3' splice site of exon 7 was found to be greater than 80%. These studies raise the possibility that the use of branch points located a long distance from a 3' splice site may be an essential feature of some alternatively spliced exons. The possible significance of these unusual branch points as well as a role for the polypyrimidine stretch in intron 6 in splice site selection are discussed.

15.
16.
The nucleotide sequence for an unusual, cloned human adenosine deaminase cDNA has been determined. Contained within a sequence of 1535 nucleotides is a coding sequence of 1089 nucleotides that encodes a protein of 40,762 daltons. The coding sequence is interrupted by a non-coding region containing 76 nucleotides. Both the 3' and 5' ends of this region have consensus sequences generally associated with splice sites. The 3' untranslated sequence contained 308 nucleotides, including a polyadenylation signal sequence 20 nucleotides from the end. The cloned cDNA appears to correspond to a nuclear mRNA precursor which contains a small intron.

17.
Alternative 3' and 5' splice site (ss) events constitute a significant part of all alternative splicing events. These events were also found to be related to several aberrant splicing diseases. However, only few of the characteristics that distinguish these events from alternative cassette exons are known currently. In this study, we compared the characteristics of constitutive exons, alternative cassette exons, and alternative 3'ss and 5'ss exons. The results revealed that alternative 3'ss and 5'ss exons are an intermediate state between constitutive and alternative cassette exons, where the constitutive side resembles constitutive exons, and the alternative side resembles alternative cassette exons. The results also show that alternative 3'ss and 5'ss exons exhibit low levels of symmetry (frame-preserving), similar to constitutive exons, whereas the sequence between the two alternative splice sites shows high symmetry levels, similar to alternative cassette exons. In addition, flanking intronic conservation analysis revealed that exons whose alternative splice sites are at least nine nucleotides apart show a high conservation level, indicating intronic participation in the regulation of their splicing, whereas exons whose alternative splice sites are fewer than nine nucleotides apart show a low conservation level. Further examination of these exons, spanning seven vertebrate species, suggests an evolutionary model in which the alternative state is a derivative of an ancestral constitutive exon, where a mutation inside the exon or along the flanking intron resulted in the creation of a new splice site that competes with the original one, leading to alternative splice site selection. This model was validated experimentally on four exons, showing that they indeed originated from constitutive exons that acquired a new competing splice site during evolution.
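
The notion of symmetry used above (a frame-preserving length) reduces to a divisibility check, sketched below. The function names and the coordinate convention are hypothetical: a segment is treated as symmetric when its length is a multiple of three, and the segment between two alternative splice sites is measured as the absolute difference of their positions.

```python
def is_frame_preserving(length: int) -> bool:
    """A segment is 'symmetric' if including or skipping it keeps the reading frame."""
    return length % 3 == 0


def alternative_extension_symmetric(site_a: int, site_b: int) -> bool:
    """Check the segment between two alternative splice sites (positions in nt)."""
    return is_frame_preserving(abs(site_a - site_b))


def fraction_symmetric(lengths) -> float:
    """Fraction of segments in a collection whose length is a multiple of three."""
    lengths = list(lengths)
    if not lengths:
        return 0.0
    return sum(is_frame_preserving(n) for n in lengths) / len(lengths)
```

For reference, lengths drawn uniformly at random would give a fraction of about one third, so a "high symmetry level" in this sense means a fraction well above that baseline.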

18.
This study was designed to determine the structure of the gene for glycoprotein (GP) GPIIIa, the beta-subunit of the platelet membrane GPIIb-IIIa complex. The complexity of the gene was determined after Southern analysis of human chromosomal DNA. Overlapping genomic clones were isolated from cosmid and phage lambda libraries that contained the entire coding unit of the human gene for the mature GPIIIa protein. The genomic clones spanned approximately 60 kilobase pairs of human DNA sequence. The exon-containing segments of the clones were mapped and the exons, including the exon-intron junctions, were sequenced. The GPIIIa protein is divided into 14 exons ranging in size from 87 to 430 nucleotides separated by introns, which were 0.3 to 9 kilobase pairs in size. The 3' exon was larger than 1700 nucleotides and contained the 3'-untranslated region. Several structural domains of the GPIIIa protein were contained within individual exons. These included (i) the transmembrane spanning segment, (ii) the cytoplasmic region containing the potential phosphorylation sites, and (iii) the six domains in the NH2-terminal half of GPIIIa that are highly conserved between two other integrin beta-subunits. In contrast, other domains such as the four cysteine-rich repeats were interrupted by introns. Genomic clones for the beta-subunit of the fibronectin receptor (beta 1) were also isolated, partially mapped, and sequenced. Of the eight splice sites identified in beta 1, six occurred at the same amino acid residue in GPIIIa. These results provide genetic evidence that GPIIIa and beta 1 have a common evolutionary origin within the integrin family.

19.
Prediction of splice site selection and efficiency from sequence inspection is of fundamental interest (testing the current knowledge of requisite sequence features) and practical importance (genome annotation, design of mutant or transgenic organisms). In plants, the dominant variables affecting splice site selection and efficiency include the degree of matching to the extended splice site consensus and the local gradient of U- and G+C-composition (introns being U-rich and exons G+C-rich). We present a novel method for splice site prediction, which was trained specifically for maize and Arabidopsis thaliana. The method extends our previous algorithm based on logitlinear models by considering three variables simultaneously: intrinsic splice site strength, local optimality and fit with respect to the overall splice pattern prediction. We show that the method considerably improves prediction specificity without compromising the high degree of sensitivity required in gene prediction algorithms. Applications to gene identification are illustrated for Arabidopsis and suggest that successful methods must combine scoring for splice sites, coding potential and similarity with potential homologs in non-trivial ways. A WWW version of the SplicePredictor program is available at http:/gnomic.stanford.edu/volker/SplicePredi ctor.html/

20.