Found 20 similar articles (search time: 15 ms)
1.
A set of 43 337 splice junction pairs was extracted from mammalian GenBank annotated genes. Expressed sequence tag (EST) sequences support 22 489 of them. Of these, 98.71% contain the canonical dinucleotides GT and AG for donor and acceptor sites, respectively; 0.56% hold non-canonical GC-AG splice site pairs; and the remaining 0.73% occur in many small groups (with a maximum size of 0.05%). Studying these groups, we observe that many of them contain splicing dinucleotides shifted from the annotated splice junction by one position. After close examination of such cases we present a new classification consisting of only eight observed types of splice site pairs (out of 256 a priori possible combinations). EST alignments allow us to verify the exonic part of the splice sites, but many non-canonical cases may be due to intron sequencing errors. This idea is given substantial support when we compare the sequences of human genes having non-canonical splice sites deposited in GenBank by high throughput genome sequencing projects (HTG). A high proportion (156 out of 171) of the human non-canonical and EST-supported splice site sequences had a clear match in the human HTG. They can be classified after corrections as: 79 GC-AG pairs (of which one was an error that corrected to GC-AG), 61 errors that were corrected to GT-AG canonical pairs, six AT-AC pairs (of which two were errors that corrected to AT-AC), one case produced from a non-existent intron, seven cases found in HTG that were deposited to GenBank, and finally only two cases left of supported non-canonical splice sites. If we assume that approximately the same situation is true for the whole set of annotated mammalian non-canonical splice sites, then 99.24% of splice site pairs should be GT-AG, 0.69% GC-AG, 0.05% AT-AC and finally only 0.02% could consist of other types of non-canonical splice sites.
We analyze several characteristics of EST-verified splice sites and build weight matrices for the major groups, which can be incorporated into gene prediction programs. We also present a set of EST-verified canonical splice sites larger by two orders of magnitude than the current one (22 199 entries versus approximately 600) and finally, a set of 290 EST-supported non-canonical splice sites. Both sets should be significant for future investigations of the splicing mechanism.
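The dinucleotide classification and the weight matrices mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`classify_splice_pair`, `tally_percentages`, `weight_matrix`, `score`), the uniform background of 0.25 per base, and the pseudocount are all assumptions made for illustration.

```python
import math
from collections import Counter

MAJOR_PAIRS = {"GT-AG", "GC-AG", "AT-AC"}  # the three major groups named in the abstract


def classify_splice_pair(intron: str) -> str:
    """Label an intron by its first two (donor) and last two (acceptor) bases."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron must contain at least 4 nucleotides")
    pair = f"{seq[:2]}-{seq[-2:]}"
    return pair if pair in MAJOR_PAIRS else "other"


def tally_percentages(introns):
    """Return the percentage of introns falling into each class."""
    counts = Counter(classify_splice_pair(s) for s in introns)
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


def weight_matrix(sites, pseudocount=1.0):
    """Build a log-odds weight matrix from equal-length aligned sites.

    Assumption: uniform background frequency of 0.25 for each base.
    """
    matrix = []
    total = len(sites) + 4 * pseudocount
    for i in range(len(sites[0])):
        counts = Counter(s[i].upper() for s in sites)
        matrix.append({
            base: math.log2(((counts[base] + pseudocount) / total) / 0.25)
            for base in "ACGT"
        })
    return matrix


def score(matrix, seq):
    """Sum the per-position log-odds for a candidate site (A, C, G, T only)."""
    return sum(column[base] for column, base in zip(matrix, seq.upper()))
```

A matrix built from a handful of donor sites scores a donor-like hexamer higher than an unrelated one, which is the property a gene prediction program would rely on.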
2.
Characterization and prediction of alternative splice sites (cited by: 9; self-citations: 0; citations by others: 9)
Human alternative isoform, cryptic, skipped, and constitutive splice sites from the ALTEXTRON database were analysed regarding splice site strength, composition, GC content, position and binding site strength of polypyrimidine tract and branch site. Several features were identified which distinguish alternative isoform and cryptic splice sites, but not skipped splice sites, from constitutive ones. These include splice site strength, intron GC content, U2AF35 binding site score, and oligonucleotide frequencies. For the predictive classification of splice sites, pattern recognition models for different splicing factor binding sites and oligonucleotide frequency models (OFMs) were combined using backpropagation networks. 67.45% of acceptor sites and 71.23% of donor sites are correctly classified by networks trained for classification of constitutive and alternative isoform/cryptic splice sites. A web-application for the prediction of alternative splice sites is available at http://es.embnet.org/~mwang/assp.html .
3.
Group I introns have been engineered into trans-splicing ribozymes capable of replacing the 3'-terminal portion of an external mRNA with their own 3'-exon. Although this design makes trans-splicing ribozymes potentially useful for therapeutic application, their trans-splicing efficiency is usually too low for medical use. One factor that strongly influences trans-splicing efficiency is the position of the target splice site on the mRNA substrate. Viable splice sites are currently determined using a biochemical trans-tagging assay. Here, we propose a rapid and inexpensive alternative approach to identify efficient splice sites. This approach involves the computation of the binding free energies between ribozyme and mRNA substrate. We found that the computed binding free energies correlate well with the trans-splicing efficiency experimentally determined at 18 different splice sites on the mRNA of chloramphenicol acetyl transferase. In contrast, our results from the trans-tagging assay correlate less well with measured trans-splicing efficiency. The computed free energy components suggest that splice site efficiency depends on the following secondary structure rearrangements: hybridization of the ribozyme's internal guide sequence (IGS) with mRNA substrate (most important), unfolding of substrate proximal to the splice site, and release of the IGS from the 3'-exon (least important). The proposed computational approach can also be extended to fulfill additional design requirements of efficient trans-splicing ribozymes, such as the optimization of 3'-exon and extended guide sequences.
4.
An approach of encoding for prediction of splice sites using SVM (cited by: 1; self-citations: 0; citations by others: 1)
In splice site prediction, the accuracy is lower than 90% even though the sequences adjacent to the splice sites are highly conserved. In order to improve the prediction accuracy, much attention has been paid to improving the performance of the algorithms used, and little to the fundamental issue of nucleotide encoding. In this paper, a predictor is constructed to predict the true and false splice sites for higher eukaryotes based on support vector machines (SVM). Four types of encoding, namely mono-nucleotide (MN) encoding, MN with frequency difference between the true sites and false sites (FDTF) encoding, pair-wise nucleotides (PN) encoding and PN with FDTF encoding, were applied to generate the input for the SVM. The results showed that PN with FDTF encoding as input to the SVM led to the most reliable recognition of splice sites: the accuracy for the prediction of true donor sites and false sites was 96.3% and 93.7%, respectively, and the accuracy for the prediction of true acceptor sites and false sites was 94.0% and 93.2%, respectively.
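The difference between mono-nucleotide and pair-wise nucleotide encoding can be sketched as follows. This is an illustrative guess at the general idea, not the paper's code: it assumes that MN encoding is a 4-bit one-hot vector per position and that PN encoding is a 16-bit one-hot vector per pair of adjacent positions. The abstract does not define the FDTF weighting, so it is not shown. The names `encode_mono` and `encode_pairwise` are hypothetical.

```python
from itertools import product

BASES = "ACGT"
# All 16 ordered dinucleotides: AA, AC, AG, AT, CA, ...
PAIRS = ["".join(p) for p in product(BASES, repeat=2)]


def encode_mono(seq: str) -> list:
    """One-hot encode each nucleotide (4 features per position)."""
    vector = []
    for base in seq.upper():
        vector.extend(1 if base == b else 0 for b in BASES)
    return vector


def encode_pairwise(seq: str) -> list:
    """One-hot encode each adjacent dinucleotide (16 features per pair).

    Assumption: 'pair-wise' means overlapping neighbouring positions.
    """
    s = seq.upper()
    vector = []
    for i in range(len(s) - 1):
        pair = s[i:i + 2]
        vector.extend(1 if pair == p else 0 for p in PAIRS)
    return vector
```

Either vector could be passed as a feature row to an SVM implementation; the pair-wise form is longer but preserves dependencies between neighbouring positions that the mono-nucleotide form discards.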
5.
Background
Gene identification in genomic DNA sequences by computational methods has become an important task in bioinformatics, and computational gene prediction tools are now essential components of every genome sequencing project. Prediction of splice sites is a key step of all gene structural prediction algorithms.
6.
We present here a new algorithm for functional site analysis. It is based on four main assumptions: each variation of nucleotide composition makes a different contribution to the overall binding free energy of interaction between a functional site and another molecule; nonfunctioning site-like regions (pseudosites) are absent or rare in genomes; there may be errors in the sample of sites; and nucleotides of different site positions are considered to be mutually dependent. In this algorithm, the site set is divided into subsets, each described by a certain consensus. Donor splice sites of the human protein-coding genes were analyzed. Comparing the results with other methods of donor splice site prediction has demonstrated a more accurate prediction of consensus sequences AG/GU(A,G), G/GUnAG, /GU(A,G)AG, /GU(A,G)nGU, and G/GUA than is achieved by weight matrix and consensus (A,C)AG/GU(A,G)AGU with mismatches. The probability of the first type error, E1, for the obtained consensus set was about 0.05, and the probability of the second type error, E2, was 0.15. The analysis demonstrated that accuracy of the functional site prediction could be improved if one takes into account correlations between the site positions. The accuracy of prediction by using human consensus sequences was tested on sequences from different organisms. Some differences in consensus sequences for the plant Arabidopsis sp., the invertebrate Caenorhabditis sp., and the fungus Aspergillus sp. were revealed. For the yeast Saccharomyces sp. only one conserved consensus, /GUA(U,A,C)G(U,A,C), was revealed (E1 = 0.03, E2 = 0.03). Yeast is a very interesting model to use for analysis of molecular mechanisms of splicing.
Received: 14 October 1996 / Accepted: 30 January 1997
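The consensus notation used in this abstract can be read mechanically, as the following sketch suggests. It assumes that `/` marks the exon/intron boundary, that `(A,G)` means "A or G", and that `n` means any nucleotide; these readings are inferred from the notation and are not stated in the abstract. The helper name `consensus_to_regex` is hypothetical, and the sketch does not reproduce the authors' subset-partitioning algorithm.

```python
import re


def consensus_to_regex(consensus: str):
    """Translate a consensus such as 'AG/GU(A,G)' into a compiled regex.

    Assumptions: '/' is only a boundary marker, '(A,G)' lists alternatives,
    and 'n' stands for any of A, C, G, U.
    """
    pattern = consensus.replace("/", "")
    # Turn "(A,G)" into the character class "[AG]".
    pattern = re.sub(
        r"\(([ACGU,]+)\)",
        lambda m: "[" + m.group(1).replace(",", "") + "]",
        pattern,
    )
    pattern = pattern.replace("n", "[ACGU]")
    return re.compile(pattern)


def matches(consensus: str, rna: str) -> bool:
    """Return True if the RNA string matches the consensus exactly."""
    return consensus_to_regex(consensus).fullmatch(rna.upper()) is not None
```

Under these assumptions, `matches("AG/GU(A,G)", "AGGUA")` holds while `matches("AG/GU(A,G)", "AGGUC")` does not.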
7.
Positional characterisation of false positives from computational prediction of human splice sites (cited by: 3; self-citations: 2; citations by others: 1)
Thanaraj TA. Nucleic Acids Research, 2000, 28(3): 744-754
The performance of computational tools that can predict human splice sites is reviewed using a test set of EST-confirmed splice sites. The programs (namely HMMgene, NetGene2, HSPL, NNSPLICE, SpliceView and GeneID-3) differ from one another in the degree of discriminatory information used for prediction. The results indicate that, as expected, HMMgene and NetGene2 (which use global as well as local coding information and splice signals) followed by HSPL (which uses local coding information and splice signals) performed better than the other three programs (which use only splice signals). For the former three programs, one in every three false positive splice sites was predicted in the vicinity of true splice sites while only one in every 12 was expected to occur in such a region by chance. The persistence of this observation for programs (namely FEXH, GRAIL2, MZEF, GeneID-3, HMMgene and GENSCAN) that can predict all the potential exons (including optimal and sub-optimal) was assessed. In a high proportion (>50%) of the partially correct predicted exons, the incorrect exon ends were located in the vicinity of the real splice sites. Analysis of the distribution of proximal false positives indicated that the splice signals used by the algorithms are not strong enough to discriminate particularly those false predictions that occur within ± 25 nt around the real sites. It is therefore suggested that specialised statistics that can discriminate real splice sites from proximal false positives be incorporated in gene prediction programs.
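The notion of a "proximal" false positive can be made concrete with a short sketch. It assumes that positions are integer coordinates on one sequence and that "proximal" means within 25 nt of the nearest true site, following the ± 25 nt window in the abstract. The function name `split_false_positives` is hypothetical and this is not the procedure used in the paper.

```python
from bisect import bisect_left


def split_false_positives(predicted, true_sites, window=25):
    """Separate false-positive predictions into proximal and distal lists.

    A prediction that coincides with a true site is a true positive and is
    skipped. A false positive is 'proximal' if its nearest true site lies
    within `window` nucleotides.
    """
    true_sorted = sorted(set(true_sites))
    true_set = set(true_sorted)
    proximal, distal = [], []
    for pos in predicted:
        if pos in true_set:
            continue  # true positive, not a false positive
        i = bisect_left(true_sorted, pos)
        # Only the neighbours on either side can be the nearest true site.
        neighbours = true_sorted[max(i - 1, 0):i + 1]
        nearest = min((abs(pos - t) for t in neighbours), default=None)
        if nearest is not None and nearest <= window:
            proximal.append(pos)
        else:
            distal.append(pos)
    return proximal, distal
```

The ratio `len(proximal) / (len(proximal) + len(distal))` then gives the kind of "one in every three" proportion the abstract reports.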
8.
9.
Elisabeth Oppliger Leibundgut, Bendicht Wermuth, Jean-Pierre Colombo, Sabina Liechti-Gallati. Human Genetics, 1996, 97(2): 209-213
Ornithine transcarbamylase (OTC) deficiency, the most common inborn error of the urea cycle, shows X-linked inheritance with frequent new mutations. Using polymerase chain reaction (PCR) amplification of the individual exons including adjacent intron sequences followed by direct sequencing of the amplimers, we identified four new mutations affecting donor splice sites of introns 2, 5, 6, and 8. The mutation at the first position of intron 2 was a G to A exchange associated with acute neonatal hyperammonemia in a male patient at the age of 5 months. A G to C substitution in intron 5 was detected in a boy who developed hypotonia and respiratory distress 2 days after birth, followed by severe hyperammonemia and terminal coma. The intron 6 mutation, a G to T substitution, was detected in a girl presenting with first episodes of vomiting and agitation at the age of 2 months. The mutation in intron 8, also a G to T substitution, caused fatal hyperammonemia and early death at the age of 15 days in a male patient. We present four donor splice site mutations resulting in severe neonatal or very early onset of the disease in three boys and in one female patient. As the GT dinucleotide of the 5' donor splice site is invariant and required for correct splicing, the described mutations may lead to improperly spliced mRNAs and aberrant gene products.
10.
11.
An analysis of the characteristic properties of sugar binding sites was performed on a set of 19 sugar binding proteins. For each site six parameters were evaluated: solvation potential, residue propensity, hydrophobicity, planarity, protrusion and relative accessible surface area. Three of the parameters were found to distinguish the observed sugar binding sites from the other surface patches. These parameters were then used to calculate the probability for a surface patch to be a carbohydrate binding site. The prediction was optimized on a set of 19 non-homologous carbohydrate binding structures and a test prediction was carried out on a set of 40 protein-carbohydrate complexes. The overall accuracy of prediction achieved was 65%. Results were in general better for carbohydrate-binding enzymes than for the lectins, with a rate of success of 87%.
12.
Long interspersed elements (LINEs) are transposable elements that exist in many kinds of eukaryotic genomes, where they have a large effect on genome evolution. There are several thousands to hundreds of thousands of LINE copies in each eukaryotic genome. LINE elements are amplified by a mechanism called retrotransposition, in which a LINE-encoded protein reverse transcribes (copies) its own RNA. We previously isolated two retrotransposition-competent LINEs, ZfL2-1 and ZfL2-2, from zebrafish. Although it has generally been thought that LINEs do not have ‘introns’ (because the LINE RNA is used as the template during retrotransposition), we now show that these two LINEs contain multiple putative functional splice sites. We further show that at least one pair of these splice sites is actually functional in zebrafish cells. Moreover, some of these splice sites are coupled with the splicing signal of a host endogenous gene, thereby generating a new chimeric spliced mRNA variant for this gene. Our results suggest the possible role of these LINE splice sites in modulating retrotransposition and host gene expression.
13.
Kapustin Y, Chan E, Sarkar R, Wong F, Vorechovsky I, Winston RM, Tatusova T, Dibb NJ. Nucleic Acids Research, 2011, 39(14): 5837-5844
We describe a new program called cryptic splice finder (CSF) that can reliably identify cryptic splice sites (css), so providing a useful tool to help investigate splicing mutations in genetic disease. We report that many css are not entirely dormant and are often already active at low levels in normal genes prior to their enhancement in genetic disease. We also report a fascinating correlation between the positions of css and introns, whereby css within the exons of one species frequently match the exact position of introns in equivalent genes from another species. These results strongly indicate that many introns were inserted into css during evolution and they also imply that the splicing information that lies outside some introns can be independently recognized by the splicing machinery and was in place prior to intron insertion. This indicates that non-intronic splicing information had a key role in shaping the split structure of eukaryote genes.
14.
Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5' untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to 'pure' UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by 'coding' noise, thus enhancing significantly the prediction of 5' UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2-3-fold better compared with NetGene2 and GenScan in 5' UTRs. We also tested the 5' UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR.
15.
The rapidly increasing volume of sequence and structure information available for proteins poses the daunting task of determining their functional importance. Computational methods can prove to be very useful in understanding and characterizing the biochemical and evolutionary information contained in this wealth of data, particularly at functionally important sites. Therefore, we perform a detailed survey of compositional and evolutionary constraints at the molecular and biological function level for a large set of known functionally important sites extracted from a wide range of protein families. We compare the degree of conservation across different functional categories and provide detailed statistical insight to decipher the varying evolutionary constraints at functionally important sites. The compositional and evolutionary information at functionally important sites has been compiled into a library of functional templates. We developed a module that predicts functionally important columns (FIC) of an alignment based on the detection of a significant "template match score" to a library template. Our template match score measures an alignment column's similarity to a library template and combines a term explicitly representing a column's residue composition with various evolutionary conservation scores (information content and position-specific scoring matrix-derived statistics). Our benchmarking studies show good sensitivity/specificity for the prediction of functional sites and high accuracy in attributing correct molecular function type to the predicted sites. This prediction method is based on information derived from homologous sequences and no structural information is required. Therefore, this method could be extremely useful for large-scale functional annotation.
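One of the conservation scores named in this abstract, the information content of an alignment column, can be sketched as below. This is the textbook definition (maximum entropy minus observed Shannon entropy, in bits) with a uniform background over 20 amino acids assumed; it is not the paper's template match score, whose exact form the abstract does not give. The function name `column_information` is hypothetical.

```python
import math
from collections import Counter


def column_information(column: str, alphabet_size: int = 20) -> float:
    """Information content (bits) of one alignment column.

    Gap characters ('-') are ignored. Assumes a uniform background, so the
    maximum is log2(alphabet_size) for a perfectly conserved column.
    """
    residues = [r for r in column.upper() if r != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    counts = Counter(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy
```

A fully conserved column reaches the maximum of log2(20) bits, while a column split evenly between two residues scores exactly one bit lower.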
16.
17.
18.
19.
The accuracy of the data we reported in an RNA Letter to the Editor earlier this year on the possible relationship between stop codons and splicing is questioned by Miriami et al. (this issue). We reply here that we see no inaccuracy in our data presentation and offer a possible explanation for their interpretation.