Similar literature
 19 similar articles found; search took 151 ms
1.
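Recent gene-finding programs are markedly better than their predecessors, but their sensitivity and specificity at the exon level are still unsatisfactory. This is because existing programs do not yet recognize biological signal sites, such as splice sites and translation initiation sites, effectively enough. Improving the recognition of each of these signal sites separately would improve overall gene-finding performance. The hidden semi-Markov model describes the structure of the 3' splice site (acceptor) well. An acceptor recognition algorithm developed on this basis was tested on the Burset/Guigo dataset and achieved a better recognition rate than existing algorithms. The model's success also gives a general picture of the branch site and the pyrimidine-rich region upstream of the splice junction, deepening our understanding of acceptor structure and the splicing process.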

2.
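Application of the hidden semi-Markov model to 3′ splice site recognition (in English)   Cited by: 1 (self-citations: 0, citations by others: 1)
Recent gene-finding programs are markedly better than their predecessors, but their sensitivity and specificity at the exon level are still unsatisfactory. This is because existing programs do not yet recognize biological signal sites, such as splice sites and translation initiation sites, effectively enough. Improving the recognition of each of these signal sites separately would improve overall gene-finding performance. The hidden semi-Markov model describes the structure of the 3′ splice site (acceptor) well. An acceptor recognition algorithm developed on this basis was tested on the Burset/Guigo dataset and achieved a better recognition rate than existing algorithms. The model's success also gives a general picture of the branch site and the pyrimidine-rich region upstream of the splice junction, deepening our understanding of acceptor structure and the splicing process.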

3.
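Recent gene-finding programs are markedly better than their predecessors, but their sensitivity and specificity at the exon level are still unsatisfactory. This is because existing programs do not yet recognize biological signal sites, such as splice sites and translation initiation sites, effectively enough. Improving the recognition of each of these signal sites separately would improve overall gene-finding performance. The hidden semi-Markov model describes the structure of the 3′ splice site (acceptor) well. An acceptor recognition algorithm developed on this basis was tested on the Burset/Guigo dataset and achieved a better recognition rate than existing algorithms. The model's success also gives a general picture of the branch site and the pyrimidine-rich region upstream of the splice junction, deepening our understanding of acceptor structure and the splicing process.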

4.
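Splice site recognition based on support vector machines (SVM)   Cited by: 14 (self-citations: 1, citations by others: 13)
Splice site recognition is an important step in gene finding and has long attracted researchers' attention. Because sequences near splice sites are conserved, several methods based on statistical properties have been applied to splice site recognition, but their performance leaves room for improvement. The support vector machine (SVM), a learning machine based on statistical learning theory, has developed rapidly in recent years and has been applied to many pattern recognition problems. This paper applies it to splice site recognition. Because false splice sites far outnumber true ones among sequence samples that obey the GT-AG rule, the paper proposes an SVM-based balanced take-the-minimum method to obtain better recognition. Experimental results show that SVM-based splice site recognition extracts the statistical features of the conserved sequences near the sites more effectively, generalizes better to the test set, and is simpler to use. The result offers a new method for splice site recognition and new leads for structure and site recognition problems in the study of biological macromolecules.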

5.
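Predicting complete gene structures is an important basic problem in current life-science research, and a key step is the accurate identification of splice sites and of the various alternative splicing events. Identifying splice sites and alternative splicing events from transcriptome sequencing (RNA-seq) data is a strategy that has emerged in recent years with next-generation sequencing. Using RNA-seq data from Drosophila melanogaster testis, this work identified 39,718 Drosophila splice sites with TopHat, of which 10,584 are novel. In addition, algorithms based on the different combinations of splice sites were developed to identify each type of alternative splicing; they detected 8,477 alternative splicing events (3,922 of them novel) of four types: alternative donor sites, alternative acceptor sites, intron retention and exon skipping. RT-PCR validated newly identified alternative splicing events in two Drosophila genes and revealed entirely new splice isoforms. These results further show that RNA-seq data can be used effectively to identify splice sites and alternative splicing events, providing new ideas and tools for elucidating splicing mechanisms and the biological functions of alternative splicing.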
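
The abstract does not describe the authors' algorithm, so the following is only a minimal sketch of the general idea of reading alternative donor and acceptor sites off combinations of splice junctions. It assumes plus-strand junctions given as (donor, acceptor) coordinate pairs; the function name and the toy coordinates are hypothetical.

```python
from collections import defaultdict


def find_alternative_sites(junctions):
    """Group plus-strand junctions (donor, acceptor) that share one end.

    Returns two dicts:
      alt_acceptors: donor -> sorted acceptors, when a donor pairs with >1 acceptor
      alt_donors:    acceptor -> sorted donors, when an acceptor pairs with >1 donor
    """
    by_donor = defaultdict(set)
    by_acceptor = defaultdict(set)
    for donor, acceptor in junctions:
        by_donor[donor].add(acceptor)
        by_acceptor[acceptor].add(donor)
    alt_acceptors = {d: sorted(a) for d, a in by_donor.items() if len(a) > 1}
    alt_donors = {a: sorted(d) for a, d in by_acceptor.items() if len(d) > 1}
    return alt_acceptors, alt_donors


# Toy coordinates, invented for illustration only.
junctions = [(100, 200), (100, 230), (320, 400), (350, 400), (500, 600)]
alt_acceptors, alt_donors = find_alternative_sites(junctions)
print(alt_acceptors)  # {100: [200, 230]}
print(alt_donors)     # {400: [320, 350]}
```

A shared donor with two acceptors can also arise from exon skipping, so a real pipeline would need exon annotation or read coverage to tell the event types apart.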

6.
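Recognition of splice sites in human 5′ untranslated regions based on support vector machines   Cited by: 5 (self-citations: 0, citations by others: 5)
Recognizing splice sites in the non-coding regions of genes is a very challenging problem in gene finding, especially for splice sites in the 5′ untranslated region (5′ UTR). Unlike ordinary splice sites, 5′ UTR splice sites are not flanked by a transition between coding and non-coding states, so the usual splice site recognition algorithms perform poorly in untranslated regions. This paper uses a support vector machine to recognize splice sites in the 5′ UTR. To improve accuracy, kernel parameters are chosen with a method based on a matrix similarity measure, which determines suitable kernel parameters simply and quickly and thereby improves the kernel's recognition performance. Experiments confirm that the support vector machine, after parameter selection, recognizes 5′ UTR splice sites well.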

7.
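Traditional splice site recognition methods need long sequences and many parameters. To address this, the paper proposes a variable length Markov model based on the Kullback-Leibler (KL) divergence (KL-VLMM). The model improves on the variable length Markov model by using the KL divergence, instead of the original probability ratio, to decide the direction in which a sequence is extended. This effectively improves the recognition of feature sequences, and the model order drops from second to first, reducing the algorithm's space complexity. The recognition performance of the model was compared with that of other traditional methods on the human splice site database N269. The experimental results show that KL-VLMM predicts human splice sites better.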
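
The abstract does not say how the KL divergence enters the model's extension rule, so the sketch below shows only the quantity itself: the KL divergence, in bits, between two base distributions over the same alphabet. The function name and the example frequencies are hypothetical, and the code assumes the second distribution is non-zero wherever the first is.

```python
import math


def kl_divergence(p, q):
    """D(P || Q) in bits for two distributions given as dicts over the same symbols.

    Terms with p[s] == 0 contribute nothing; q[s] must be > 0 wherever p[s] > 0.
    """
    return sum(p[s] * math.log2(p[s] / q[s]) for s in p if p[s] > 0)


# Hypothetical base frequencies at one position versus a uniform background.
site = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(kl_divergence(site, background))
```

The divergence is zero when the two distributions are identical and grows as the position becomes more biased, which is why it can serve as a measure of how informative a context is.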

8.
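To improve the accuracy of splice site recognition in untranslated regions, a method combining statistical probabilities with a support vector machine is proposed. The method has two stages. The first stage describes untranslated region (UTR) sequences statistically, expressing features such as inter-base correlation, position specificity and conservation as probabilities; these probability parameters serve as the input vector for the second stage, in which a support vector machine (SVM) with a polynomial kernel recognizes the splice sites. Tests on a human 5′ UTR splice site dataset show that the method recognizes untranslated-region splice sites well.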

9.
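Recognition of human splice sites by support vector machines with a low-dimensional input space   Cited by: 1 (self-citations: 0, citations by others: 1)
Recognition of eukaryotic splice sites, as gene [text missing in the source] vectors formed from matrices are used to represent the sequences, and a support vector machine finds the optimal hyperplane in a six-dimensional vector space to separate true splice sites from false ones. The results show that this algorithm predicts human splice sites well. Compared with some other algorithms, it has fewer parameters and higher accuracy.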

10.
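High-accuracy splice site recognition based on machine learning is key to annotating eukaryotic genomes. This paper uses the chi-square test to determine the sequence window length, builds a chi-square statistic difference table to extract positional features, and combines them with dinucleotide frequencies to represent sequences. Because positive and negative splice site samples are highly imbalanced, ten support vector machine classifiers, each trained on balanced positive and negative samples, are built and combined by weighted voting, which effectively handles the imbalanced classification problem. Independent tests on the HS3D dataset give prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than the reference methods. Positional features based on the chi-square statistic difference table represent DNA sequences effectively and are promising for recognizing signal sites in molecular sequences.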
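
The abstract does not state how the ten balanced training sets are drawn or how the vote weights are set. As a rough sketch under those assumptions, one common scheme pairs all positive samples with disjoint slices of the shuffled negatives and then combines the per-classifier predictions by a weighted sign vote. Both function names are hypothetical, and the classifiers themselves are left out.

```python
import random


def balanced_subsets(positives, negatives, k, seed=0):
    """Pair all positives with k disjoint, roughly equal slices of the shuffled negatives."""
    rng = random.Random(seed)
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    return [(positives, shuffled[i::k]) for i in range(k)]


def weighted_vote(votes, weights):
    """votes are +1/-1 predictions; return +1 if the weighted sum is positive, else -1."""
    total = sum(w * v for v, w in zip(votes, weights))
    return 1 if total > 0 else -1


# Toy data: 3 positives and 30 negatives give ten training sets with 3 negatives each.
subsets = balanced_subsets(["p1", "p2", "p3"], [f"n{i}" for i in range(30)], k=10)
print([len(neg) for _, neg in subsets])

# One confident classifier can outvote two weaker ones.
print(weighted_vote([1, -1, -1], [0.9, 0.3, 0.3]))
```

Each training set would then be passed to its own classifier, with the weights derived from, for example, validation accuracy; that choice is an assumption here, not something the abstract specifies.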

11.
12.
13.
A new method which predicts internal exon sequences in human DNA has been developed. The method is based on a splice site prediction algorithm that uses the linear discriminant function to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions. The accuracy of our splice site recognition function is 97% for donor splice sites and 96% for acceptor splice sites. For exon prediction, we combine in a discriminant function the characteristics describing the 5'-intron region, donor splice site, coding region, acceptor splice site and 3'-intron region for each open reading frame flanked by GT and AG base pairs. The accuracy of precise internal exon recognition on a test set of 451 exon and 246693 pseudoexon sequences is 77% with a specificity of 79%. The recognition quality computed at the level of individual nucleotides is 89% for exon sequences and 98% for intron sequences. This corresponds to a correlation coefficient for exon prediction of 0.87. The precision of this approach is better than other methods and has been tested on a larger data set. We have also developed a means for predicting exon-exon junctions in cDNA sequences, which can be useful for selecting optimal PCR primers.

14.

Background  

Accurate selection of splice sites during the splicing of precursors to messenger RNA requires both relatively well-characterized signals at the splice sites and auxiliary signals in the adjacent exons and introns. We previously described a feature generation algorithm (FGA) that is capable of achieving high classification accuracy on human 3' splice sites. In this paper, we extend the splice-site prediction to 5' splice sites and explore the generated features for biologically meaningful splicing signals.

15.
Vertebrate internal exons are usually between 50 and 400 nt long; exons outside this size range may require additional exonic and/or intronic sequences to be spliced into the mature mRNA. The mouse polymeric immunoglobulin receptor gene has a 654 nt exon that is efficiently spliced into the mRNA. We have examined this exon to identify features that contribute to its efficient splicing despite its large size; a large constitutive exon has not been studied previously. We found that a strong 5′ splice site is necessary for this exon to be spliced intact, but the splice sites alone were not sufficient to efficiently splice a large exon. At least two exonic sequences and one evolutionarily conserved intronic sequence also contribute to recognition of this exon. However, these elements have redundant activities as they could only be detected in conjunction with other mutations that reduced splicing efficiency. Several mutations activated cryptic 5′ splice sites that created smaller exons. Thus, the balance between use of these potential sites and the authentic 5′ splice site must be modulated by sequences that repress or enhance use of these sites, respectively. Also, sequences that enhance cryptic splice site use must be absent from this large exon.

16.
17.
The oomycetes, a distinct phylogenetic lineage of fungus-like microorganisms, are heterokonts (stramenopiles) belonging to the supergroup Chromalveolata. Although the complete genomic sequences of a number of oomycetes have been reported, little information regarding the introns therein is available. Here, we investigated the introns of Phytophthora sojae, a pathogen that causes soybean root and stem rot, by a comparative analysis of genomic sequences and expressed sequence tags. A total of 4013 introns were identified, of which 96.6% contained canonical splice sites. The P. sojae genome possessed features distinct from other organisms at 5' splice sites, polypyrimidine tracts, branch sites, and 3' splice sites. Diverse repeating sequences, ranging from 2 to 10 nucleotides in length, were found at more than half of the intron-exon boundaries. Furthermore, 122 genes underwent alternative splicing. These data indicate that P. sojae has unique splicing mechanisms, and recognition of those mechanisms may lead to more accurate predictions of the location of introns in P. sojae and even other oomycete species.

18.
A database (SpliceDB) of known mammalian splice site sequences has been developed. We extracted 43 337 splice pairs from mammalian divisions of the gene-centered Infogene database, including sites from incomplete or alternatively spliced genes. Known EST sequences supported 22 815 of them. After discarding sequences with putative errors and ambiguous location of splice junctions the verified dataset includes 22 489 entries. Of these, 98.71% contain canonical GT-AG junctions (22 199 entries) and 0.56% have non-canonical GC-AG splice site pairs. The remainder (0.73%) occurs in a lot of small groups (with a maximum size of 0.05%). We especially studied non-canonical splice sites, which comprise 3.73% of GenBank annotated splice pairs. EST alignments allowed us to verify only the exonic part of splice sites. To check the conserved dinucleotides we compared sequences of human non-canonical splice sites with sequences from the high throughput genome sequencing project (HTG). Out of 171 human non-canonical and EST-supported splice pairs, 156 (91.23%) had a clear match in the human HTG. They can be classified after sequence analysis as: 79 GC-AG pairs (of which one was an error that was corrected to GC-AG), 61 errors corrected to GT-AG canonical pairs, six AT-AC pairs (of which two were errors corrected to AT-AC), one case was produced from a non-existent intron, seven cases were found in HTG that were deposited to GenBank and finally there were only two other cases left of supported non-canonical splice pairs. The information about verified splice site sequences for canonical and non-canonical sites is presented in SpliceDB with the supporting evidence. We also built weight matrices for the major splice groups, which can be incorporated into gene prediction programs. SpliceDB is available at the computational genomic Web server of the Sanger Centre: http://genomic.sanger.ac.uk/spldb/SpliceDB.html and at http://www.softberry.com/spldb/SpliceDB.html.
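
To make the GT-AG, GC-AG and AT-AC labels concrete, here is a small sketch that classifies an intron by its terminal dinucleotides and rechecks the canonical fraction from the two counts quoted in the abstract. The function name and the example sequences are hypothetical and are not taken from SpliceDB.

```python
def classify_intron(intron):
    """Label an intron by its first two and last two bases (donor-acceptor dinucleotides)."""
    seq = intron.upper()
    if len(seq) < 4:
        raise ValueError("intron too short to have both dinucleotides")
    pair = f"{seq[:2]}-{seq[-2:]}"
    if pair == "GT-AG":
        return "canonical"
    return f"non-canonical ({pair})"


# Invented intron sequences, for illustration only.
print(classify_intron("GTAAGTcttttcttcAG"))  # canonical
print(classify_intron("GCAAGTcttttcttcAG"))  # non-canonical (GC-AG)
print(classify_intron("ATATCCttttccttAC"))   # non-canonical (AT-AC)

# Counts quoted in the abstract: 22 199 canonical pairs out of 22 489 verified entries.
print(round(100 * 22199 / 22489, 2))  # 98.71
```

The last line reproduces the 98.71% figure given in the abstract.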

19.
The activation of cryptic 5' splice sites (5' SSs) is often related to human hereditary diseases. DNA-based mutation screening strategies are commonly used to recognize cryptic 5' SSs, because features of the local DNA sequence can influence the choice of cryptic 5' SSs. To improve the identification of cryptic 5' SSs, we developed a structure-based method, named SPO (structure profiles and odds measure), which combines two parameters, the structural feature derived from the hydroxyl radical cleavage pattern and an odds measure, to assess the likelihood that a cryptic 5' SS is activated in competition with its paired authentic 5' SS. The SPO algorithm achieves higher prediction accuracy than current tools for identifying activated cryptic 5' SSs, including MaxEnt, MDD, Markov model, weight matrix model, Shapiro and Senapathy matrix, R(i) and ΔG. In addition, the predicted ΔSPO scores from the SPO algorithm exhibited a greater degree of correlation with the strength of cryptic 5' SS activation than the scores from the other seven methods. In conclusion, the SPO algorithm provides an optimal identification of cryptic 5' SSs, can be applied in designing mutagenesis experiments for various splicing events and may be helpful to investigate the relationship between structural variants and human hereditary diseases.
