1.
2.
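Splice site recognition based on support vector machines (SVM) Total citations: 14, self-citations: 1, citations by others: 13
As an important step in gene identification, splice site recognition has long attracted researchers' attention. Given the sequence conservation near splice sites, several methods based on statistical properties have been applied to splice site recognition, but their performance still needs improvement. Support vector machines (SVMs), a newer type of learning machine based on statistical learning theory, have developed considerably in recent years and have been applied to many pattern recognition problems. This paper applies SVMs to splice site recognition and, because false splice sites far outnumber true sites among sequence samples that satisfy the GT-AG rule, proposes an SVM-based balanced minimum-selection method (平衡取小法) to improve recognition. Experimental results show that SVM-based splice site recognition better extracts the statistical features of the conserved sequences near the sites, generalizes better to the test set, and is simpler to use. This result offers a new method for splice site recognition and new leads for recognizing structures and sites in the study of biological macromolecules.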
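The class imbalance mentioned in this abstract follows directly from the GT-AG rule: every GT or AG dinucleotide is a candidate site, but only a few are real. The sketch below is an illustration only, not the paper's method. The function name `candidate_sites` and the toy sequence are assumptions made for this example.

```python
def candidate_sites(seq):
    """Return (donor_positions, acceptor_positions) for a DNA sequence.

    Every GT dinucleotide is a candidate donor site and every AG
    dinucleotide is a candidate acceptor site (GT-AG rule). Positions
    are 0-based indices of the first base of the dinucleotide.
    """
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


if __name__ == "__main__":
    # Toy sequence (hypothetical, not taken from any dataset).
    donors, acceptors = candidate_sites("AAGGTAAGTCCTTTCAGGTCAG")
    print("candidate donors:", donors)
    print("candidate acceptors:", acceptors)
```

Even in this short toy sequence there are several candidates of each type, while a real intron would use at most one donor and one acceptor, which is why a classifier trained on such candidates sees many more negatives than positives.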
3.
4.
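Recent gene-finding software is markedly better than earlier software, but sensitivity and specificity at the exon level remain unsatisfactory. This is because existing programs do not yet recognize biological signal sites, such as splice sites and translation start sites, effectively enough. Improving the recognition of each of these signal sites individually would improve overall gene-finding performance. A hidden semi-Markov model describes the structure of the 3′ splice site (acceptor) well. An acceptor recognition algorithm built on this model was tested on the Burset/Guigo dataset and achieved a higher recognition rate than existing algorithms. The model's success also gives an overview of the branch site and the pyrimidine-rich region upstream of the splice junction, deepening the understanding of acceptor structure and of the splicing process.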
7.
8.
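Recognizing functional sites in DNA sequences is an active research topic in bioinformatics, and splice site recognition is one such problem. To make full use of the characteristic patterns of splice sites and thereby recognize them better, a splice site recognition system based on an improved Winnow algorithm was built. Compared with other methods, the improved Winnow algorithm is more robust, suits high-dimensional feature spaces, can combine several kinds of pattern information, and performs well even when many irrelevant features are present. During training, the feature set is also pruned: features that contribute almost nothing to recognition are removed, which has a negligible effect on the results and makes the algorithm more efficient. Experiments show that the improved Winnow algorithm recognizes splice sites well, with several performance measures matching or exceeding those of currently popular splice site recognition software.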
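For readers unfamiliar with Winnow, the following is a minimal sketch of the textbook Winnow2 update rule (multiplicative promotion and demotion), not the improved variant or the feature pruning described in this abstract, whose details are not given here. The one-hot (position, base) encoding, the function names, and the toy windows are all assumptions made for illustration.

```python
BASES = "ACGT"


def encode(window):
    """Map a DNA window to the indices of its active one-hot features.

    Feature index = 4 * position + base index (assumed encoding).
    """
    return [4 * pos + BASES.index(base) for pos, base in enumerate(window)]


def train_winnow(samples, n_features, alpha=2.0, epochs=20):
    """Train textbook Winnow2 on (active_feature_indices, label) pairs."""
    weights = [1.0] * n_features
    threshold = float(n_features)
    for _ in range(epochs):
        mistakes = 0
        for active, label in samples:
            predicted = sum(weights[i] for i in active) >= threshold
            if predicted == label:
                continue
            mistakes += 1
            # Promote active features on a missed positive,
            # demote them on a false positive.
            factor = alpha if label else 1.0 / alpha
            for i in active:
                weights[i] *= factor
        if mistakes == 0:
            break  # all training samples classified correctly
    return weights, threshold


def predict(weights, threshold, active):
    """Classify one sample given its active feature indices."""
    return sum(weights[i] for i in active) >= threshold


# Toy data (hypothetical): positives carry "GT" at positions 2-3.
POSITIVES = ["AAGT", "CCGT", "TAGT"]
NEGATIVES = ["AAGA", "CCCT", "TTAC"]
SAMPLES = [(encode(w), True) for w in POSITIVES] + [(encode(w), False) for w in NEGATIVES]
WEIGHTS, THRESHOLD = train_winnow(SAMPLES, n_features=16)
```

Because updates are multiplicative and touch only the active features, weights of irrelevant features stay small, which is the property behind the robustness to irrelevant features mentioned in the abstract.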
9.
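High-accuracy splice site recognition based on machine learning is key to eukaryotic genome annotation. This paper uses the chi-square test to determine the sequence window length, builds a chi-square statistical difference table to extract positional features, and combines these with dinucleotide frequencies to represent sequences. To handle the severe imbalance between positive and negative splice site samples, ten support vector machine classifiers are built, each on balanced positive and negative samples, and their outputs are combined by weighted voting, which effectively addresses the imbalanced classification problem. Independent tests on the HS³D dataset show prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than the reference methods. Positional features based on the chi-square statistical difference table represent DNA sequences effectively and are promising for recognizing signal sites in molecular sequences.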
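As a rough illustration of how a chi-square statistic can rank positions in a sequence window, the sketch below computes the standard Pearson chi-square for a 2 x 4 contingency table (class by base) at each position. It is not the paper's chi-square statistical difference table, whose construction is not described in this abstract. The function name and the toy windows are assumptions.

```python
from collections import Counter


def position_chi_square(positives, negatives, alphabet="ACGT"):
    """Return one Pearson chi-square score per window position.

    At each position a 2 x 4 table (class x base) is formed and
    sum((observed - expected)^2 / expected) is computed. Bases that
    never occur at a position are skipped to avoid dividing by zero.
    """
    n_pos, n_neg = len(positives), len(negatives)
    total = n_pos + n_neg
    scores = []
    for pos in range(len(positives[0])):
        pos_counts = Counter(seq[pos] for seq in positives)
        neg_counts = Counter(seq[pos] for seq in negatives)
        chi2 = 0.0
        for base in alphabet:
            column_total = pos_counts[base] + neg_counts[base]
            if column_total == 0:
                continue
            for observed, row_total in ((pos_counts[base], n_pos),
                                        (neg_counts[base], n_neg)):
                expected = row_total * column_total / total
                chi2 += (observed - expected) ** 2 / expected
        scores.append(chi2)
    return scores


if __name__ == "__main__":
    # Toy windows (hypothetical): true sites share "GT" at positions 1-2.
    print(position_chi_square(["AGT", "CGT"], ["AAC", "CCA"]))
```

Positions with a high score are those where base composition differs most between true and false sites, so a window could be trimmed to the span where scores stay above a chosen significance threshold.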
10.
11.
Prediction of splice sites in non-coding regions of genes is one of the most challenging aspects of gene structure recognition. We perform a rigorous analysis of such splice sites embedded in human 5' untranslated regions (UTRs), and investigate correlations between this class of splice sites and other features found in the adjacent exons and introns. By restricting the training of neural network algorithms to 'pure' UTRs (not extending partially into protein coding regions), we for the first time investigate the predictive power of the splicing signal proper, in contrast to conventional splice site prediction, which typically relies on the change in sequence at the transition from protein coding to non-coding. By doing so, the algorithms were able to pick up subtler splicing signals that were otherwise masked by 'coding' noise, thus enhancing significantly the prediction of 5' UTR splice sites. For example, the non-coding splice site predicting networks pick up compositional and positional bias in the 3' ends of non-coding exons and 5' non-coding intron ends, where cytosine and guanine are over-represented. This compositional bias at the true UTR donor sites is also visible in the synaptic weights of the neural networks trained to identify UTR donor sites. Conventional splice site prediction methods perform poorly in UTRs because the reading frame pattern is absent. The NetUTR method presented here performs 2-3-fold better compared with NetGene2 and GenScan in 5' UTRs. We also tested the 5' UTR trained method on protein coding regions, and discovered, surprisingly, that it works quite well (although it cannot compete with NetGene2). This indicates that the local splicing pattern in UTRs and coding regions is largely the same. The NetUTR method is made publicly available at www.cbs.dtu.dk/services/NetUTR.
12.
Prediction of recognition sites for genomic replication of classical swine fever virus with information analysis Total citations: 1, self-citations: 0, citations by others: 1
To explore the mechanism of genomic replication of classical swine fever virus (CSFV), and thereby lay a basis for investigating its pathogenicity, information theory is introduced in connection with statistical mechanics, so that small-sample statistics appears naturally as a consequence of the Bayesian approach. Furthermore, a selection rule for identifying the pattern of a recognition site for an RNA-binding protein is proposed by means of the maximum entropy principle. On this basis, the information contents of the 3'-untranslated regions (3'UTRs) of the genomes of 20 CSFV strains and the 5'-untranslated regions (5'UTRs) of the genomes of 58 CSFV strains are analyzed with a computational algorithm in a reduction mode, and the 3'UTR sites of 20 strains and 5'UTR sites of 58 strains containing important motifs are extracted from the unaligned RNA sequences of unequal lengths. These sites, whose sequence and structure patterns resemble the putative cis elements related to the regulation of genomic replication, are identified as the potential recognition sites in 3'UTRs and 5'UTRs for the CSFV replicase responsible for genomic replication, and this identification is supported to some extent by experimental evidence. Finally, the information analysis allows a presumption to be made about the CSFV RNA replication initiation mechanism.
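The notion of the information content of a site can be illustrated with the usual per-position measure, log2(4) minus the Shannon entropy of the observed bases. The sketch below is a simplification: it assumes the sites are already aligned and of equal length, and it omits the small-sample (Bayesian) treatment and the maximum-entropy selection rule that the abstract describes. The function name and toy sites are assumptions.

```python
import math
from collections import Counter


def information_content(sites, alphabet="ACGU"):
    """Return the information content in bits at each aligned position.

    For each column, the value is log2(len(alphabet)) minus the Shannon
    entropy of the observed base frequencies. No small-sample
    correction is applied in this sketch.
    """
    n_sites = len(sites)
    max_bits = math.log2(len(alphabet))
    result = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        entropy = -sum((c / n_sites) * math.log2(c / n_sites)
                       for c in counts.values())
        result.append(max_bits - entropy)
    return result


if __name__ == "__main__":
    # Toy aligned RNA sites (hypothetical).
    print(information_content(["ACGU", "AGGU", "AUGA", "AAGC"]))
```

Fully conserved columns score 2 bits and columns with uniformly mixed bases score 0 bits, so summing over positions gives a simple measure of how strongly a candidate site pattern is conserved.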
13.
15.
16.
Adenosine to inosine (A-to-I) RNA editing is the most abundant editing event in animals. It converts adenosine to inosine in double-stranded RNA regions through the action of the adenosine deaminase acting on RNA (ADAR) proteins. Editing of pre-mRNA coding regions can alter the protein codon and increase functional diversity. However, most of the A-to-I editing sites occur in the non-coding regions of pre-mRNA or mRNA and non-coding RNAs. Untranslated regions (UTRs) and introns are located in pre-mRNA non-coding regions, thus A-to-I editing can influence gene expression by nuclear retention, degradation, alternative splicing, and translation regulation. Non-coding RNAs such as microRNA (miRNA), small interfering RNA (siRNA) and long non-coding RNA (lncRNA) are related to pre-mRNA splicing, translation, and gene regulation. A-to-I editing could therefore affect the stability, biogenesis, and target recognition of non-coding RNAs. Finally, it may influence the function of non-coding RNAs, resulting in regulation of gene expression. This review focuses on the function of ADAR-mediated RNA editing on mRNA non-coding regions (UTRs and introns) and non-coding RNAs (miRNA, siRNA, and lncRNA).
17.
18.
The phenomenon of nonsense-associated altered splicing raises the possibility that the recognition of in-frame nonsense codons is used generally for exon identification during pre-mRNA splicing. However, nonsense codon frequencies in pseudo exons and in regions flanking 5' splice sites are no greater than that expected by chance, arguing against the widespread use of this strategy as a means of rejecting potential splice sites.
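A comparison of observed nonsense codon frequency against chance could be set up along the following lines. This is a sketch under stated assumptions, not the authors' analysis: chance is modeled here as independent bases drawn at the sequence's own composition, and the function names and toy sequence are hypothetical.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def stop_codon_counts(seq):
    """Count stop codons in each of the three forward reading frames."""
    seq = seq.upper()
    counts = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        counts.append(sum(codon in STOP_CODONS for codon in codons))
    return counts


def expected_stop_fraction(seq):
    """Probability that a random codon is a stop codon.

    Assumes bases are independent and drawn at the frequencies
    observed in `seq` (a simple null model chosen for this sketch).
    """
    seq = seq.upper()
    freq = {base: seq.count(base) / len(seq) for base in "ACGT"}
    return sum(freq[c[0]] * freq[c[1]] * freq[c[2]] for c in STOP_CODONS)


if __name__ == "__main__":
    toy = "ATGTAAGGTGACC"  # hypothetical sequence
    print("stops per frame:", stop_codon_counts(toy))
    print("expected stop fraction:", expected_stop_fraction(toy))
```

With equal base frequencies the null model gives 3/64 stop codons per codon, so a region would need clearly more in-frame stops than that to suggest that nonsense codons are being used to reject it.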
19.