
High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting
Citation: ZENG Ying, CHEN Yuan, YUAN Zhe-Ming. High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting[J]. Progress in Biochemistry and Biophysics, 2019, 46(5): 496-503.
Authors: ZENG Ying, CHEN Yuan, YUAN Zhe-Ming
Institutions: Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha 410128, China; Orient Science & Technology College, Hunan Agricultural University, Changsha 410128, China
Funding: Supported by the National Natural Science Foundation of China (61701177), the Natural Science Foundation of Hunan Province (2018JJ3225), and the Scientific Research Project of the Hunan Provincial Education Department (17A096).
Abstract: High-accuracy splice site recognition based on machine learning is key to eukaryotic genome annotation. In this paper, we use the chi-square test to determine the sequence window length, construct a chi-square statistical difference table to extract positional features, and combine these with dinucleotide frequencies to characterize sequences. Because positive and negative splice site samples are highly imbalanced, we build 10 support vector machine (SVM) classifiers, each trained on balanced positive and negative samples, and combine them by weighted voting, which effectively addresses the imbalanced classification problem. Independent tests on the HS3D dataset show prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than those of the reference methods. Positional features based on the chi-square statistical difference table effectively characterize DNA sequences and show promise for recognizing signal sites in molecular sequences.
Keywords: splice site; positional features; chi-square statistical difference table; weighted voting; support vector machine (SVM)
Received: 2018/10/15
Revised: 2019/3/21
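
The abstract names the pipeline's steps but gives no formulas, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. Assumptions: the "statistical difference table" is taken to be a signed 2x2 Pearson chi-square score per position and base; the balanced training sets are formed by random under-sampling of negatives; each vote is weighted by the classifier's accuracy on the training data (a held-out set would be more appropriate); and a nearest-centroid classifier stands in for the SVM so the example needs only the standard library. All function names and the toy sequences are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"

def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else (a + b + c + d) * (a * d - b * c) ** 2 / denom

def difference_table(pos_seqs, neg_seqs):
    """table[i][base]: signed chi-square score of `base` at position i
    (positive when the base is relatively more common in true sites)."""
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    table = []
    for i in range(len(pos_seqs[0])):
        pc = Counter(s[i] for s in pos_seqs)
        nc = Counter(s[i] for s in neg_seqs)
        row = {}
        for base in BASES:
            a, c = pc[base], nc[base]
            sign = 1.0 if a * n_neg >= c * n_pos else -1.0
            row[base] = sign * chi2_2x2(a, n_pos - a, c, n_neg - c)
        table.append(row)
    return table

def encode(seq, table):
    """Positional scores from the table plus 16 dinucleotide frequencies."""
    positional = [table[i].get(b, 0.0) for i, b in enumerate(seq)]
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = max(len(seq) - 1, 1)
    return positional + [pairs["".join(p)] / total for p in product(BASES, repeat=2)]

def train_centroid(pos, neg):
    """Stand-in for an SVM: predict +1 if x is nearer the positive centroid."""
    cp = [sum(col) / len(pos) for col in zip(*pos)]
    cn = [sum(col) / len(neg) for col in zip(*neg)]
    def predict(x):
        dp = sum((u - v) ** 2 for u, v in zip(x, cp))
        dn = sum((u - v) ** 2 for u, v in zip(x, cn))
        return 1 if dp <= dn else -1
    return predict

def accuracy(predict, pos, neg):
    hits = sum(predict(x) == 1 for x in pos) + sum(predict(x) == -1 for x in neg)
    return hits / (len(pos) + len(neg))

def train_ensemble(pos, neg, n_models=10, seed=0):
    """Train n_models classifiers, each on all positives plus an equal-size
    random draw of negatives; weight each by its accuracy (assumed rule)."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        predict = train_centroid(pos, rng.sample(neg, len(pos)))
        models.append((accuracy(predict, pos, neg), predict))
    return models

def weighted_vote(models, x):
    """Return +1 or -1 from the accuracy-weighted sum of member votes."""
    return 1 if sum(w * predict(x) for w, predict in models) >= 0 else -1

# Toy 8-nt windows; every window contains GT at positions 2-3, so the
# classifier must rely on the flanking positions, as with real decoys.
pos_seqs = ["AGGTAAGT", "CGGTGAGT", "AGGTAAGA"]
neg_seqs = ["ACGTTTCA", "TTGTCCGA", "GAGTACCT", "CCGTTTAC", "TAGTCACG", "ATGTGCTA"]
table = difference_table(pos_seqs, neg_seqs)
models = train_ensemble([encode(s, table) for s in pos_seqs],
                        [encode(s, table) for s in neg_seqs])
print(weighted_vote(models, encode("AGGTAAGT", table)))
```

In practice the stand-in classifier would be replaced by a real SVM, and the table would be built from training data only, so that test sequences do not influence the positional scores.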
This article is indexed in CNKI and other databases.