
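Prediction of Escherichia coli Promoters Based on Convolutional Neural Networks
Cite this article: PENG Bao-Cheng, ZHANG Xiao-Wei, LIU Yang, FAN Guo-Liang. Prediction of Escherichia coli Promoters Based on Convolutional Neural Networks[J]. Progress in Biochemistry and Biophysics, 2022, 49(7): 1334-1347.
Authors: PENG Bao-Cheng  ZHANG Xiao-Wei  LIU Yang  FAN Guo-Liang
Affiliations: 1) School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021; 2) Department of Rheumatology and Immunology, the First Affiliated Hospital of Inner Mongolia Medical University, Hohhot 010050
Funding: Supported by the National Natural Science Foundation of China (62063024), the Scientific Research Project of Higher Education Institutions of the Inner Mongolia Autonomous Region (NJZY20005), and the Inner Mongolia University Undergraduate Innovation and Entrepreneurship Training Program (201912240).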
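Abstract: Objective Prediction models based on position-specific scoring matrices (PSSM) have achieved good results, and various PSSM-based optimization methods continue to be developed, but their accuracy remains relatively low. To further improve prediction accuracy, this paper carries out further research based on the convolutional neural network (CNN) algorithm. Methods PSSM was used to convert promoter sequences into numeric matrices, which were then classified by the CNN algorithm. Three types of promoter sequences from Escherichia coli K-12 (E. coli K-12, hereinafter E. coli), Sigma38, Sigma54 and Sigma70, served as the positive set, and sequences from coding and non-coding regions served as the negative set. Results In the binary classification of E. coli promoters, accuracy reached 99%, and the success rate of promoter prediction was close to 100%. In the three-class classification of Sigma38, Sigma54 and Sigma70 promoters, prediction accuracy was 98%, and the accuracy for each type of sequence reached 98% or more. Finally, four-class classification of the three promoter types together with either coding or non-coding sequences gave an accuracy of 0.98; ten-fold cross-validation on balanced samples of the three Sigma promoters gave a prediction precision of 0.95 or more for each, with a Hamming distance of 0.016 and a Kappa coefficient of 0.97. Conclusion Compared with other classification algorithms such as the support vector machine (SVM), the CNN classification algorithm has advantages, and because of these advantages the encoding scheme can also be simplified.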

Keywords: Escherichia coli  position-specific scoring matrix  convolutional neural network  multi-class classification
Received: 2021/5/13
Revised: 2021/7/23
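
The Methods step above, turning a letter sequence into a numeric matrix with a PSSM, can be illustrated with a minimal sketch. The abstract does not give the paper's exact encoding, so everything here is an assumption: the log-odds form, the uniform background of 0.25, the pseudocount, the 4 x L output layout, the function names `build_pssm` and `encode`, and the toy sequences.

```python
import math

ALPHABET = "ACGT"

def build_pssm(seqs, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM from equal-length sequences.

    Returns one dict (base -> score) per position. The pseudocount and
    uniform background are illustrative assumptions, not the paper's values.
    """
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all training sequences must have equal length")
    pssm = []
    for pos in range(length):
        counts = {b: pseudocount for b in ALPHABET}
        for s in seqs:
            counts[s[pos]] += 1
        total = sum(counts.values())
        pssm.append({b: math.log2((counts[b] / total) / background)
                     for b in ALPHABET})
    return pssm

def encode(seq, pssm):
    """Encode a sequence as a 4 x L matrix (rows A, C, G, T).

    Each column holds the PSSM score in the row of the observed base
    and 0.0 elsewhere, giving a CNN-ready numeric input.
    """
    return [[pssm[i][b] if seq[i] == b else 0.0 for i in range(len(seq))]
            for b in ALPHABET]

# Toy data for illustration only (not from the paper's dataset).
toy_promoters = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
toy_pssm = build_pssm(toy_promoters)
toy_matrix = encode("TATAAT", toy_pssm)
```

A matrix of this shape could be fed to a 2D or 1D convolutional layer; the choice of layout is a design decision that the abstract leaves open.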

Prediction of E. coli Promoters Based on CNN
PENG Bao-Cheng, ZHANG Xiao-Wei, LIU Yang and FAN Guo-Liang. Prediction of E. coli Promoters Based on CNN[J]. Progress in Biochemistry and Biophysics, 2022, 49(7): 1334-1347.
Authors: PENG Bao-Cheng  ZHANG Xiao-Wei  LIU Yang and FAN Guo-Liang
Institution: 1) School of Physical Science and Technology, Inner Mongolia University, Hohhot 010021, China; 2) Department of Rheumatology, the First Affiliated Hospital, Inner Mongolia Medical University, Hohhot 010050, China
Abstract: Objective Prediction models based on PSSM (position-specific scoring matrix) have achieved good results, and various PSSM-based optimization methods continue to be developed. However, their accuracy remains relatively low. To further improve prediction accuracy, this paper carries out further research based on the CNN algorithm. Methods PSSM is used to convert each letter sequence into a numeric matrix, which is then classified by a convolutional neural network (CNN). The three types of promoter sequences, Sigma38, Sigma54 and Sigma70, of E. coli K-12 (Escherichia coli K-12, hereinafter referred to as Escherichia coli) are used as the positive set, and sequences from the coding and non-coding regions of Escherichia coli are used as the negative set. Results In the two-class prediction of Escherichia coli promoters, the accuracy reaches 99%, and the success rate of promoter prediction is close to 100%. In the three-class classification of Sigma38, Sigma54 and Sigma70 promoters, the prediction accuracy is 98%, and the prediction accuracy for each type of sequence reaches 0.98 or more. Finally, we performed four-class classification of the three promoter types, Sigma38, Sigma54 and Sigma70, together with either coding or non-coding sequences, and the prediction accuracy was 0.98. In ten-fold cross-validation on balanced samples of the three Sigma promoters, the prediction accuracy reaches 0.95 or more for each, the Hamming distance is 0.016, and the Kappa coefficient is 0.97. Conclusion Compared with other classification algorithms such as SVM (support vector machine), the CNN classification algorithm has advantages, and because of these advantages the encoding method can also be simplified.
Keywords: Escherichia coli  position-specific scoring matrix  CNN  multi-classification
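
The abstract reports a Hamming distance and a Kappa coefficient for the multi-class results. As a rough sketch of how such figures are commonly computed, the code below assumes the single-label Hamming loss (fraction of mismatched labels) and Cohen's kappa; the abstract does not state which exact definitions the authors used, and the function names and toy labels are hypothetical.

```python
from collections import Counter

def hamming_loss(y_true, y_pred):
    """Fraction of samples whose predicted label differs from the true label."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for chance, (p_o - p_e) / (1 - p_e).

    Undefined (division by zero) when chance agreement p_e equals 1,
    for example when both label lists contain a single class.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: product of marginal label frequencies, summed over classes.
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy labels for illustration only (not results from the paper).
y_true = ["Sigma38", "Sigma54", "Sigma70", "Sigma70"]
y_pred = ["Sigma38", "Sigma54", "Sigma70", "Sigma54"]
loss = hamming_loss(y_true, y_pred)
kappa = cohen_kappa(y_true, y_pred)
```

Kappa is useful alongside accuracy here because it discounts the agreement a classifier would reach by chance given the class proportions.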