
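Identification of Protein-Coding Regions Based on Machine Learning
Cite this article: BAO Xiaona, HE Lili, CUI Jingan. Identification of protein-coding regions based on machine learning [J]. 生物信息学, 2023, 21(4): 270-276.
Authors: BAO Xiaona, HE Lili, CUI Jingan
Affiliation: School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616
Abstract: To address the problem of identifying coding regions in DNA sequences, this study proposes a model that combines feature vectors with logistic regression. Each DNA sequence is first converted numerically into a feature vector, whose elements are extracted using k-character relative frequencies. A binary logistic regression classifier then separates coding regions from non-coding regions. Five-fold cross-validation on two benchmark data sets, HMR195 and BG570, gave mean AUC (Area Under Curve) values of 0.9813 and 0.9874, respectively, clearly better than the traditional Bayesian discriminant method, VOSSDFT, and other methods. In addition, the proposed feature vector has a very low dimension, which improves computational efficiency. The combined model can therefore identify protein-coding regions fairly efficiently and accurately.

Keywords: coding region; feature vector; logistic regression; machine learning
Received: 2022/6/9
Revised: 2022/10/27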
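
The abstract mentions k-character relative frequencies without giving the exact construction, so the following is only a minimal sketch of one common reading: count every overlapping length-k word in a sequence and divide by the number of valid windows. The function name `kmer_relative_frequencies`, the default `k=2`, and the rule for skipping ambiguous bases are assumptions made for illustration; they are not taken from the paper, whose feature vector may be built differently.

```python
from itertools import product


def kmer_relative_frequencies(seq, k=2, alphabet="ACGT"):
    """Return relative frequencies of all length-k words over `alphabet`.

    The result has len(alphabet) ** k entries, ordered lexicographically
    by the order of `alphabet` (AA, AC, AG, AT, CA, ... for k=2).
    """
    seq = seq.upper()
    words = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    # Slide a window of width k over the sequence, one base at a time.
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        # Assumption: windows containing symbols outside the alphabet
        # (for example N) are skipped rather than counted.
        if word in counts:
            counts[word] += 1
            total += 1
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]
```

Calling `kmer_relative_frequencies("ATGATG", k=2)` yields a 16-element vector that sums to 1. Small k keeps the dimension low (4, 16, or 64 entries for k = 1, 2, 3), which is in the spirit of the low-dimensional feature vector the abstract emphasizes.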

Identification of protein coding region based on machine learning
BAO Xiaona, HE Lili, CUI Jingan. Identification of protein coding region based on machine learning[J]. China Journal of Bioinformation, 2023, 21(4): 270-276.
Authors: BAO Xiaona, HE Lili, CUI Jingan
Institution:School of Science, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
Abstract: To identify the coding regions of DNA sequences, this article proposes a model that combines feature vectors with logistic regression. First, each DNA sequence is converted numerically into a feature vector, whose elements are extracted using k-character relative frequencies. A binary logistic regression classifier is then used to distinguish coding regions from non-coding regions. Two benchmark data sets, HMR195 and BG570, were used for five-fold cross-validation. The mean AUC (Area Under Curve) values were 0.9813 and 0.9874, respectively, clearly better than those of the traditional Bayesian discriminant method, VOSSDFT, and other methods. In addition, the feature vector proposed here has a very low dimension, which improves computational efficiency. The combined model can therefore identify protein-coding regions fairly efficiently and accurately.
Keywords:Protein coding region  Feature vector  Logistic regression  Machine learning
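
The evaluation described in the abstract (binary logistic regression scored by mean AUC over five folds) can be sketched as follows. This is a standard-library illustration under stated assumptions, not the authors' implementation: the helper names, the plain batch gradient descent, the learning rate and epoch count, and the stratified fold assignment are all choices made here for the example. Any feature vectors can be passed in as `X`, with label 1 for coding and 0 for non-coding.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Predicted probability of class 1 for each row of X."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def cross_validated_auc(X, y, folds=5, seed=0):
    """Mean AUC over stratified folds.

    Assumption: each class has at least `folds` samples, so every test
    fold contains both classes and the AUC is defined.
    """
    rng = random.Random(seed)
    pos = [i for i, t in enumerate(y) if t == 1]
    neg = [i for i, t in enumerate(y) if t == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)
    aucs = []
    for f in range(folds):
        test = pos[f::folds] + neg[f::folds]
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        w, b = fit_logistic([X[i] for i in train], [y[i] for i in train])
        scores = predict_proba([X[i] for i in test], w, b)
        aucs.append(auc([y[i] for i in test], scores))
    return sum(aucs) / folds
```

The pairwise AUC used here is quadratic in the number of samples, which is fine for a small illustration; for data sets of realistic size a rank-based computation or an established library routine would be the more practical choice.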