
A Web Service-Based Automated Translation and Annotation System for Influenza Virus Genomes
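Cite this article: CHEN Cuixia, YANG Lei, JIANG Taijiao, WANG Xiaolong, CAO Zongfu, LI Tianjun, YU Lei, GAO Huafang, MA Xu. A Web Service-Based Automated Translation and Annotation System for Influenza Virus Genomes [J]. Chinese Journal of Virology, 2021, 37(2): 309-317.
Authors: CHEN Cuixia  YANG Lei  JIANG Taijiao  WANG Xiaolong  CAO Zongfu  LI Tianjun  YU Lei  GAO Huafang  MA Xu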
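Affiliations: Center of Human Genetic Resources, National Research Institute for Family Planning, Beijing 100081; Bioinformatics Group, National Center of Human Genetic Resources, Beijing 102206; Bioinformatics Group, Chinese National Influenza Center, Beijing 100055; Key Laboratory of Protein and Peptide Drugs, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101; 7th Department, Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100100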
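Funding: National Key Research and Development Program of China, 13th Five-Year Plan (grant no. 2016YFC1000307), project title: Construction and Application Demonstration of a Big Data Platform for Reproductive Genetic Resources and Reproductive Health; Basic Research Fund of Central Public Welfare Research Institutes (grant no. 2018GJM06), project title: Genomic Analysis of HPV and Study of Its Virulence and Drug Resistance; Basic Research Fund of Central Public Welfare Research Institutes (grant no. 2020GJM05), project title: Development of Machine Learning Techniques for Intelligent and Precise Recommendation of Monogenic Disease Names; National Population and Reproductive Health Science Data Center (grant no. 2005DKA32408), project title: Remote Collaborative Service Network for Complex Genetic Diseases.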
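Abstract: With the rapid growth of influenza virus genome sequencing data, mining the biological information contained in these large datasets has become a research focus. Building an automated, integrated, information-based sequence database system on data describing the epidemic characteristics of influenza viruses in China has important practical value for the rapid batch translation, annotation, storage, querying and analysis of influenza virus genomes. By integrating a series of software packages and toolkits with additional functions developed in-house, and by adding fine-grained screening rules for selecting the best-matching translation and annotation on top of two key reference datasets maintained at the base layer, our group built an automated system that provides influenza virus genome storage, automated translation, precise protein sequence annotation, homologous sequence alignment and phylogenetic tree analysis. The results show that, given influenza virus gene sequences submitted in fasta format through the Web interface, the system runs a Blast homology search against the reference segment dataset (blastdb.fasta) to identify the virus type (A, B or C), subtype and gene segment (segments 1 to 8). On this basis, it queries the reference gene segment dataset used for translation and annotation in the underlying database to obtain a set of peptides, and then calls ProSplign in a loop to generate predictions against them. Applying the fine-grained screening and admission rules, the system selects the translation product that best matches the input sequence as its predicted protein and outputs files in common formats such as gbk, asn and fasta, together with the sequence length, whether the sequence is full length, and the virus type, subtype and segment. Building on this work, we also developed additional functions in-house, such as phylogenetic tree analysis and display and genome data storage, to form a Web service-based automated translation and annotation system for influenza virus genomes. This study suggests that the system, which tightly integrates a series of software programs with its own annotation and translation database files, covers the workflow from sequence storage, translation and annotation to sequence analysis and display, and can fully meet China's needs for shared, localized, integrated and automated handling of high-throughput genetic testing data.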

Keywords: Influenza virus  Key datasets  Translation  Annotation  Integrated sequence database

Construction of an Automatic Translation and Annotation System for Influenza Virus Genomes Based on an Internet Service
CHEN Cuixia, YANG Lei, JIANG Taijiao, WANG Xiaolong, CAO Zongfu, LI Tianjun, YU Lei, GAO Huafang, MA Xu. Construction of an Automatic Translation and Annotation System for Influenza Virus Genomes Based on an Internet Service [J]. Chinese Journal of Virology, 2021, 37(2): 309-317.
Authors: CHEN Cuixia  YANG Lei  JIANG Taijiao  WANG Xiaolong  CAO Zongfu  LI Tianjun  YU Lei  GAO Huafang  MA Xu
Institution: (Bioinformatics Group, Center of Human Genetic Resources, National Research Institute for Family Planning, Beijing 100081, China; Bioinformatics Department, National Center of Human Genetic Resources, Beijing 102206, China; Bioinformatics Group, Chinese National Influenza Center, Beijing 100055, China; Key Laboratory of Protein and Peptide Drugs, Institute of Biophysics, Chinese Academy of Sciences, Beijing 100101, China; 7th Department, Institute of Electronics, Chinese Academy of Sciences, Beijing 100100, China)
Abstract: To mine the "big data" produced by next-generation sequencing of influenza viruses, and to translate and annotate gene sequences, we developed a system based on an Internet service. Genome translation and annotation were implemented by integrating ProSplign software and the Jalview toolkit, which allowed us to verify and predict the protein sequences encoded by the input influenza virus sequences. The system maintains two key reference datasets. The first (Set A, blastdb.fasta) is used in homology searches to identify the type, subtype and gene segment of an influenza virus. The second (Set B) supplies the input for repeated calls to ProSplign, which perform the translation and annotation. Fine-grained filtering rules then select the best-matching translation and annotation for each input sequence. The system also provides additional functions, such as information storage, phylogenetic tree analysis and automatic translation. Results showed that, for an input file in fasta format submitted over the Internet, the system runs the Blast program against Set A and retrieves subject sequences annotated with the type (A, B or C), subtype and gene segment (1–8), thereby identifying the influenza virus. Based on this result, a set of peptides is obtained by querying Set B, and ProSplign is called in a loop over these peptides. The selection criteria then pick, from the resulting list of predictions, the best match to the input sequence as its predicted protein. Output files are written in common formats such as gbk, asn and fasta, and report the sequence length, whether the sequence is full length, and the virus type, subtype and segment. The present study suggests that the system tightly integrates a series of software programs with its own translation and annotation datasets, and that it supports sequence storage, translation, annotation, analysis and display. Therefore, our system fully meets the needs of sharing, localization, integration and automation of high-throughput sequencing data in China.
Keywords: Influenza virus  Key data sets  Translation  Annotation  Integrated sequence system
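The two decision steps described in the abstract (identifying type, subtype and segment from the best Blast hit, then choosing the best translation product among the ProSplign predictions) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not publish the filtering rules, so the record fields, the `type|subtype|segment|accession` header layout, the 0.9 coverage threshold and the function names `classify` and `pick_best` are all assumptions made for illustration. The sketch works on already-parsed results and does not call Blast or ProSplign.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class BlastHit:
    # Hypothetical header layout for reference entries:
    # "type|subtype|segment|accession", e.g. "A|H3N2|4|ref2".
    subject_id: str
    identity: float   # percent identity of the alignment
    bitscore: float   # alignment bit score


@dataclass(frozen=True)
class Candidate:
    # One predicted protein, produced by aligning a reference peptide
    # to the input nucleotide sequence (fields are assumptions).
    reference: str        # name of the reference peptide
    identity: float       # percent identity to the reference peptide
    coverage: float       # fraction of the reference covered, 0..1
    internal_stop: bool   # True if the product has a premature stop


def classify(hits: Iterable[BlastHit]) -> Optional[dict]:
    """Return type, subtype and segment taken from the top-scoring hit."""
    hits = list(hits)
    if not hits:
        return None
    # Rank by bit score, breaking ties by percent identity.
    best = max(hits, key=lambda h: (h.bitscore, h.identity))
    virus_type, subtype, segment, _accession = best.subject_id.split("|", 3)
    return {"type": virus_type, "subtype": subtype, "segment": int(segment)}


def pick_best(candidates: Iterable[Candidate],
              min_coverage: float = 0.9) -> Optional[Candidate]:
    """Apply an assumed admission rule, then keep the closest match."""
    usable = [
        c for c in candidates
        if not c.internal_stop and c.coverage >= min_coverage
    ]
    if not usable:
        return None
    # Among admitted candidates, prefer higher identity, then coverage.
    return max(usable, key=lambda c: (c.identity, c.coverage))
```

Under these assumptions, `classify` would be fed the parsed hits of the search against blastdb.fasta, and `pick_best` the predictions collected from the loop over reference peptides. Filtering before ranking keeps a truncated but highly similar product from displacing a full-length one.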
This article is indexed in CNKI, VIP, Wanfang Data and other databases.