Similar literature
 Found 20 similar articles; search took 203 ms
1.
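Accurate annotation of operon structure in prokaryotes is important for studying gene function and gene regulatory networks, and computational prediction by bioinformatics methods is currently the main source of operon annotation for genomes. Most current prediction algorithms require experimentally confirmed operons as a training set, but the scarcity of such data has long been a bottleneck for algorithm development. Building on what is known about operon structure, we started from features such as intergenic distance, regulatory signals related to transcription and translation, and COG functional annotation to build a probabilistic model describing the complex structure of operons, and we propose an iterative self-learning algorithm that does not depend on operon data from a specific species as a training set. Tests and comparisons on experimentally verified operon datasets show that the algorithm is very effective for predicting operon structure. Without relying on any known operon information, the algorithm exceeds the best current operon prediction methods in overall prediction performance, and this self-learning algorithm outperforms algorithms trained on a specific species. These properties make the algorithm applicable to newly sequenced species, which distinguishes it from commonly used operon prediction methods. A large-scale comparative analysis of bacterial and archaeal genomes further improved the understanding of the general features and species specificity of operon structure in genomes.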
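
The abstract above lists intergenic distance as one of the features of its operon model. As a rough illustration of that single feature only, the sketch below groups adjacent same-strand genes whose gap is below a threshold. It is not the probabilistic, self-learning algorithm the abstract describes; the `Gene` record, the function name `group_by_distance` and the 50 bp default are assumptions made for this example.

```python
from typing import List, NamedTuple


class Gene(NamedTuple):
    """Hypothetical gene record; coordinates are 1-based and inclusive."""
    name: str
    start: int
    end: int
    strand: str  # "+" or "-"


def group_by_distance(genes: List[Gene], max_gap: int = 50) -> List[List[str]]:
    """Group adjacent same-strand genes whose intergenic gap is <= max_gap bp.

    The 50 bp default is an illustrative assumption, not a value from the abstract.
    """
    ordered = sorted(genes, key=lambda g: g.start)
    groups: List[List[str]] = []
    prev = None
    for gene in ordered:
        same_strand = prev is not None and gene.strand == prev.strand
        # Gap = number of bases strictly between the two genes (negative if they overlap).
        if same_strand and gene.start - prev.end - 1 <= max_gap:
            groups[-1].append(gene.name)
        else:
            groups.append([gene.name])
        prev = gene
    return groups


example_genes = [
    Gene("a", 1, 900, "+"),
    Gene("b", 920, 1800, "+"),   # 19 bp after a, same strand
    Gene("c", 2400, 3000, "+"),  # 599 bp after b
    Gene("d", 3010, 3900, "-"),  # close to c but on the other strand
]
print(group_by_distance(example_genes))
```

A distance-only rule like this is a common baseline; the model in the abstract additionally uses regulatory signals and COG annotation.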

2.
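Identifying translation initiation sites in eukaryotic DNA with support vector machines   Total citations: 2, self-citations: 1, citations by others: 1
Identification of translation initiation sites (TIS) is one of the key steps in eukaryotic gene prediction and has received considerable attention from researchers in recent years. Several discriminant methods for identifying TIS have been developed from the statistical properties of sequences near the TIS, but recognition accuracy still needs to be improved. To address shortcomings of the conventional support vector machine (SVM) approach, an SVM based on data optimization is proposed, which optimizes the training dataset with other statistical models and thereby improves the accuracy of the classifier. Experimental results show that the data-optimized SVM classifier identifies translation initiation sites better than other discriminant methods.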

3.
刘林梦  温权  欧竑宇 《微生物学通报》2014,41(12):2583-2592
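[Objective] To identify ncRNA genes in completely sequenced bacterial genomes, three commonly used ncRNA prediction tools, sRNAPredict, PORTRAIT and sRNAscanner, were evaluated. [Methods] Nine bacterial genomes with more than 30 known ncRNA genes recorded in the bacterial ncRNA database BSRD were selected and classified by genomic G+C content, and the prediction accuracy of sRNAPredict and PORTRAIT was compared. Sequence features of the transcription start and termination regions of ncRNA genes in genomes of different G+C content were extracted and used to evaluate sRNAscanner predictions. [Results] sRNAPredict had higher specificity and a higher positive detection rate than PORTRAIT for bacterial ncRNA genes, but lower sensitivity; the performance of both tools varied markedly with genomic G+C content. The sequence features of ncRNA gene promoter and terminator regions differed markedly among bacterial genomes of different G+C content. Using these sequence features improved the average performance of sRNAscanner in predicting ncRNA genes. [Conclusion] The performance of the three ncRNA gene prediction tools varies with genomic G+C content. Features of the transcription start and termination regions of ncRNA genes in genomes of different G+C content can serve as important parameters for ncRNA gene prediction.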
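
The comparison above rests on three standard evaluation measures. The sketch below shows how they are usually computed from confusion-matrix counts, assuming that the abstract's "positive detection rate" corresponds to the positive predictive value; the abstract does not give its exact formulas, and the counts in the example are invented for illustration.

```python
def sensitivity(tp: int, fn: int) -> float:
    """Fraction of true ncRNA genes that the tool recovers."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """Fraction of true negatives that the tool correctly rejects."""
    return tn / (tn + fp)


def positive_predictive_value(tp: int, fp: int) -> float:
    """Fraction of predictions that are real ncRNA genes (assumed meaning of 'positive detection rate')."""
    return tp / (tp + fp)


# Hypothetical counts, not taken from the study.
tp, fp, fn, tn = 8, 2, 4, 86
print(sensitivity(tp, fn), specificity(tn, fp), positive_predictive_value(tp, fp))
```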

4.
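With the rapid growth of influenza virus genome sequencing data, mining the biological information contained in this large body of data has become a research focus. Building an automated, integrated, information-based sequence database system on the basis of data on the epidemic characteristics of influenza viruses in China is of practical value for batch, rapid translation, annotation, storage, querying and analysis of influenza virus genomes. By integrating a series of software tools and packages, combined with other functions developed in-house, and by adding refined screening rules for the best match of translation and annotation information on top of two key reference datasets maintained at the base level, our group built an automated system that supports influenza virus genome information storage, automated translation, accurate protein sequence annotation, homologous sequence alignment and phylogenetic tree analysis. The results show that, when influenza virus gene sequences in fasta format are submitted through the Web interface, the system runs a Blast homology search against the reference sequence segment dataset (blastdb.fasta) and identifies the virus type (A, B or C), subtype and gene segment (segments 1~8). On this basis, a set of peptides is obtained by querying the base-level reference dataset of gene segments used for translation and annotation, and ProSplign is then called in a loop to make predictions with them. Combined with the refined screening and admission rules, the translated product that best matches the input sequence is selected as the predicted protein for that sequence and output as files in common formats such as gbk, asn and fasta, together with information such as sequence length, whether the sequence is full length, virus type, subtype and segment. Building on this work, additional functions such as phylogenetic tree analysis and display and genome data storage were developed in-house, resulting in a Web-service-based automated translation and annotation system for influenza virus genomes. This study suggests that the system, which tightly integrates a series of software tools with its own annotation and translation database files, covers the workflow from sequence storage, translation and annotation to sequence analysis and display, and can fully meet China's needs for shared, localized, integrated and automated high-throughput genetic testing data.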

5.
Correlation analysis of local GC levels in human protein-coding genes   Total citations: 2, self-citations: 0, citations by others: 2
陈祥贵  胡军  杨潇 《遗传》2008,30(9):1169-1174
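GC content is an important feature of the base composition of genomic DNA sequences and carries information on gene structure, function and evolution. In this paper, DNA sequences of 7,992 non-redundant human protein-coding genes were extracted from public databases, and the local GC content of different gene regions and the correlations among them were analyzed. The results show that local GC content is heterogeneous within genes: the 5′ untranslated region has the highest GC level, 62.56%, and the 3′ untranslated region the lowest, 43.97%. The GC content of the 3′ flanking sequence is a good proxy for the GC level of the long DNA segment in which the gene lies. Although the GC content of the open reading frame is higher than that of introns, the 3′ untranslated region and the 3′ flanking sequence, the GC contents of these four regions are highly correlated with one another. The mean GC content at the third codon position (GC3) is 58.09%, significantly higher than at the first and second codon positions, and is highly correlated with the GC level of the open reading frame, with a correlation coefficient of 0.91. GC3 is also fairly strongly correlated with the GC levels of introns, the 3′ untranslated region and the 3′ flanking sequence, and the slope of the linear regression of GC3 on 3′ flanking sequence GC content is 1.25. GC3 can therefore serve as a sensitive indicator of changes in the GC level of the region in which a gene lies. In contrast, the GC levels of the first and second codon positions, the 5′ flanking sequence and the 5′ untranslated region correlate only weakly with those of other gene regions. These results suggest that bases at the third codon position of protein-coding regions, in introns, in the 3′ untranslated region and in the 3′ flanking sequence may have undergone similar evolutionary processes, whereas the first and second codon positions of protein-coding regions, the 5′ flanking sequence and the 5′ untranslated region have undergone different mutation and selection because of functional requirements.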
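
Since the abstract above compares overall GC content with GC content at the third codon position (GC3), a minimal sketch of how the two quantities can be computed for one coding sequence may help. The function names and the toy sequence are assumptions for illustration; the study's own pipeline is not described in the abstract.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def gc3_fraction(cds: str) -> float:
    """GC fraction at third codon positions of an in-frame coding sequence."""
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a non-zero multiple of 3")
    # Third codon positions are indices 2, 5, 8, ...
    return gc_fraction(cds[2::3])


toy_cds = "ATGGCCGAATAA"  # codons: ATG GCC GAA TAA
print(gc_fraction(toy_cds))   # 5 of 12 bases are G or C
print(gc3_fraction(toy_cds))  # third positions are G, C, A, A
```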

6.
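Salvia japonica is a perennial herb of the genus Salvia in the family Labiatae with considerable medicinal and economic value. In this study, the chloroplast genome of S. japonica was sequenced on the Illumina Hiseq second-generation sequencing platform, and the complete chloroplast genome sequence was assembled using the chloroplast genome of the closely related species danshen (丹参, Salvia miltiorrhiza) as a reference. The results show that the S. japonica chloroplast genome is 153,995 bp long and has a typical quadripartite structure, with an LSC region of 84,573 bp, an SSC region of 19,874 bp and two IR regions of 24,774 bp each; 13 groups of chloroplast genes were successfully annotated, and the gene types, gene numbers and GC content are similar to those of other Labiatae species. These results enrich the chloroplast genome data for the genus Salvia and provide baseline data for future reconstruction of phylogenetic relationships within the genus.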
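
As a quick consistency check on the region lengths reported above, the quadripartite structure implies that the total length equals LSC + SSC + 2 × IR. The snippet below is only that arithmetic; the function name is an assumption for this example.

```python
# Region lengths in bp as reported in the abstract.
LSC = 84_573
SSC = 19_874
IR = 24_774  # length of each of the two inverted repeats


def quadripartite_total(lsc: int, ssc: int, ir: int) -> int:
    """Total chloroplast genome length for one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


print(quadripartite_total(LSC, SSC, IR))  # 153995, matching the reported total
```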

7.
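Viral genomic RNA was extracted with Trizol from purified picornavirus of the tea geometrid Ectropis obliqua (EoPV). After reverse transcription a poly(dT) tail was added, and the 5′ end of the genome was amplified by two-step PCR. After cloning and sequencing, the nucleotide sequence of the 5′ untranslated region was analyzed and found to share some features of the 5′ untranslated regions of mammalian picornaviruses: high A/T content and numerous AUGs and small cistrons upstream of the start codon. The secondary structure of the EoPV 5′ untranslated region was predicted with mfold; it contains four stem-loop structures and the conserved regions of mammalian internal ribosome entry sites (IRES), namely stem-loop A containing the conserved GNRA motif, the A/C-rich loop B, and a polypyrimidine tract. It is therefore inferred that translation of the EoPV genome is initiated by an IRES mechanism.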

8.
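The apidermin protein family is a novel family of insect structural cuticular proteins named after the honeybee cuticular proteins apidermin 1-3 (APD 1-3). To determine whether LOC727145, a predicted locus adjacent to the apd 1-3 gene cluster in the genome sequence of the western honeybee Apis mellifera, is a new apd gene, we located the transcription start sites (TSS) of the gene with 5′ LongSAGE tags and then, using three of the 5′ LongSAGE tag sequences as upstream primers, cloned its cDNA sequences by RT-PCR (GenBank accession numbers: GU358197, GU358199, GU358198). Bioinformatic analysis showed that locus LOC727145 contains two exons and one "GT-AG" intron; its cDNA sequence is GC-rich (70%) and encodes a highly hydrophobic polypeptide of 152 aa residues. The amino acid composition of this polypeptide resembles that of the honeybee APD 1-3 cuticular proteins: it is rich in five amino acids, Ala, Gly, Pro, Leu and Val (77% in total), with Ala the most abundant (29%). The polypeptide shares 50% sequence similarity with the honeybee APD-1 cuticular protein, and its predicted N-terminal signal peptide resembles the signal peptides of APD proteins. Genomic mapping of the 5′ LongSAGE tags showed that LOC727145 is highly expressed in the drone head and that RNA Pol II can initiate transcription from six different TSSs with different efficiencies, with one dominant TSS accounting for 90% of transcription initiation. This study adds a new member to the apidermin cuticular protein family, named apidermin-like (apd-like).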

9.
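Based on an analysis of synonymous codon preference, the sequence features of protein-coding genes on the large and small chromosomes and plasmids of 54 prokaryotic genomes were compared. The results show that the GC content distributions of protein-coding genes on large and small chromosomes are similar, whereas the GC content distribution of plasmid protein-coding genes differs considerably from the whole-genome GC content of the host species. Further analysis shows that large and small chromosomes share the largest number of preferred codons and have similar start codon and stop codon usage. Analysis of synonymous codon usage patterns by correspondence analysis shows that large and small chromosomes have similar sequence features, and that the influencing factors are not entirely the same among large chromosomes, small chromosomes and plasmids. These results may provide reliable methods and a theoretical basis for future studies of prokaryotic genome evolution.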

10.
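Computational prediction of intrinsic transcription terminators is an important part of research on transcriptional regulation, but the prediction specificity of current methods is low. Based on an in-depth analysis of characteristic signals of Escherichia coli intrinsic terminators, such as the RNA hairpin structure and the poly-thymine region, a feature set of five variables capturing sequence composition, local conformation and energy distribution was built for intrinsic terminators, and a support vector machine-based method for their computational prediction was implemented on this feature set. Six-fold cross-validation on an E. coli intrinsic terminator dataset and a negative control set from coding regions confirmed the effectiveness of the method, with an average prediction accuracy of 99.4% on known data. In a search over a restricted range of the whole E. coli genome, the method successfully identified the vast majority of known intrinsic terminators; compared with several other commonly used methods, the total number of predictions was greatly reduced and prediction specificity was clearly improved.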

11.

Background  

Although it is not difficult for state-of-the-art gene finders to identify coding regions in prokaryotic genomes, exact prediction of the corresponding translation initiation sites (TIS) is still a challenging problem. Recently a number of post-processing tools have been proposed for improving the annotation of prokaryotic TIS. However, inherent difficulties of these approaches arise from the considerable variation of TIS characteristics across different species. Therefore prior assumptions about the properties of prokaryotic gene starts may cause suboptimal predictions for newly sequenced genomes with TIS signals differing from those of well-investigated genomes.

12.
13.
MOTIVATION: Tightly packed prokaryotic genes frequently overlap with each other. This feature, rarely seen in eukaryotic DNA, makes detection of translation initiation sites and, therefore, exact predictions of prokaryotic genes notoriously difficult. Improving the accuracy of precise gene prediction in prokaryotic genomic DNA remains an important open problem. RESULTS: A software program implementing a new algorithm utilizing a uniform Hidden Markov Model for prokaryotic gene prediction was developed. The algorithm analyzes a given DNA sequence in each of six possible global reading frames independently. Twelve complete prokaryotic genomes were analyzed using the new tool. The accuracy of gene finding, predicting locations of protein-coding ORFs, as well as the accuracy of precise gene prediction, and detecting the whole gene including translation initiation codon were assessed by comparison with existing annotation. It was shown that in terms of gene finding, the program performs at least as well as the previously developed tools, such as GeneMark and GLIMMER. In terms of precise gene prediction the new program was shown to be more accurate, by several percentage points, than earlier developed tools, such as GeneMark.hmm, ECOPARSE and ORPHEUS. The results of testing the program indicated the possibility of systematic bias in start codon annotation in several early sequenced prokaryotic genomes. AVAILABILITY: The new gene-finding program can be accessed through the Web site: http://dixie.biology.gatech.edu/GeneMark/fbf.cgi CONTACT: mark@amber.gatech.edu.
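
The abstract above mentions analyzing a sequence in each of the six reading frames independently. The sketch below illustrates only that framing idea with a plain ATG-to-stop ORF scan; it is not the Hidden Markov Model the abstract describes. The function names, the ATG-only start rule and the `min_codons` parameter are assumptions for this example.

```python
from typing import Dict, List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq: str, frame: int, min_codons: int = 2) -> List[Tuple[int, int]]:
    """Return 0-based half-open (start, end) spans of ATG...stop ORFs in one frame.

    min_codons counts every codon in the span, including the start and stop codons.
    """
    spans = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # first ATG after the previous stop opens the ORF
        elif start is not None and codon in STOPS:
            if (i + 3 - start) // 3 >= min_codons:
                spans.append((start, i + 3))
            start = None
    return spans


def six_frame_orfs(seq: str) -> Dict[str, List[Tuple[int, int]]]:
    """Scan the three forward and three reverse frames independently.

    Spans for the '-' frames are given in reverse-complement coordinates.
    """
    seq = seq.upper()
    rc = reverse_complement(seq)
    result = {}
    for frame in range(3):
        result[f"+{frame + 1}"] = orfs_in_frame(seq, frame)
        result[f"-{frame + 1}"] = orfs_in_frame(rc, frame)
    return result


print(six_frame_orfs("CCATGAAATAGGG"))
```

Treating each frame on its own is what lets overlapping genes in different frames be reported side by side, which is the situation the abstract highlights.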

14.
MOTIVATION: At present the computational gene identification methods in microbial genomes have a high prediction accuracy of verified translation termination site (3' end), but a much lower accuracy of the translation initiation site (TIS, 5' end). The latter is important to the analysis and the understanding of the putative protein of a gene and the regulatory machinery of the translation. Improving the accuracy of prediction of TIS is one of the remaining open problems. RESULTS: In this paper, we develop a four-component statistical model to describe the TIS of prokaryotic genes. The model incorporates several features with biological meanings, including the correlation between translation termination site and TIS of genes, the sequence content around the start codon, the sequence content of the consensus signal related to ribosomal binding sites (RBSs), and the correlation between TIS and the upstream consensus signal. An entirely non-supervised training system is constructed, which takes as input a set of annotated coding open reading frames (ORFs) by any gene finder, and gives as output a set of organism-specific parameters (without any prior knowledge or empirical constants and formulas). The novel algorithm is tested on a set of reliable datasets of genes from Escherichia coli and Bacillus subtilis. MED-Start may correctly predict 95.4% of the start sites of 195 experimentally confirmed E. coli genes, 96.6% of 58 reliable B. subtilis genes. Moreover, the test results indicate that the algorithm gives higher accuracy for more reliable datasets, and is robust to the variation of gene length. MED-Start may be used as a postprocessor for a gene finder. After processing by our program, the improvement of gene start prediction of gene finder system is remarkable, e.g. the accuracy of TIS predicted by MED 1.0 increases from 61.7 to 91.5% for 854 E. coli verified genes, while that by GLIMMER 2.02 increases from 63.2 to 92.0% for the same dataset. These results show that our algorithm is one of the most accurate methods to identify TIS of prokaryotic genomes. AVAILABILITY: The program MED-Start can be accessed through the website of CTB at Peking University: http://ctb.pku.edu.cn/main/SheGroup/MED_Start.htm.
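
To make the start-site problem above concrete, the sketch below enumerates in-frame candidate start codons upstream of a known stop codon and checks each for a Shine-Dalgarno-like motif shortly upstream. This is a naive illustration, not the four-component statistical model of MED-Start; the start codon set, the `GGAGG` motif, the 12 nt window and all function names are assumptions for this example.

```python
from typing import List

START_CODONS = ("ATG", "GTG", "TTG")
STOP_CODONS = ("TAA", "TAG", "TGA")


def candidate_starts(seq: str, stop_index: int) -> List[int]:
    """Walk upstream in frame from the stop codon at stop_index.

    Collects the positions of in-frame start codons until an in-frame
    stop codon or the beginning of the sequence is reached.
    """
    seq = seq.upper()
    candidates = []
    i = stop_index - 3
    while i >= 0:
        codon = seq[i:i + 3]
        if codon in STOP_CODONS:
            break
        if codon in START_CODONS:
            candidates.append(i)
        i -= 3
    return sorted(candidates)


def has_sd_motif(seq: str, start: int, motif: str = "GGAGG", window: int = 12) -> bool:
    """True if the motif occurs within `window` nt upstream of the candidate start.

    Motif and window are illustrative assumptions, not parameters from the abstract.
    """
    upstream = seq[max(0, start - window):start].upper()
    return motif in upstream


# Toy sequence: SD-like motif, spacer, ATG, AAA, GTG, CCC, stop.
toy = "GGAGG" + "AAA" + "ATG" + "AAA" + "GTG" + "CCC" + "TAA"
stop_at = len(toy) - 3
for pos in candidate_starts(toy, stop_at):
    print(pos, toy[pos:pos + 3], has_sd_motif(toy, pos))
```

In the toy sequence both ATG and GTG are in frame with the stop, and only the ATG has the motif inside the window, which mirrors why upstream signals help choose among candidate starts.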

15.
Predicting functions of proteins and alternatively spliced isoforms encoded in a genome is one of the important applications of bioinformatics in the post-genome era. Due to the practical limitation of experimental characterization of all proteins encoded in a genome using biochemical studies, bioinformatics methods provide powerful tools for function annotation and prediction. These methods also help minimize the growing sequence-to-function gap. Phylogenetic profiling is a bioinformatics approach to identify the influence of a trait across species and can be employed to infer the evolutionary history of proteins encoded in genomes. Here we propose an improved phylogenetic profile-based method which considers the co-evolution of the reference genome to derive the basic similarity measure, the background phylogeny of target genomes for profile generation and assigning weights to target genomes. The ordering of genomes and the runs of consecutive matches between the proteins were used to define phylogenetic relationships in the approach. We used Escherichia coli K12 genome as the reference genome and its 4195 proteins were used in the current analysis. We compared our approach with two existing methods and our initial results show that its predictions outperform both of them. In addition, we have validated our method using a targeted protein-protein interaction network derived from protein-protein interaction database STRING. Our preliminary results indicate that improvement in function prediction can be attained by using coevolution-based similarity measures and the runs on to the same scale instead of computing them in different scales. Our method can be applied at the whole-genome level for annotating hypothetical proteins from prokaryotic genomes.
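
The abstract above refers to phylogenetic profiles and to runs of consecutive matches between proteins across an ordered list of genomes. The sketch below shows one simple reading of those two ideas on binary presence/absence profiles. The exact similarity measure and weighting of the proposed method are not given in the abstract, so the functions and the example profiles here are assumptions for illustration.

```python
from typing import Sequence


def agreement_fraction(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of genomes in which two proteins are both present or both absent."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("profiles must be non-empty and of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def longest_matching_run(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest run of consecutive agreeing positions.

    The result depends on the genome ordering, which is why the abstract
    treats that ordering as part of the method.
    """
    if len(a) != len(b):
        raise ValueError("profiles must have equal length")
    best = run = 0
    for x, y in zip(a, b):
        if x == y:
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best


# Hypothetical profiles over 8 ordered genomes (1 = homolog present, 0 = absent).
profile_a = [1, 1, 0, 1, 0, 0, 1, 1]
profile_b = [1, 1, 0, 0, 0, 0, 1, 0]
print(agreement_fraction(profile_a, profile_b), longest_matching_run(profile_a, profile_b))
```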

16.
REGANOR     
With >1,000 prokaryotic genome sequencing projects ongoing or already finished, comprehensive comparative analysis of the gene content of these genomes has become viable. To allow for a meaningful comparative analysis, gene prediction of the various genomes should be as accurate as possible. It is clear that improving the state of genome annotation requires automated gene identification methods to cope with the influence of artifacts, such as genomic GC content. There is currently still room for improvement in the state of annotations. We present a web server and a database of high-quality gene predictions. The web server is a resource for gene identification in prokaryote genome sequences. It implements our previously described, accurate gene finding method REGANOR. We also provide novel gene predictions for 241 complete, or almost complete, prokaryotic genomes. We demonstrate how this resource can easily be utilised to identify promising candidates for currently missing genes from genome annotations with several examples. All data sets are available online. AVAILABILITY: The gene finding server is accessible via https://www.cebitec.uni-bielefeld.de/groups/brf/software/reganor/cgi-bin/reganor_upload.cgi. The server software is available with the GenDB genome annotation system (version 2.2.1 onwards) under the GNU general public license. The software can be downloaded from https://sourceforge.net/projects/gendb/. More information on installing GenDB and REGANOR and the system requirements can be found on the GenDB project page http://www.cebitec.uni-bielefeld.de/groups/brf/software/wiki/GenDBWiki/AdministratorDocumentation/GenDBInstallation

17.
18.
Zavala A  Naya H  Romero H  Sabbia V  Piovani R  Musto H 《Gene》2005,357(2):137-143
GC level is a key feature in prokaryotic genomes. Widely employed in evolutionary studies, new insights appear however limited because of the relatively low number of characterized genomes. Since public databases mainly comprise several hundreds of prokaryotes with a low number of sequences per genome, a reliable prediction method based on available sequences may be useful for studies that need a trustworthy estimation of whole genomic GC. As the analysis of completely sequenced genomes shows a great variability in distributional shapes, it is of interest to compare different estimators. Our analysis shows that the mean of GC values of a random sample of genes is a reasonable estimator, based on simplicity of the calculation and overall performance. However, usually sequences come from a process that cannot be considered as random sampling. When we analyzed two introduced sources of bias (gene length and protein functional categories) we were able to detect an additional bias in the estimation for some cases, although the precision was not affected. We conclude that the mean genic GC level of a sample of 10 genes is a reliable estimator of genomic GC content, showing comparable accuracy with many widely employed experimental methods.
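
The estimator discussed above, the mean GC value of a random sample of genes, is simple enough to sketch directly. The code below is a minimal illustration under the assumption of unweighted random sampling; the function names, the `seed` parameter and the toy gene list are not from the paper.

```python
import random
from statistics import mean
from typing import List, Optional


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def estimate_genomic_gc(genes: List[str], sample_size: int = 10,
                        seed: Optional[int] = None) -> float:
    """Mean genic GC of a random sample of genes, drawn without replacement."""
    if len(genes) < sample_size:
        raise ValueError("not enough genes for the requested sample size")
    rng = random.Random(seed)
    sample = rng.sample(genes, sample_size)
    # Unweighted mean: every sampled gene counts equally, regardless of length.
    return mean(gc_fraction(gene) for gene in sample)


# Toy input: five fully GC genes and five fully AT genes.
toy_genes = ["GGCC", "ATAT"] * 5
print(estimate_genomic_gc(toy_genes, sample_size=10, seed=0))
```

The abstract notes that real collections of sequences are rarely a true random sample, so an estimate obtained this way inherits whatever bias the available genes carry.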

19.
Insertion sequences (ISs) play a key role in prokaryotic genome evolution but are seldom well annotated. We describe a web application pipeline, ISsaga, that provides computational tools and methods for high-quality IS annotation. It uses established ISfinder annotation standards and permits rapid processing of single or multiple prokaryote genomes. ISsaga provides general prediction and annotation tools, information on genome context of individual ISs and a graphical overview of IS distribution around the genome of interest.

20.
The GC contents of 2670 prokaryotic genomes that belong to diverse phylogenetic lineages were analyzed in this paper. These genomes had GC contents that ranged from 13.5% to 74.9%. We analyzed the distance of base frequencies at the three codon positions, codon frequencies, and amino acid compositions across genomes with respect to the differences in the GC content of these prokaryotic species. We found that although the phylogenetic lineages were remote among some species, a similar genomic GC content forced them to adopt similar base usage patterns at the three codon positions, codon usage patterns, and amino acid usage patterns. Our work demonstrates that in prokaryotic genomes: a) base usage, codon usage, and amino acid usage change with GC content with a linear correlation; b) the distance of each usage has a linear correlation with the GC content difference; and c) GC content is more essential than phylogenetic lineage in determining base usage, codon usage, and amino acid usage. This work is exceptional in that we adopted intuitively graphic methods for all analyses, and we used these analyses to examine as many as 2670 prokaryotes. We hope that this work is helpful for understanding common features in the organization of microbial genomes.

