Similar literature
1.
Genome annotation projects can produce incorrect results if they are based on obsolete data or inappropriate models. We have developed an automatic re-annotation system that uses agents to perform repetitive tasks and reports the results to the user. These tasks involve BLAST searches on biological databases (GenBank) and the use of detection tools (Genemark and Glimmer) to identify new open reading frames. Several agents execute these tools and combine their results to produce a list of open reading frames that is sent back to the user. Our goal was to reduce the manual work by executing most tasks automatically with computational tools. A prototype was implemented and validated using the original annotated genomes of Mycoplasma pneumoniae and Haemophilus influenzae. The results reported by the system identify most of the new features present in the re-annotated versions of these genomes.

2.
3.
The accelerated rate of genomic sequencing has led to an abundance of completely sequenced genomes. Annotation of the open reading frames (ORFs) (i.e., gene prediction) in these genomes is an important task and is most often performed computationally based on features in the nucleic acid sequence. Using recent advances in proteomics, we set out to predict the set of ORFs for an organism based principally on expressed protein-based evidence. Using a novel search strategy, we mapped peptides detected in a whole-cell lysate of Mycoplasma pneumoniae onto a genomic scaffold and extended these "hits" into ORFs bound by traditional genetic signals to generate a "proteogenomic map". We were able to generate an ORF model for M. pneumoniae strain FH using proteomic data with a high correlation to models based on sequence features. Ultimately, we detected over 81% of the genomically predicted ORFs in M. pneumoniae strain M129 (the originally sequenced strain). We were also able to detect several new ORFs not originally predicted by genomic methods, various N-terminal extensions, and some evidence that would suggest that certain predicted ORFs are bogus. Some of these differences may be a result of the strain analyzed but demonstrate the robustness of protein analysis across closely related genomes. This technique is a cost-effective means to add value to genome annotation, and a prerequisite for proteome quantitation and in vivo interaction measures.

4.
Hepatitis E virus (HEV) is a major causative agent of acute hepatitis in developing countries. The Norway rat HEV genome consists of six open reading frames (ORFs), i.e., ORF1, ORF2, ORF3, ORF4, ORF5 and ORF6. The protein encoded by the additional reading frame ORF5 is attributed to the life cycle of rat HEV. The function of the ORF5 protein remains undetermined. Therefore, it is of interest to analyze the ORF5 protein for its physicochemical properties, primary structure, secondary structure, tertiary structure and functional characteristics using bioinformatics tools. Analysis of the ORF5 protein revealed it to be highly unstable and hydrophilic, with a basic pI. The ORF5 protein consisted mostly of Arg, Pro, Ser, Leu and Gly. The 3D structural homology model generated for the ORF5 protein showed a mixed α/β structural fold with a predominance of coils. Structural analysis revealed the presence of clefts, pores and a tunnel. These data will help in the sequence, structure and functional annotation of ORF5.

5.
Evaluation of annotation strategies using an entire genome sequence   Total citations: 2, self-citations: 0, cited by others: 2
MOTIVATION: Genome-wide functional annotation either by manual or automatic means has raised considerable concerns regarding the accuracy of assignments and the reproducibility of methodologies. In addition, a performance evaluation of automated systems that attempt to tackle sequence analyses rapidly and reproducibly is generally missing. In order to quantify the accuracy and reproducibility of function assignments on a genome-wide scale, we have re-annotated the entire genome sequence of Chlamydia trachomatis (serovar D), in a collaborative manner. RESULTS: We have encoded all annotations in a structured format to allow further comparison and data exchange and have used a scale that records the different levels of potential annotation errors according to their propensity to propagate in the database due to transitive function assignments. We conclude that genome annotation may entail a considerable number of errors, ranging from simple typographical errors to complex sequence analysis problems. The most surprising result of this comparative study is that automatic systems might perform as well as the teams of experts annotating genome sequences.

6.
Laboratories working with draft phase genomes have specific software needs, such as the unattended processing of hundreds of single scaffolds and subsequent sequence annotation. In addition, it is critical to follow the "movement" and the manual annotation of single open reading frames (ORFs) within the successive sequence updates. Even with finished genomes, regular database updates can lead to significant changes in the annotation of single ORFs. In functional genomics it is important to mine data and identify new genetic targets rapidly and easily. Often there is no need for sophisticated relational databases (RDB) that greatly reduce the system-independent access of the results. Another aspect is the internet dependency of most software packages. If users are working with confidential data, this dependency poses a security issue. GAMOLA was designed to handle the numerous scaffolds and changing contents of draft phase genomes in an automated process, and it stores the results for each predicted ORF in flatfile databases. In addition, annotation transfers, ORF designation tracking, Blast comparisons, and primer design for whole genome microarrays have been implemented. The software is available under the license of North Carolina State University. A website and a downloadable example are accessible at http://fsweb2.schaub.ncsu.edu/TRKwebsite/index.htm.

7.
Although the annotation of the complete genome sequence of Mycoplasma pneumoniae did not reveal a bacterial type I signal peptidase (SPase I), we showed experimentally that such an activity must exist in this bacterium by determining the N-terminus of the N-terminal gene product P40 of MPN142, formerly called the ORF6 gene. By combining mass spectrometry with a method for specifically sulfonating the free amino-terminal group of proteins, the cleavage site for a typical signal peptide was located between amino acids 25 and 26 of the P40 precursor protein. The experimental results were in agreement with the cleavage site predicted by computational methods, providing experimental confirmation for these theoretical analyses.

8.
Roseophage SIO1 is a lytic marine phage that infects Roseobacter SIO67, a member of the Roseobacter clade of near-shore alphaproteobacteria. Roseophage SIO1 was first isolated in 1989 and sequenced in 2000. We have re-sequenced and re-annotated the original isolate. Our current annotation could only assign functions to seven additional open reading frames, indicating that, despite the advances in bioinformatics tools and increased genomic resources, we are still far from being able to translate phage genomic sequences into biological functions. In 2001, we isolated four new strains of Roseophage SIO1 from California near-shore locations. The genomes of all four were sequenced and compared against the original Roseophage SIO1 isolated in 1989. A high degree of conservation was evident across all five genomes; comparisons at the nucleotide level yielded an average 97% identity. The observed differences were clustered in protein-encoding regions and were mostly synonymous. The one strain that was found to possess an expanded host range also showed notable changes in putative tail protein-coding regions. Despite the possibly rapid evolution of phage and the mostly uncharacterized diversity found in viral metagenomic data sets, these findings indicate that viral genomes such as the genome of SIO1-like Roseophages can be stably maintained over ecologically significant time and distance (i.e. over a decade and ∼50 km).

9.
Sequence and organization of barley yellow dwarf virus genomic RNA.   Total citations: 23, self-citations: 5, cited by others: 18
The nucleotide sequence of the genomic RNA of barley yellow dwarf virus, PAV serotype, was determined, except for the 5'-terminal base, and its genome organization deduced. The 5,677 nucleotide genome contains five large open reading frames (ORFs). The genes for the coat protein (1) and the putative viral RNA-dependent RNA polymerase were identified. The latter shows a striking degree of similarity to that of carnation mottle virus (CarMV). By comparison with corona- and retrovirus RNAs, it is proposed that a translational frameshift is involved in expression of the polymerase. An ORF encoding an Mr 49,797 protein (50K ORF) may be translated by in-frame readthrough of the coat protein stop codon. The coat protein, an overlapping 17K ORF, and a 3' 6.7K ORF are likely to be expressed via subgenomic mRNAs.

10.
The 3374 nucleotide sequence of RNA2 from the British PEBV strain SP5 has been determined. The RNA includes three open reading frames flanked by 5' and 3' noncoding regions of 509 and 480 nucleotides. The open reading frames specify coat protein, a 29.6K product homologous to the 29.1K product of TRV(TCM) RNA2 and a 23K product not homologous to any previously described protein. The homology demonstrated between the coat proteins of PRV, TRV and PEBV indicates a common evolutionary origin for these proteins. Upstream of each ORF are located sequences homologous to those with which subgenomic RNAs of other tobraviruses start. Subgenomic RNAs for the expression of the three ORFs may start at these points. On all five tobraviral RNA2 molecules sequenced to date, these sequences were found upstream of the coat protein ORF in association with a strongly-conserved potential secondary structural element. Similar potential structures were identified upstream of other tobraviral ORFs. These structures may contribute to the activity of the tobraviral subgenomic promoter.

11.
The proteins expressed by Francisella tularensis subsp. novicida U112 grown to midexponential phase were surveyed by nanoLC-tandem mass spectrometry (LC-MS/MS). To improve annotation of the genome and develop a technology to provide high-throughput analysis of the Francisella proteome in multiple conditions, we sought to establish a fast and simple analysis that would reduce as much as possible the false discovery rate. Our survey detected expression of 63.0% of the predicted proteome from the stable condition of growth in rich medium available at (www.francisella.org). On the basis of detection of essential proteins, we estimated coverage to be approximately 80% of the actual expressed proteome. This suggests that no less than 70% of the proteins could be expressed in this condition. This analysis revealed two previously unidentified protein coding open reading frames and validated 50% of the proteins annotated as hypothetical. On the basis of results of the screen to detect essential proteins, not all proteins expressed provide a measurable contribution to F.t. novicida growth in this condition. Comparison of this protein profile with other profiles previously published suggested that the genome size and number of genes involved in regulation have little effect on the number of proteins expressed in a given stable condition.

12.
As more and more complete bacterial genome sequences become available, the annotation of previously sequenced genomes may quickly become outdated. This is primarily due to the discovery and functional characterization of new genes. We have reannotated the recently published genome of Shewanella oneidensis with the following results: 51 new genes have been identified, and functional annotation has been added to 97 genes, including 15 new genes and 82 existing ones with previously unassigned function. The new genes were identified by predicting the protein coding regions using the HMM-based program GeneMark.hmm. The functional annotation came from subsequent comparison of the predicted gene products to the non-redundant protein database using BLAST and to the COG (Clusters of Orthologous Groups) database using COGNITOR.

13.
14.
EcoGene: a genome sequence database for Escherichia coli K-12   Total citations: 5, self-citations: 1, cited by others: 4
The EcoGene database provides a set of gene and protein sequences derived from the genome sequence of Escherichia coli K-12. EcoGene is a source of re-annotated sequences for the SWISS-PROT and Colibri databases. EcoGene is used for genetic and physical map compilations in collaboration with the Coli Genetic Stock Center. The EcoGene12 release includes 4293 genes. EcoGene12 differs from the GenBank annotation of the complete genome sequence in several ways, including (i) the revision of 706 predicted or confirmed gene start sites, (ii) the correction or hypothetical reconstruction of 61 frame-shifts caused by either sequence error or mutation, (iii) the reconstruction of 14 protein sequences interrupted by the insertion of IS elements, and (iv) predictions that 92 genes are partially deleted gene fragments. A literature survey identified 717 proteins whose N-terminal amino acids have been verified by sequencing. 12 446 cross-references to 6835 literature citations and s are provided. EcoGene is accessible at a new website: http://bmb.med.miami.edu/EcoGene/EcoWeb. Users can search and retrieve individual EcoGene GenePages or they can download large datasets for incorporation into database management systems, facilitating various genome-scale computational and functional analyses.

15.
SARS-CoV-2 genome annotation revealed the presence of 10 open reading frames (ORFs), of which the last one (ORF10) is positioned downstream of the N gene. It is a hypothetical gene, which was speculated to encode a 38 aa protein. This hypothetical protein does not share sequence similarity with any other known protein and cannot be associated with a function. While a role for ORF10 has been proposed, there is growing evidence that ORF10 is not a coding region. Here, we identified SARS-CoV-2 variants in which the ORF10 gene was prematurely terminated. The disease was not attenuated, and the transmissibility between humans was maintained. Also, in vitro, the strains replicated similarly to the related viruses with intact ORF10. Altogether, based on clinical observation and laboratory analyses, it appears that the ORF10 protein is not essential in humans. This observation further proves that ORF10 should not be treated as a protein-coding gene, and the genome annotations should be amended.

16.
Large-scale prokaryotic gene prediction and comparison to genome annotation   Total citations: 4, self-citations: 0, cited by others: 4
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. This makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.

17.
In this paper, we re-annotated the genome of Pyrobaculum aerophilum str. IM2, particularly for hypothetical ORFs. The annotation process includes three parts. Firstly and most importantly, 23 new genes, which were missed in the original annotation, were found by combining similarity search and ab initio gene finding approaches. Among these new genes, five have significant similarities with genes of known function and the rest have significant similarities with hypothetical ORFs contained in other genomes. Secondly, the coding potentials of the 1645 hypothetical ORFs were re-predicted by using 33 Z curve variables combined with the Fisher linear discrimination method. With an accuracy of 99.68%, 25 originally annotated hypothetical ORFs were recognized as non-coding by our method. Thirdly, 80 hypothetical ORFs were assigned potential functions by similarity search with the BLAST program. Re-annotation of the genome will benefit related research on this hyperthermophilic crenarchaeon. Also, the re-annotation procedure could be taken as a reference for other archaeal genomes. Details of the revised annotation are freely available at http://cobi.uestc.edu.cn/resource/paero/

18.
The complete nucleotide sequence of RNA beta from the type strain of barley stripe mosaic virus (BSMV) has been determined. The sequence is 3289 nucleotides in length and contains four open reading frames (ORFs) which code for proteins of Mr 22,147 (ORF1), Mr 58,098 (ORF2), Mr 17,378 (ORF3), and Mr 14,119 (ORF4). The predicted N-terminal amino acid sequence of the polypeptide encoded by the ORF nearest the 5'-end of the RNA (ORF1) is identical (after the initiator methionine) to the published N-terminal amino acid sequence of BSMV coat protein for 29 of the first 30 amino acids. ORF2 occupies the central portion of the coding region of RNA beta and ORF3 is located at the 3'-end. The ORF4 sequence overlaps the 3'-region of ORF2 and the 5'-region of ORF3 and differs in codon usage from the other three RNA beta ORFs. The coding region of RNA beta is followed by a poly(A) tract and a 238 nucleotide tRNA-like structure which are common to all three BSMV genomic RNAs.

19.
20.
M Price. Journal of Virology, 1992, 66(9): 5658-5661
Nucleotide sequence analysis of potato virus X (PVX) genomic RNA predicts five open reading frames (ORFs). Previous analysis of total RNAs from PVX-infected leaf tissue suggested that six subgenomic RNAs are synthesized during infection. However, the proteins encoded by the genomic RNA, the subgenomic RNAs, or the predicted ORFs have not been identified in vivo. To characterize the coding properties of the viral RNA, particularly to determine whether the five predicted ORFs function in vivo, total protein extracts prepared from PVX-infected leaf tissue were analyzed by using antibodies raised against virus-specific synthetic peptides and against the virus capsid protein. Dot blot analyses showed that these antibodies reacted to PVX-infected extracts, indicating in vivo expression of the five predicted ORFs. In addition, Western blot (immunoblot) analysis of the extracts showed that ORF 1, 2, 3, and 4 peptide antisera and coat protein antiserum detect predominantly a single protein.
