Similar articles
 Found 20 similar articles; search time: 946 ms
1.
Full-length messenger RNA sequences greatly improve genome annotation   Total citations: 3, self-citations: 0, citations by others: 3
Haas BJ  Volfovsky N  Town CD  Troukhan M  Alexandrov N  Feldmann KA  Flavell RB  White O  Salzberg SL. Genome biology, 2002, 3(6): research0029.1-research002912

2.
The goal of the Gene Ontology (GO) project is to provide a uniform way to describe the functions of gene products from organisms across all kingdoms of life and thereby enable analysis of genomic data. Protein annotations are either based on experiments or predicted from protein sequences. Since most sequences have not been experimentally characterized, most available annotations need to be based on predictions. To make inferences as accurate as possible, the GO Consortium's Reference Genome Project is using an explicit evolutionary framework to infer annotations of proteins from a broad set of genomes from experimental annotations in a semi-automated manner. Most components in the pipeline, such as selection of sequences, building multiple sequence alignments and phylogenetic trees, retrieving experimental annotations and depositing inferred annotations, are fully automated. However, the most crucial step in our pipeline relies on software-assisted curation by an expert biologist. This curation tool, the Phylogenetic Annotation and INference Tool (PAINT), helps curators to infer annotations among members of a protein family. PAINT allows curators to make precise assertions as to when functions were gained and lost during evolution and to record the evidence (e.g. experimentally supported GO annotations and phylogenetic information including orthology) for those assertions. In this article, we describe how we use PAINT to infer protein function in a phylogenetic context, with emphasis on its strengths, limitations and guidelines. We also discuss specific examples showing how PAINT annotations compare with those generated by other widely used homology-based methods.

3.
MOTIVATION: Any development of new methods for automatic functional annotation of proteins according to their sequences requires high-quality data (as benchmark) as well as tedious preparatory work to generate sequence parameters required as input data for the machine learning methods. Different program settings and incompatible protocols make a comparison of the analyzed methods difficult. RESULTS: The MIPS Bacterial Functional Annotation Benchmark dataset (MIPS-BFAB) is a new, high-quality resource comprising four bacterial genomes manually annotated according to the MIPS functional catalogue (FunCat). These resources include precalculated sequence parameters, such as sequence similarity scores, InterPro domain composition and other parameters that could be used to develop and benchmark methods for functional annotation of bacterial protein sequences. These data are provided in XML format and can be used by scientists who are not necessarily experts in genome annotation. AVAILABILITY: BFAB is available at http://mips.gsf.de/proj/bfab

4.
5.
We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of the remaining 43,719 non-transposable-element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org.
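The counts quoted in this abstract are internally consistent, which a few lines of arithmetic can confirm. The sketch below only re-derives figures stated in the abstract; the variable names are illustrative, and the reading that each of the 2,538 alternatively spliced genes already contributes one model to the 57,915 genes is an assumption (it happens to reproduce the reported total of gene models).

```python
# Figures quoted in the Osa1 release 3 abstract.
total_genes = 57_915
te_related = 14_196
putative_function = 18_545
expressed_unknown = 5_777
hypothetical = 19_397
splice_forms = 5_873
spliced_genes = 2_538

# Genes that are not related to transposable elements.
non_te = total_genes - te_related


def share(count, whole):
    """Return count as a percentage of whole, rounded to one decimal place."""
    return round(100 * count / whole, 1)


shares = {
    "putative function": share(putative_function, non_te),
    "expressed, unknown function": share(expressed_unknown, non_te),
    "hypothetical": share(hypothetical, non_te),
}

# Assumption: the 5,873 splice forms include one model per spliced gene that
# is already counted among the 57,915 genes, so only the surplus is added.
gene_models = total_genes + (splice_forms - spliced_genes)

print(non_te, shares, gene_models)
```

Under that assumption the three functional classes sum to the non-transposable-element gene count and the percentages match the rounded values in the abstract.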

6.

Background

The yaws treponemes, Treponema pallidum ssp. pertenue (TPE) strains, are closely related to the syphilis-causing strains of Treponema pallidum ssp. pallidum (TPA). Yaws and syphilis are distinguished on the basis of epidemiological characteristics, clinical symptoms, and several genetic signatures of the corresponding causative agents.

Methodology/Principal Findings

To precisely define genetic differences between TPA and TPE, high-quality whole genome sequences of three TPE strains (Samoa D, CDC-2, Gauthier) were determined using next-generation sequencing techniques. TPE genome sequences were compared to four genomes of TPA strains (Nichols, DAL-1, SS14, Chicago). The genome structure was identical in all three TPE strains with similar length ranging between 1,139,330 bp and 1,139,744 bp. No major genome rearrangements were found when compared to the four TPA genomes. The whole genome nucleotide divergence (dA) between TPA and TPE subspecies was 4.7 and 4.8 times higher than the observed nucleotide diversity (π) among TPA and TPE strains, respectively, corresponding to 99.8% identity between TPA and TPE genomes. A set of 97 (9.9%) TPE genes encoded proteins containing two or more amino acid replacements or other major sequence changes. The TPE divergent genes were mostly from the group encoding potential virulence factors and genes encoding proteins with unknown function.

Conclusions/Significance

Hypothetical genes with genetic differences consistently found between TPE and TPA strains are candidates for virulence factors of syphilitic treponemes. Seventeen TPE genes were predicted to be under positive selection, and eleven of them coded for either predicted exported proteins or membrane proteins, suggesting their possible association with the cell surface. Sequence changes between TPE and TPA strains and changes specific to individual strains represent suitable targets for subspecies- and strain-specific molecular diagnostics.

7.
8.
This paper introduces the genome annotating proteomic pipeline (GAPP), a fully automated, publicly available software pipeline for the identification of peptides and proteins from human proteomic tandem mass spectrometry data. The pipeline takes as its input a series of MS/MS peak lists from a given experimental sample and produces a series of database entries corresponding to the peptides observed within the sample, along with related confidence scores. The pipeline is capable of finding any peptides expected, including those that cross intron-exon boundaries, and those due to single nucleotide polymorphisms (SNPs), alternate splicing, and post-translational modifications (PTMs). GAPP can therefore be used to re-annotate genomes, and this is supported through the inclusion of a Distributed Annotation System (DAS) server, which allows the peptides identified by the pipeline to be displayed in their genomic context within the Ensembl genome browser. GAPP is freely available via the web, at www.gapp.info.

9.
ABSTRACT: BACKGROUND: Most major genome projects and sequence databases provide a GO annotation of their data, either automatically or through human annotators, creating a large corpus of data written in the language of GO. Texts written in natural language show a statistical power law behaviour, Zipf's law, the exponent of which can provide useful information on the nature of the language being used. We have therefore explored the hypothesis that collections of GO annotations will show similar statistical behaviours to natural language. RESULTS: Annotations from the Gene Ontology Annotation project were found to follow Zipf's law. Surprisingly, the measured power law exponents were consistently different between annotation captured using the three GO sub-ontologies in the corpora (function, process and component). On filtering the corpora using GO evidence codes we found that the value of the measured power law exponent responded in a predictable way as a function of the evidence codes used to support the annotation. CONCLUSIONS: Techniques from computational linguistics can provide new insights into the annotation process. GO annotations show similar statistical behaviours to those seen in natural language with measured exponents that provide a signal which correlates with the nature of the evidence codes used to support the annotations, suggesting that the measured exponent might provide a signal regarding the information content of the annotation.
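To make the notion of a measured power law exponent concrete, here is a minimal sketch of one common way to estimate it: a least-squares line through log(frequency) against log(rank). This is an illustrative technique, not necessarily the fitting procedure the authors used, and the term labels in the toy corpus are hypothetical placeholders rather than real GO identifiers.

```python
import math
from collections import Counter


def zipf_exponent(terms):
    """Estimate the Zipf exponent of a corpus of annotation terms.

    Fits log(frequency) = c - s * log(rank) by ordinary least squares
    and returns s. Ties in frequency receive consecutive ranks.
    """
    counts = sorted(Counter(terms).values(), reverse=True)
    if len(counts) < 2:
        raise ValueError("need at least two distinct terms")
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(count) for count in counts]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy corpus (hypothetical labels): frequencies 120, 60, 40, 30 are exactly
# proportional to 1/rank, so the fitted exponent is 1.
corpus = ["term_a"] * 120 + ["term_b"] * 60 + ["term_c"] * 40 + ["term_d"] * 30
exponent = zipf_exponent(corpus)
print(round(exponent, 6))
```

Running the same function on annotation subsets, for example one sub-ontology or one evidence code at a time, would give the kind of per-subset exponents the abstract compares. A log-log regression is simple but sensitive to the low-frequency tail, so it is best read as a rough estimate.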

10.
MOTIVATION: Contigs-Assembly and Annotation Tool-Box (CAAT-Box) is a software package developed for the computational part of a genome project where the sequence is obtained by a shotgun strategy. CAAT-Box contains new tools to predict links between contigs by using similarity searches with other whole genome sequences. Most importantly, it allows annotation of a genome to commence during the finishing phase using a gene-oriented strategy. For this purpose, CAAT-Box creates an Individual Protein file (IPF) for each ORF of an assembly. The nucleotide sequence reported in an IPF corresponds to the sequence of the ORF with 500 additional bases before the ORF and 200 bases after. For annotation, additional information such as Blast results can be added or linked to the IPFs, as can automatic and/or manual annotations. When a new assembly is performed, CAAT-Box creates new IPFs according to the old IPF panel. CAAT-Box recognizes the modified IPFs, which are the only ones used for a new automatic analysis after each assembly. Using this strategy, the user works with a group of IPFs independently of the closure phase progression. The IPFs are accessible through a web server and can therefore be modified and commented on by different groups. RESULT: CAAT-Box was used to obtain and to annotate several complete genomes, such as Listeria monocytogenes and Streptococcus agalactiae. AVAILABILITY: The program may be obtained from the authors and is freely available to non-profit organisations.
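The IPF sequence window described above (the ORF plus 500 bases before and 200 bases after) can be sketched as a small slicing helper. This is not CAAT-Box code: the function name, the 0-based half-open coordinates, the forward-strand-only handling, and the choice to clamp the flanks at contig ends are all assumptions made for illustration.

```python
def ipf_window(contig, orf_start, orf_end, upstream=500, downstream=200):
    """Return (sequence, window_start, window_end) for an ORF plus flanks.

    Coordinates are 0-based and half-open on the forward strand.
    Assumption: when the ORF lies near a contig end, the flank is
    shortened rather than padded.
    """
    if not 0 <= orf_start < orf_end <= len(contig):
        raise ValueError("ORF coordinates fall outside the contig")
    start = max(0, orf_start - upstream)
    end = min(len(contig), orf_end + downstream)
    return contig[start:end], start, end


# Toy contig: 600 bases, a 9-base ORF, then only 150 bases before the end.
contig = "A" * 600 + "ATGAAATAA" + "C" * 150
seq, start, end = ipf_window(contig, 600, 609)
print(start, end, len(seq))
```

In this toy case the upstream flank is the full 500 bases, while the downstream flank is cut to the 150 bases that exist, which is the situation an unfinished assembly would often produce.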

11.
The Swiss-Prot protein knowledgebase provides manually annotated entries for all species, but concentrates on the annotation of entries from model organisms to ensure the presence of high quality annotation of representative members of all protein families. A specific Plant Protein Annotation Program (PPAP) was started to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Its main goal is the annotation of proteins from the model plant organism Arabidopsis thaliana. In addition to bibliographic references, experimental results, computed features and sometimes even contradictory conclusions, direct links to specialized databases connect amino acid sequences with the current knowledge in plant sciences. As protein families and groups of plant-specific proteins are regularly reviewed to keep up with current scientific findings, we hope that the wealth of information of Arabidopsis origin accumulated in our knowledgebase, and the numerous software tools provided on the Expert Protein Analysis System (ExPASy) web site might help to identify and reveal the function of proteins originating from other plants. Recently, a single, centralized, authoritative resource for protein sequences and functional information, UniProt, was created by joining the information contained in Swiss-Prot, Translation of the EMBL nucleotide sequence (TrEMBL), and the Protein Information Resource-Protein Sequence Database (PIR-PSD). A rising problem is that an increasing number of nucleotide sequences are not being submitted to the public databases, and thus the proteins inferred from such sequences will have difficulties finding their way to the Swiss-Prot or TrEMBL databases.

12.
Badhwar J  Karri S  Cass CK  Wunderlich EL  Znosko BM. Biochemistry, 2007, 46(50): 14715-14724
Thermodynamic data for RNA 1 × 2 nucleotide internal loops are lacking. Thermodynamic data that are available for 1 × 2 loops, however, are for loops that rarely occur in nature. In order to identify the most frequently occurring 1 × 2 nucleotide internal loops, a database of 955 RNA secondary structures was compiled and searched. Twenty-four RNA duplexes containing the most common 1 × 2 nucleotide loops were optically melted, and the thermodynamic parameters ΔH°, ΔS°, ΔG°37, and TM for each duplex were determined. This data set more than doubles the number of 1 × 2 nucleotide loops previously studied. A table of experimental free energy contributions for frequently occurring 1 × 2 nucleotide loops (as opposed to a predictive model) is likely to result in better prediction of RNA secondary structure from sequence. In order to improve free energy calculations for duplexes containing 1 × 2 nucleotide loops that do not have experimental free energy contributions, the data collected here were combined with data from 21 previously studied 1 × 2 loops. Using linear regression, the entire dataset was used to derive nearest neighbor parameters that can be used to predict the thermodynamics of previously unmeasured 1 × 2 nucleotide loops. The ΔG°37,loop and ΔH°loop nearest neighbor parameters derived here were compared to values that were published previously for 1 × 2 nucleotide loops but were derived from either a significantly smaller dataset of 1 × 2 nucleotide loops or from internal loops of various sizes [Lu, Z. J., Turner, D. H., and Mathews, D. H. (2006) Nucleic Acids Res. 34, 4912-4924]. Most of these values were found to be within experimental error, suggesting that previous approximations and assumptions associated with the derivation of those nearest neighbor parameters were valid. ΔS°loop nearest neighbor parameters are also reported for 1 × 2 nucleotide loops. Both the experimental thermodynamics and the nearest neighbor parameters reported here can be used to improve secondary structure prediction from sequence.

13.
The number of large-scale experimental datasets generated from high-throughput technologies has grown rapidly. Biological knowledge resources such as the Gene Ontology Annotation (GOA) database, which provides high-quality functional annotation to proteins within the UniProt Knowledgebase, can play an important role in the analysis of such data. The integration of GOA with analytical tools has proved to aid the clustering, annotation and biological interpretation of such large expression datasets. GOA is also useful in the development and validation of automated annotation tools, in particular text-mining systems. The increasing interest in GOA highlights the great potential of this freely available resource to assist both the biological research and bioinformatics communities.

14.

Background

Treponema pallidum ssp. pallidum (TPA), the causative agent of syphilis, and Treponema pallidum ssp. pertenue (TPE), the causative agent of yaws, are closely related spirochetes causing diseases with distinct clinical manifestations. The TPA Mexico A strain was isolated in 1953 from a male with primary syphilis living in Mexico. Attempts to cultivate the TPA Mexico A strain under in vitro conditions have revealed lower growth potential compared to other tested TPA strains.

Methodology/Principal Findings

The complete genome sequence of the TPA Mexico A strain was determined using the Illumina sequencing technique. The genome sequence assembly was verified using the whole genome fingerprinting technique and the final sequence was annotated. The genome size of the Mexico A strain was determined to be 1,140,038 bp with 1,035 predicted ORFs. The Mexico A genome sequence was compared to the whole genome sequences of three TPA (Nichols, SS14 and Chicago) and three TPE (CDC-2, Samoa D and Gauthier) strains. No large rearrangements in the Mexico A genome were found and the identified nucleotide changes occurred most frequently in genes encoding putative virulence factors. Nevertheless, the genome of the Mexico A strain revealed two genes (TPAMA_0326 (tp92) and TPAMA_0488 (mcp2-1)) which combine TPA- and TPE-specific nucleotide sequences. Both genes were found to be under positive selection within TPA strains and also between TPA and TPE strains.

Conclusions/Significance

The observed mosaic character of the TPAMA_0326 and TPAMA_0488 loci is likely a result of inter-strain recombination between TPA and TPE strains during simultaneous infection of a single host, suggesting horizontal gene transfer between treponemal subspecies.

15.
In the past few years, the field of metagenomics has been growing at an accelerated pace, particularly in response to advancements in new sequencing technologies. The large volume of sequence data from novel organisms generated by metagenomic projects has triggered the development of specialized databases and tools focused on particular groups of organisms or data types. Here we describe a pipeline for the functional annotation of viral metagenomic sequence data. The Viral MetaGenome Annotation Pipeline (VMGAP) takes advantage of a number of specialized databases, such as collections of mobile genetic elements and environmental metagenomes, to improve the classification and functional prediction of viral gene products. The pipeline assigns a functional term to each predicted protein sequence following a suite of comprehensive analyses whose results are ranked according to a priority rules hierarchy. Additional annotation is provided in the form of enzyme commission (EC) numbers, GO/MeGO terms and Hidden Markov Models together with supporting evidence.

16.
Determination of netropsin-DNA binding constants from footprinting data   Total citations: 9, self-citations: 0, citations by others: 9
A theory for deriving drug-DNA site binding constants from footprinting data is presented. Plots of oligonucleotide concentration, as a function of drug concentration, for various cutting positions on DNA are required. It is assumed that the rate of cleavage at each nucleotide position is proportional to the concentration of enzyme at that nucleotide and to the probability that the nucleotide is not blocked by drug. The probability of a nucleotide position not being blocked is calculated by assuming a conventional binding equilibrium for each binding site with exclusions for overlapping sites. The theory has been used to evaluate individual site binding constants for the antiviral agent netropsin toward a 139 base pair restriction fragment of pBR-322 DNA. Drug binding constants, evaluated from footprinting data in the presence of calf thymus DNA and poly(dGdC) as carrier and in the absence of carrier DNA, were determined by obtaining the best fit between calculated and experimental footprinting data. Although the strong sites on the fragment were all of the type (T.A)4, the value of the binding constant was strongly sequence dependent. Sites containing the dinucleotide sequence 5'-TA-3' were found to have significantly lower binding constants than those without this sequence, suggesting that an adenine-adenine clash produces a DNA structural alteration in the minor groove which discourages netropsin binding to DNA. The errors, scope, and limitations associated with the method are presented and discussed.
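The two assumptions in this abstract (cleavage rate proportional to enzyme concentration and to the probability that a nucleotide is unblocked, with overlapping sites excluding one another) can be illustrated with a small equilibrium calculation. The sketch below is not the authors' method: it assumes the free drug concentration is known, treats the enzyme concentration as a constant factor, enumerates binding states by brute force, and uses hypothetical site tuples of the form (start, length, K).

```python
from itertools import combinations


def overlaps(a, b):
    """True if two sites (start, length, K) share at least one nucleotide."""
    return a[0] < b[0] + b[1] and b[0] < a[0] + a[1]


def allowed_states(sites):
    """Yield every set of sites that can be occupied at once (no overlaps)."""
    for size in range(len(sites) + 1):
        for combo in combinations(sites, size):
            if all(not overlaps(x, y) for x, y in combinations(combo, 2)):
                yield combo


def unblocked_probability(sites, free_drug, position):
    """Probability that `position` is not covered by any bound drug.

    Each allowed state has statistical weight prod(K_i * [drug]); the
    result is the weight of states leaving `position` free, divided by
    the total weight of all allowed states.
    """
    total = 0.0
    free = 0.0
    for state in allowed_states(sites):
        weight = 1.0
        for _, _, k in state:
            weight *= k * free_drug
        total += weight
        if not any(s <= position < s + n for s, n, _ in state):
            free += weight
    return free / total


def relative_cleavage(sites, free_drug, position, enzyme=1.0):
    """Cleavage rate up to a constant: enzyme level times P(unblocked)."""
    return enzyme * unblocked_probability(sites, free_drug, position)


# Two overlapping 4-bp sites with K = 1 (arbitrary units), free drug = 1.
sites = [(0, 4, 1.0), (2, 4, 1.0)]
p_shared = unblocked_probability(sites, 1.0, 2)   # covered by both sites
p_single = unblocked_probability(sites, 1.0, 0)   # covered by one site
p_outside = unblocked_probability(sites, 1.0, 10) # covered by neither
print(p_shared, p_single, p_outside)
```

Because the two sites overlap, only three states are allowed (empty, first site bound, second site bound), so the shared position is free one third of the time and a position in only one site is free two thirds of the time. Fitting binding constants would then mean adjusting each K until the predicted cleavage matches the measured footprinting plots; the enumeration grows exponentially with the number of sites, so this form only suits short fragments.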

17.
Annotating the genome of Medicago truncatula   Total citations: 3, self-citations: 0, citations by others: 3
Medicago truncatula will be among the first plant species to benefit from the completion of a whole-genome sequencing project. For each of these species, Arabidopsis, rice and now poplar and Medicago, annotation, the process of identifying gene structures and defining their functions, is essential for the research community to benefit from the sequence data generated. Annotation of the Arabidopsis genome involved gene-by-gene curation of the entire genome, but the larger genomes of rice, Medicago and other species necessitate the automation of the annotation process. Profiting from the experience gained from previous whole-genome efforts, a uniform set of Medicago gene annotations has been generated by coordinated international effort and, along with other views of the genome data, has been provided to the research community at several websites.

18.
19.
The nucleotide sequence of a recombinant DNA clone, containing a partial mRNA sequence for human α-fetoprotein (AFP) in the plasmid vector pBR322, has been determined. Two regions of the cloned nucleotide sequence were found to agree with published amino acid sequences of two cyanogen bromide peptides derived from human AFP. Examination of the amino acid sequence, deduced from the cloned portion of the mRNA coding region, reveals extensive homology with the third domain of the human serum albumin molecule. A total of 44% ( ) amino acids and 54% ( ) nucleotides are identical in the two structures. The landmark cysteine residues are found in the same positions in both polypeptide chains, presumably forming the same disulfide bridges in AFP as those found in the albumin. The sequence homology reinforces the evidence that human AFP and albumin constitute a gene family, in analogy to the same family found in rodents. A comparison of the human and rodent sequence data suggests that the rate of molecular evolution has been faster for AFP than for albumin.

20.
Dicentrarchus labrax is one of the major marine aquaculture species in the European Union. In this study, we have developed a directed-sequencing strategy to sequence three sea bass chromosomes and compared results with other teleosts. Three BAC DNA pools were created from sea bass BAC clones that mapped to stickleback chromosomes/groups V, XVII and XXI. The pools were sequenced to 17-39x coverage by pyrosequencing. Data assembly was supported by Sanger reads and mate pair data and resulted in superscaffolds of 13.2 Mb, 17.5 Mb and 13.7 Mb respectively. Annotation features of the superscaffolds include 1477 genes. We analyzed size change of exon, intron and intergenic sequence between teleost species and deduced a simple model for the evolution of genome composition in teleost lineage. Combination of second generation sequencing technologies, Sanger sequencing and genome partitioning strategies allows “high-quality draft assemblies” of chromosome-sized superscaffolds, which are crucial for the prediction and annotation of complete genes.
