Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
The PANTHER database was designed for high-throughput analysis of protein sequences. One of the key features is a simplified ontology of protein function, which allows browsing of the database by biological functions. Biologist curators have associated the ontology terms with groups of protein sequences rather than individual sequences. Statistical models (Hidden Markov Models, or HMMs) are built from each of these groups. The advantage of this approach is that new sequences can be automatically classified as they become available. To ensure accurate functional classification, HMMs are constructed not only for families, but also for functionally distinct subfamilies. Multiple sequence alignments and phylogenetic trees, including curator-assigned information, are available for each family. The current version of the PANTHER database includes training sequences from all organisms in the GenBank non-redundant protein database, and the HMMs have been used to classify gene products across the entire genomes of human and Drosophila melanogaster. The ontology terms and protein families and subfamilies, as well as Drosophila gene classifications, can be browsed and searched for free. Due to outstanding contractual obligations, access to human gene classifications and to protein family trees and multiple sequence alignments will temporarily require a nominal registration fee. PANTHER is publicly available on the web at http://panther.celera.com.

2.
MOTIVATION: The SWISS-PROT sequence database contains keywords of functional annotations for many proteins. In contrast, information about the sub-cellular localization is available for only a few proteins. Experts can often infer localization from keywords describing protein function. We developed LOCkey, a fully automated method for lexical analysis of SWISS-PROT keywords that assigns sub-cellular localization. With the rapid growth in sequence data, the biochemical characterisation of sequences has been falling behind. Our method may be a useful tool for supplementing functional information already automatically available. RESULTS: The method reached a level of more than 82% accuracy in a full cross-validation test. Due to a lack of functional annotations, we could infer localization for fewer than half of all proteins in SWISS-PROT. We applied LOCkey to annotate five entirely sequenced proteomes, namely Saccharomyces cerevisiae (yeast), Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Arabidopsis thaliana (plant) and a subset of all human proteins. LOCkey found about 8000 new annotations of sub-cellular localization for these eukaryotes.

3.
4.
VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences from animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html.

5.
MOTIVATION: Multiple chitinases as well as lectins closely related to them have been characterized previously from many insect species and the corresponding genes/cDNAs have been cloned. However, the identification of the entire assortment of genes for chitinase family proteins and their differences in biochemical properties have not been carried out in any individual insect species. The completion of the entire DNA sequence of Drosophila melanogaster (fruit fly) genome and identification of open reading frames presents an opportunity to study the structures and functions of chitinase-like proteins, and also to identify new members of this family in Drosophila. We are, therefore, interested in studying the functional genomics of chitinase-like gene families in insects. METHODS: We searched the Drosophila protein sequences database using fully characterized insect chitinase sequences and BLASTP software, identified all the putative chitinase-like proteins encoded in Drosophila genome, and predicted their structures using domain analysis tools. A phylogenetic analysis of the chitinase-like proteins from Drosophila and several other insect species was carried out. The structures of these chitinases were modeled using homology modeling software. RESULTS: Our analysis revealed the presence of 18 chitinase-like proteins in the Drosophila protein database. Among these are seven novel chitinase-like proteins that contain four signature amino acid sequences of chitinases belonging to family 18 glycosylhydrolases, including both acidic and hydrophobic amino acid residues critical for enzyme activity. All the proteins contain at least one catalytic domain with one having four catalytic domains. Phylogenetic analysis of chitinase-like proteins from Drosophila and other insects revealed an evolutionary relationship among all these proteins, which indicated gene duplication and domain shuffling to generate the observed diversity in the encoded proteins. 
Homology modeling showed that all the Drosophila chitinase-like proteins contain one or more catalytic domains with an (alpha/beta)8 barrel-like structure. Our results suggest that insects utilize multiple family 18 chitinolytic enzymes and also non-enzymatic chitinase-like proteins for degrading/remodeling/binding to chitin in different insect anatomical extracellular structures, such as the cuticle, peritrophic membrane, trachea and mouth parts during insect development, and possibly for other roles including chitin synthesis. AVAILABILITY: Perl program and supplementary material are available at http://www.ksu.edu/bioinformatics/supplementary.htm

6.
Babnigg G, Giometti CS. Proteomics 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
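The idea of an identifier derived purely from the primary sequence can be sketched as follows. The abstract does not spell out the derivation, so this sketch assumes the scheme used by commonly available SEGUID implementations: a SHA-1 digest of the upper-cased sequence, base64-encoded with the trailing padding removed. The helper `link_identifiers` and the database names in the usage note are hypothetical and only illustrate how such a checksum could link entries across databases.

```python
import base64
import hashlib


def seguid(sequence: str) -> str:
    """Checksum-style identifier computed from the sequence alone.

    Assumption: SHA-1 over the upper-cased sequence, base64-encoded,
    with '=' padding stripped. Whitespace is dropped before hashing.
    """
    normalized = "".join(sequence.split()).upper()
    digest = hashlib.sha1(normalized.encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


def link_identifiers(databases):
    """Group database-specific accessions by their sequence-derived identifier.

    `databases` maps a database name to {accession: sequence}.
    Returns {identifier: [(database name, accession), ...]}.
    """
    links = {}
    for db_name, entries in databases.items():
        for accession, sequence in entries.items():
            links.setdefault(seguid(sequence), []).append((db_name, accession))
    return links
```

For instance, `link_identifiers({"db_a": {"A1": "MKTAYIAK"}, "db_b": {"B7": "mktayiak"}})` would place both accessions under a single key, because the key depends only on the residues and not on either database's annotation.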

7.
The use of fluorescent protein tags has had a huge impact on cell biological studies in virtually every experimental system. Incorporation of coding sequence for fluorescent proteins such as green fluorescent protein (GFP) into genes at their endogenous chromosomal position is especially useful for generating GFP-fusion proteins that provide accurate cellular and subcellular expression data. We tested modifications of a transposon-based protein trap screening procedure in Drosophila to optimize the rate of recovering useful protein traps and their analysis. Transposons carrying the GFP-coding sequence flanked by splice acceptor and donor sequences were mobilized, and new insertions that resulted in production of GFP were captured using an automated embryo sorter. Individual stocks were established, GFP expression was analyzed during oogenesis, and insertion sites were determined by sequencing genomic DNA flanking the insertions. The resulting collection includes lines with protein traps in which GFP was spliced into mRNAs and embedded within endogenous proteins or enhancer traps in which GFP expression depended on splicing into transposon-derived RNA. We report a total of 335 genes associated with protein or enhancer traps and a web-accessible database for viewing molecular information and expression data for these genes.

8.
We have previously shown that the detection of gene fusion events can contribute towards the elucidation of functional associations of proteins within entire genomes. Here we have analysed the entire genome of Drosophila melanogaster using fusion analysis and two additional constraints that improve the reliability of the predictions, viz. low sequence similarity and low degree of paralogy of the component proteins involved in a fusion event. Imposing these constraints, the total number of unique component pairs is reduced from 18 654 to a mere 220 cases, which are expected to represent some of the most reliably detected functionally associated proteins. Using additional information from sequence databases, we have been able to detect pairs of functionally associated proteins with important functions in cellular and developmental pathways, such as spermatogenesis and programmed cell death.

9.
PGTdb: a database providing growth temperatures of prokaryotes   Cited by: 6 (self-citations: 0; citations by others: 6)
Included in Prokaryotic Growth Temperature database (PGTdb) are a total of 1334 temperature data from 1072 prokaryotic organisms, Bacteria and Archaea. PGTdb integrates microbial growth temperature data from literature survey with their nucleotide/protein sequence and protein structure data from related databases. A direct correlation is observed between the average growth temperature of an organism and the melting temperature of proteins from the organism. Therefore, this database is useful not only for microbiologists to obtain cultivation conditions, but also for biochemists and structural biologists to study the correlation between protein sequences/structures and their thermostability. In addition, the taxonomy and ribosomal RNA sequence(s) of an organism are linked through NCBI Taxonomy and the Ribosomal RNA Operon Copy Number Database rrndb, respectively. PGTdb is the only integrated database on the Internet to provide the growth temperature data of the prokaryotes and the combined information of their nucleotide/protein sequences, protein structures, taxonomy and phylogeny. AVAILABILITY: http://pgtdb.csie.ncu.edu.tw

10.
MOTIVATION: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences; (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the full database size, which resulted in six times faster iterative profile searching. The RSDBs are produced at different granularity for efficient homology searching. AVAILABILITY: All the RSDB files generated and the full analysis results are available through the internet: ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
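Reducing a database to a representative set below a mutual identity threshold can be sketched with a simple greedy filter. The abstract does not describe the procedure used to build the RSDBs, so everything here is an assumption for illustration: the greedy longest-first strategy, the function names, and especially `crude_identity`, which compares residues position by position and stands in for a real alignment-based identity.

```python
def crude_identity(a: str, b: str) -> float:
    """Fraction of matching positions over the shorter sequence.

    Assumption: a stand-in for alignment-based sequence identity;
    real pipelines would align the sequences first.
    """
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / shorter


def representative_set(sequences, max_identity=0.5):
    """Greedy filter: keep a sequence only if it is at most `max_identity`
    identical to every representative kept so far (longest sequences first)."""
    representatives = []
    for candidate in sorted(sequences, key=len, reverse=True):
        if all(crude_identity(candidate, kept) <= max_identity
               for kept in representatives):
            representatives.append(candidate)
    return representatives
```

With `max_identity=0.5` this mirrors the RSDB50 idea of keeping only sequences with mutual identity of 50% or less; other thresholds would give coarser or finer sets, which is one way to read the "different granularity" mentioned in the abstract.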

11.
SUMMARY: We present a Web server where the SYSTERS cluster set of the non-redundant protein database consisting of sequences from SWISS-PROT and PIR is being made available for querying and browsing. The cluster set can be searched with a new sequence using the SSMAL search tool. Additionally, a multiple alignment is generated for each cluster and annotated with domain information from the Pfam protein family database. AVAILABILITY: The server address is http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform

12.
MOTIVATION: Protein sequence and family data is accumulating at such a rapid rate that state-of-the-art databases and interface tools are required to aid curators with their classifications. We have designed such a system, MetaFam, to facilitate the comparison and integration of public protein sequence and family data. This paper presents the global schema, integration issues, and query capabilities of MetaFam. RESULTS: MetaFam is an integrated data warehouse of information about protein families and their sequences. This data has been collected into a consistent global schema, and stored in an Oracle relational database. The warehouse implementation allows for quick removal of outdated data sets. In addition to the relational implementation of the primary schema, we have developed several derived tables that enable efficient access from data visualization and exploration tools. Through a series of straightforward SQL queries, we demonstrate the usefulness of this data warehouse for comparing protein family classifications and for functional assignment of new sequences.

13.
14.
MOTIVATION: A few years ago, FlyBase undertook to design a new database schema to store Drosophila data. It would fully integrate genomic sequence and annotation data with bibliographic, genetic, phenotypic and molecular data from the literature representing a distillation of the first 100 years of research on this major animal model system. In developing this new integrated schema, FlyBase also made a commitment to ensure that its design was generic, extensible and available as open source, so that it could be employed as the core schema of any model organism data repository, thereby avoiding redundant software development and potentially increasing interoperability. Our question was whether we could create a relational database schema that would be successfully reused. RESULTS: Chado is a relational database schema now being used to manage biological knowledge for a wide variety of organisms, from human to pathogens, especially the classes of information that directly or indirectly can be associated with genome sequences or the primary RNA and protein products encoded by a genome. Biological databases that conform to this schema can interoperate with one another, and with application software from the Generic Model Organism Database (GMOD) toolkit. Chado is distinctive because its design is driven by ontologies. The use of ontologies (or controlled vocabularies) is ubiquitous across the schema, as they are used as a means of typing entities. The Chado schema is partitioned into integrated subschemas (modules), each encapsulating a different biological domain, and each described using representations in appropriate ontologies. To illustrate this methodology, we describe here the Chado modules used for describing genomic sequences. AVAILABILITY: GMOD is a collaboration of several model organism database groups, including FlyBase, to develop a set of open-source software for managing model organism data. 
The Chado schema is freely distributed under the terms of the Artistic License (http://www.opensource.org/licenses/artistic-license.php) from GMOD (www.gmod.org).

15.
UniProt archive     
UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, publicly accessible resources. All new and updated protein sequences are collected and loaded daily into UniParc for full coverage. To avoid redundancy, each unique sequence is stored only once with a stable protein identifier, which can be used later in UniParc to identify the same protein in all source databases. When proteins are loaded into the database, database cross-references are created to link them to the origins of the sequences. As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. UniParc contains only protein sequences and database cross-references; all other information must be retrieved from the source databases.
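The store-once-plus-cross-references design described above can be illustrated with a toy in-memory archive. This is only a sketch of the general idea, not UniParc's implementation: the class `SequenceArchive`, its methods, and the `ARC000001`-style identifiers are hypothetical names invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceArchive:
    """Toy non-redundant archive: one record per unique sequence."""

    _ids: dict = field(default_factory=dict)    # sequence -> archive identifier
    _xrefs: dict = field(default_factory=dict)  # identifier -> {(source db, accession)}

    def load(self, source_db: str, accession: str, sequence: str) -> str:
        """Store the sequence if it is new, then record where it came from."""
        key = sequence.upper()
        archive_id = self._ids.get(key)
        if archive_id is None:
            # Hypothetical identifier format; assigned once and never reused.
            archive_id = f"ARC{len(self._ids) + 1:06d}"
            self._ids[key] = archive_id
            self._xrefs[archive_id] = set()
        self._xrefs[archive_id].add((source_db, accession))
        return archive_id

    def cross_references(self, archive_id: str):
        """All (source db, accession) pairs that share this sequence."""
        return sorted(self._xrefs[archive_id])
```

Loading the same sequence from two source databases returns the same identifier both times, and `cross_references` then lists both origins. A search over the unique sequences therefore reaches every cross-referenced source entry, which is the equivalence the abstract describes.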

16.
FlgM proteins, also known as Anti-sigma-28 factor (sigma28), are negative regulators of flagellin synthesis. Recently, a three-dimensional structure of the Aquifex aeolicus sigma28/FlgM complex (PDB code: 1rp3) was determined by X-ray crystallography at 2.3 Å resolution. Furthermore, experimental data on bacterial FlgM, including site-directed mutagenesis and structural characterization by NMR, are also available. However, an interpretation of the sequence-structure-function relationships combining X-ray and NMR data with the evolutionary information extracted from the increasing number of FlgM-related sequences annotated in databases is not available. In the present study, we combined database sequence searches and sequence-analysis tools to update the multiple sequence alignment of a previously characterized cluster of orthologs (COG2747) and the PFAM classification of protein domains (PF04316) for the FlgM family. A phylogenetic analysis of 77 protein sequences revealed the presence of at least three major sequence clades within the FlgM family. In addition, we predicted functional residues using a SequenceSpace method. We also generated homology models for Bacillus subtilis and Salmonella typhimurium FlgM proteins, for which sequence-structure-function relationship data are available, and used the docking program ClusPro to hypothesize about the dimer association between FlgM proteins. In conclusion, the analysis presented in this work will be useful in designing new experiments to better understand protein-protein interactions between FlgM, sigma factors, and putative molecules from the flagellar export apparatus. Electronic Supplementary Material is available in the online version of this article at http://link.springer.de/

17.
Histone Sequence Database: new histone fold family members.   Cited by: 2 (self-citations: 0; citations by others: 2)
Searches of the major public protein databases with core and linker chicken and human histone sequences have resulted in the compilation of an annotated set of histone protein sequences. In addition, new database searches with two distinct motif search algorithms have identified several members of the histone fold family, including human DRAP1 and yeast CSE4. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, links to the Entrez integrated information retrieval system, structures for histone and histone fold proteins, and the ability to visualize structural data through Cn3D. The database currently contains >1000 protein sequences, which are searchable by protein type, accession number, organism name, or any other free text appearing in the definition line of the entry. All sequences and alignments in this database are available through the World Wide Web at http://www.nhgri.nih.gov/DIR/GTB/HISTONES or http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES

18.
UniRef: comprehensive and non-redundant UniProt reference clusters   Cited by: 2 (self-citations: 0; citations by others: 2)
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

19.
The insect allatostatins are neurohormones, acting on the corpora allata (where they block the release of juvenile hormone) and on the insect gut (where they block smooth muscle contraction). We screened the "Drosophila Genome Project" database with electronic sequences corresponding to various insect allatostatins. This resulted in alignment with a DNA sequence coding for some Drosophila allatostatins (drostatins). Using PCR with oligonucleotide primers directed against the presumed exons of this Drosophila allatostatin gene and subsequent 3'- and 5'-RACE, we were able to clone its cDNA. The Drosophila allatostatin preprohormone contains four amino acid sequences that after processing would give rise to four Drosophila allatostatins: Val-Glu-Arg-Tyr-Ala-Phe-Gly-Leu-NH(2) (drostatin-1), Leu-Pro-Val-Tyr-Asn-Phe-Gly-Leu-NH(2) (drostatin-2), Ser-Arg-Pro-Tyr-Ser-Phe-Gly-Leu-NH(2) (drostatin-3), and Thr-Thr-Arg-Pro-Gln-Pro-Phe-Asn-Phe-Gly-Leu-NH(2) (drostatin-4). Drostatin-2 is identical to helicostatin-2 (11-18) and drostatin-3 to helicostatin-3, two neurohormones previously isolated from the moth Helicoverpa armigera. Furthermore, drostatin-3 has previously been isolated from Drosophila itself. Drostatins-1 and -4 are novel members of the insect allatostatin neuropeptide family. The Drosophila allatostatin preprohormone gene contains two introns and three exons. The gene is located on the right arm of the third chromosome, position 96A-B. The existence of at least four different Drosophila allatostatins opens the possibility of a differential action of some of these hormones on the two recently cloned Drosophila allatostatin receptors, DAR-1 and -2. This is the first report on an allatostatin preprohormone from Drosophila.

20.
Automated genome sequence analysis and annotation.   Cited by: 5 (self-citations: 0; citations by others: 5)
MOTIVATION: Large-scale genome projects generate a rapidly increasing number of sequences, most of them biochemically uncharacterized. Research in bioinformatics contributes to the development of methods for the computational characterization of these sequences. However, the installation and application of these methods require experience and are time consuming. RESULTS: We present here an automatic system for preliminary functional annotation of protein sequences that has been applied to the analysis of sets of sequences from complete genomes, both to refine overall performance and to make new discoveries comparable to those made by human experts. The GeneQuiz system includes a Web-based browser that allows examination of the evidence leading to an automatic annotation and offers additional information, views of the results, and links to biological databases that complement the automatic analysis. System structure and operating principles concerning the use of multiple sequence databases, underlying sequence analysis tools, lexical analyses of database annotations and decision criteria for functional assignments are detailed. The system makes automatic quality assessments of results based on prior experience with the underlying sequence analysis tools; overall error rates in functional assignment are estimated at 2.5-5% for cases annotated with highest reliability ('clear' cases). Sources of over-interpretation of results are discussed with proposals for improvement. A conservative definition for reporting 'new findings' that takes account of database maturity is presented along with examples of possible kinds of discoveries (new function, family and superfamily) made by the system. System performance in relation to sequence database coverage, database dynamics and database search methods is analysed, demonstrating the inherent advantages of an integrated automatic approach using multiple databases and search methods applied in an objective and repeatable manner. 
AVAILABILITY: The GeneQuiz system is publicly available for analysis of protein sequences through a Web server at http://www.sander.ebi.ac.uk/gqsrv/submit
