Similar literature
 20 similar articles found (search time: 31 ms)
1.
MOTIVATION: Databases of protein families often exhibit drastically different properties of the protein family space. RESULTS: We compared the properties of protein family space as reflected by exhaustive protein family databases and databases with predefined families. We used TRIBES, Protomap, ProDom and COGs as representatives of the exhaustive databases, and Pfam-A and Superfamily as databases that predefine families. We observe a power-law distribution of family sizes in all these databases, albeit in predefined databases the power-law line collapses before reaching smaller sized families. We discuss the future trends of this power-law distribution and suggest that saturation in the sampling of protein family space will result in a distortion of the power law in small family sizes. For larger genome sizes, predefined databases show logarithmic growth of the number of families per genome, whereas exhaustive databases exhibit a virtually linear relationship. All databases consistently differ in the proportion of protein families shared between taxa. Predefined databases have a larger number of protein families shared between the three domains of life, while exhaustive databases show a much more fragmented distribution. We argue that these discrepancies reflect alternative approaches to the trade-off issue of sensitivity versus specificity in the detection of homologous proteins. We conclude that these properties are complementary rather than contradictory, while describing the protein universe from different perspectives.
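
The power-law claim in the abstract above can be illustrated with a rough sketch. It assumes a plain least-squares fit of log(number of families) against log(family size); the abstract does not say how the authors fitted their distributions, and the function name and toy data below are hypothetical.

```python
import math
from collections import Counter


def fit_power_law_exponent(family_sizes):
    """Estimate the slope of log(count) versus log(size) by least squares.

    family_sizes: iterable of positive integers, one per protein family.
    A slope near -2 means the number of families falls off as size**-2.
    """
    counts = Counter(family_sizes)  # family size -> number of families of that size
    if len(counts) < 2:
        raise ValueError("need at least two distinct family sizes")
    xs = [math.log(size) for size in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical toy data that follows count = 1000 / size**2 exactly.
sizes = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
slope = fit_power_law_exponent(sizes)
print(round(slope, 3))  # -2.0
```

A log-log regression is the simplest possible estimator and is sensitive to sparse counts in the tail, so a collapse of the line at small family sizes, as described for predefined databases, would show up as points falling away from the fitted slope.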

2.
An efficient algorithm for large-scale detection of protein families   Cited by: 6 (self-citations: 0, citations by others: 6)
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
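
The Markov cluster step named in the abstract above can be sketched as alternating expansion (matrix squaring) and inflation (element-wise powering followed by column normalisation) on a similarity matrix. This is a minimal pure-Python illustration, not the TRIBE-MCL implementation: the function names, the inflation value of 2.0, the added self-loops and the toy matrix are all assumptions made for the example.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    col_sums = [sum(matrix[i][j] for i in range(n)) for j in range(n)]
    return [[matrix[i][j] / col_sums[j] for j in range(n)] for i in range(n)]


def mcl(similarity, inflation=2.0, max_iterations=100, tolerance=1e-9):
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    matrix = [[float(similarity[i][j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        matrix[i][i] = max(matrix[i][i], 1.0)  # self-loops keep every column non-empty
    matrix = normalise_columns(matrix)
    for _ in range(max_iterations):
        # Expansion: one extra step of the random walk (matrix squared).
        expanded = [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
                     for j in range(n)] for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalise.
        inflated = normalise_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - matrix[i][j])
                     for i in range(n) for j in range(n))
        matrix = inflated
        if change < tolerance:
            break
    # Each non-empty row lists the nodes attracted to that row's node.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in matrix}
    clusters.discard(frozenset())
    return sorted(sorted(c) for c in clusters)


# Hypothetical similarity matrix: proteins 0-2 are mutually similar, as are 3-4.
toy = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
print(mcl(toy))  # [[0, 1, 2], [3, 4]]
```

In practice the input would be derived from precomputed sequence similarity scores, as the abstract states, and a sparse-matrix implementation would be needed at database scale.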

3.
Membrane proteins serve as cellular gatekeepers, regulators, and sensors. Prior studies have explored the functional breadth and evolution of proteins and families of particular interest, such as the diversity of transport-associated membrane protein families in prokaryotes and eukaryotes, the composition of integral membrane proteins, and family classification of all human G-protein coupled receptors. However, a comprehensive analysis of the content and evolutionary associations between membrane proteins and families in a diverse set of genomes is lacking. Here, a membrane protein annotation pipeline was developed to define the integral membrane genome and associations between 21,379 proteins from 34 genomes; most, but not all of these proteins belong to 598 defined families. The pipeline was used to provide target input for a structural genomics project that successfully cloned, expressed, and purified 61 of our first 96 selected targets in yeast. Furthermore, the methodology was applied (1) to explore the evolutionary history of the substrate-binding transmembrane domains of the human ABC transporter superfamily, (2) to identify the multidrug resistance-associated membrane proteins in whole genomes, and (3) to identify putative new membrane protein families.

4.
5.
6.
The evolution of RNA editing and pentatricopeptide repeat genes   Cited by: 1 (self-citations: 0, citations by others: 1)
The pentatricopeptide repeat (PPR) is a degenerate 35-amino-acid structural motif identified from analysis of the sequenced genome of the model plant Arabidopsis thaliana. From the wealth of sequence information now available from plant genomes, the PPR protein family is now known to be one of the largest families in angiosperm species, as most genomes encode 400-600 members. As the number of PPR genes is generally only c. 10-20 in other eukaryotic organisms, including green algae, the family has obviously greatly expanded during land plant evolution. This provides a rare opportunity to study selection pressures driving a 50-fold expansion of a single gene family. PPR proteins are sequence-specific RNA-binding proteins involved in many aspects of RNA processing in organelles. In this review, we will summarize our current knowledge about the evolution of PPR genes, and will discuss the relevance of the dramatic expansion in the family to the functional diversification of plant organelles, focusing primarily on RNA editing.

7.
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The family of DNA-binding proteins is one of the most populated and studied amongst the various genomes of bacteria, archaea and eukaryotes and the Web-based system presented here is an approach to their classification. The DnaProt resource is an annotated and searchable collection of protein sequences for the families of DNA-binding proteins. The database contains 3238 full-length sequences (retrieved from the SWISS-PROT database, release 38) that include, at least, a DNA-binding domain. Sequence entries are organized into families defined by PROSITE patterns, PRINTS motifs and de novo excised signatures. Combining global similarities and functional motifs into a single classification scheme, DNA-binding proteins are classified into 33 unique classes, which helps to reveal comprehensive family relationships. To maximize family information retrieval, DnaProt contains a collection of multiple alignments for each DNA-binding family while the recognized motifs can be used as diagnostically functional fingerprints. All available structural class representatives have been referenced. The resource was developed as a Web-based management system for online free access of customized data sets. Entries are fully hyperlinked to facilitate easy retrieval of the original records from the source databases while functional and phylogenetic annotation will be applied to newly sequenced genomes. The database is freely available for online search of a library containing specific patterns of the identified DNA-binding protein classes and retrieval of individual entries from our WWW server (http://kronos.biol.uoa.gr/~mariak/dbDNA.html).

8.
Of the ~4000 ORFs identified through the genome sequence of Mycobacterium tuberculosis (TB) H37Rv, experimentally determined structures are available for 312. Since knowledge of protein structures is essential to obtain a high-resolution understanding of the underlying biology, we seek to obtain a structural annotation for the genome, using computational methods. Structural models were obtained and validated for ~2877 ORFs, covering ~70% of the genome. Functional annotation of each protein was based on fold-based functional assignments and a novel binding site based ligand association. New algorithms for binding site detection and genome scale binding site comparison at the structural level, recently reported from the laboratory, were utilized. Besides these, the annotation covers detection of various sequence and sub-structural motifs and quaternary structure predictions based on the corresponding templates. The study provides an opportunity to obtain a global perspective of the fold distribution in the genome. The annotation indicates that cellular metabolism can be achieved with only 219 folds. New insights about the folds that predominate in the genome, as well as the fold-combinations that make up multi-domain proteins are also obtained. 1728 binding pockets have been associated with ligands through binding site identification and sub-structure similarity analyses. The resource (http://proline.physics.iisc.ernet.in/Tbstructuralannotation), being one of the first to be based on structure-derived functional annotations at a genome scale, is expected to be useful for better understanding of TB and for application in drug discovery. The reported annotation pipeline is fairly generic and can be applied to other genomes as well.

9.
Several GTP-binding proteins with poorly defined functions were previously identified in Escherichia coli (i.e. Era, ThdF (TrmE)), Bacillus subtilis (i.e. Obg) and Neisseria gonorrhoeae (i.e. EngA). In these species, every individual protein is encoded by an essential gene. BLAST searches were used to detect orthologs in genomes of various organisms. Alignments of orthologous sequences allowed the construction of phylogenetic trees and the definition of protein families. The BLAST searches also resulted in the identification of two additional families, the YchF and YihA families, named after the ychF and yihA genes of E. coli. Most families are not present in archaeal genomes, but representatives of each family were also detected in eukaryotic genomes. Only representatives of the YchF family are present in every genome sequenced to date, suggesting that YchF-like proteins might be involved in a fundamental life process. The GTP1/DRG family consisting of eukaryotic and archaeal proteins is related to the YchF family of GTP-binding proteins. The relationship of the six prokaryotic families of GTP-binding proteins and the GTP1/DRG family to eukaryotic GTPase families was also investigated: With the exception of the ARF family, a clear separation of the six prokaryotic families and the GTP1/DRG family with respect to eukaryotic (RAB, RAN, RAS and RHO) GTPases was observed.

10.
Gene order in prokaryotes is conserved to a much lesser extent than protein sequences. Only some operons, primarily those that encode physically interacting proteins, are conserved in all or most of the bacterial and archaeal genomes. Nevertheless, even the limited conservation of operon organisation that is observed provides valuable evolutionary and functional clues through multiple genome comparisons. With the rapid growth in the number and diversity of sequenced prokaryotic genomes, functional inferences for uncharacterized genes located in the same conserved gene neighborhood with well-studied genes are becoming increasingly important. In this review, we discuss various computational approaches for identification of conserved gene strings and construction of local alignments of gene orders in prokaryotic genomes.

11.
Summary: All modern mammals contain a distinctive, highly repeated (⩾50,000 members) family of long interspersed repeated DNA called the L1 (LINE 1) family. While the modern L1 families were derived from a common ancestor that predated the mammalian radiation ∼80 million years ago, most of the members of these families were generated within the last 5 million years. However, recently we demonstrated that modern murine (Old World rats and mice) genomes share an older long interspersed repeated DNA family that we called Lx. Here we report our analysis of the DNA sequence of Lx family members and the relationship of this family to the modern L1 families in mouse and rat. The extent of DNA sequence divergence between Lx members indicates that the Lx amplification occurred about 12 million years ago, around the time of the murine radiation. Parsimony analysis revealed that Lx elements were ancestral to both the modern rat and mouse L1 families. However, we found that few if any of the evolutionary intermediates between the Lx and the modern L1 families were extensively amplified. Because the modern L1 families have evolved under selective pressure, the evolutionary intermediates must have been capable of replication. Therefore, replication-competent L1 elements can reside in genomes without undergoing extensive amplification. We discuss the bearing of our findings on the evolution of L1 DNA elements and the mammalian genome.

12.
We present an analysis of 203 completed genomes in the Gene3D resource (including 17 eukaryotes), which demonstrates that the number of protein families is continually expanding over time and that singleton-sequences appear to be an intrinsic part of the genomes. A significant proportion of the proteomes can be assigned to fewer than 6000 well-characterized domain families with the remaining domain-like regions belonging to a much larger number of small uncharacterized families that are largely species specific. Our comprehensive domain annotation of 203 genomes enables us to provide more accurate estimates of the number of multi-domain proteins found in the three kingdoms of life than previous calculations. We find that 67% of eukaryotic sequences are multi-domain compared with 56% of sequences in prokaryotes. By measuring the domain coverage of genome sequences, we show that the structural genomics initiatives should aim to provide structures for less than a thousand structurally uncharacterized Pfam families to achieve reasonable structural annotation of the genomes. However, in large families, additional structures should be determined as these would reveal more about the evolution of the family and enable a greater understanding of how function evolves.

13.
VY Muley, A Ranjan. PLoS ONE 2012, 7(7): e42057

Background

Recent progress in computational methods for predicting physical and functional protein-protein interactions has provided new insights into the complexity of biological processes. Most of these methods assume that functionally interacting proteins are likely to have a shared evolutionary history. This history can be traced out for the protein pairs of a query genome by correlating different evolutionary aspects of their homologs in multiple genomes known as the reference genomes. These methods include phylogenetic profiling, gene neighborhood and co-occurrence of the orthologous protein coding genes in the same cluster or operon. These are collectively known as genomic context methods. On the other hand a method called mirrortree is based on the similarity of phylogenetic trees between two interacting proteins. Comprehensive performance analyses of these methods have been frequently reported in literature. However, very few studies provide insight into the effect of reference genome selection on detection of meaningful protein interactions.

Methods

We analyzed the performance of four methods and their variants to understand the effect of reference genome selection on prediction efficacy. We used six sets of reference genomes, sampled in accordance with phylogenetic diversity and relationship between organisms from 565 bacteria. We used Escherichia coli as a model organism and the gold standard datasets of interacting proteins reported in DIP, EcoCyc and KEGG databases to compare the performance of the prediction methods.

Conclusions

Higher performance for predicting protein-protein interactions was achievable even with 100–150 bacterial genomes out of 565 genomes. Inclusion of archaeal genomes in the reference genome set improves performance. We find that in order to obtain a good performance, it is better to sample few genomes of related genera of prokaryotes from the large number of available genomes. Moreover, such a sampling allows for selecting 50–100 genomes for comparable accuracy of predictions when computational resources are limited.
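
Phylogenetic profiling, one of the genomic context methods named in the Background above, can be sketched as comparing presence/absence vectors of homologs across the reference genomes. The sketch below scores protein pairs with the Jaccard index, which is only one of several possible similarity measures and is an assumption here, as are the function names, protein labels and toy profiles.

```python
from itertools import combinations


def jaccard(profile_a, profile_b):
    """Share of reference genomes containing both homologs among those containing either."""
    both = sum(1 for a, b in zip(profile_a, profile_b) if a and b)
    either = sum(1 for a, b in zip(profile_a, profile_b) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Score every protein pair and return (score, protein, protein), best first.

    profiles: dict mapping a protein name to a list of 0/1 flags, one flag per
    reference genome (1 = a homolog is present in that genome).
    """
    scored = [(jaccard(profiles[p], profiles[q]), p, q)
              for p, q in combinations(sorted(profiles), 2)]
    return sorted(scored, key=lambda item: (-item[0], item[1], item[2]))


# Hypothetical profiles over five reference genomes.
profiles = {
    "protA": [1, 1, 0, 1, 0],
    "protB": [1, 1, 0, 1, 0],
    "protC": [0, 0, 1, 0, 1],
}
for score, first, second in rank_pairs(profiles):
    print(first, second, round(score, 2))
# protA protB 1.0
# protA protC 0.0
# protB protC 0.0
```

Changing which genomes make up the columns of each profile changes the scores, which is the reference genome selection effect the study examines.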

14.
Lee D, Grant A, Marsden RL, Orengo C. Proteins 2005, 59(3): 603-615
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structure genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.

15.
Predatory bacteria are taxonomically disparate, exhibit diverse predatory strategies and are widely distributed in varied environments. To date, their predatory phenotypes cannot be discerned in genome sequence data thereby limiting our understanding of bacterial predation, and of its impact in nature. Here, we define the ‘predatome’, that is, sets of protein families that reflect the phenotypes of predatory bacteria. The proteomes of all sequenced 11 predatory bacteria, including two de novo sequenced genomes, and 19 non-predatory bacteria from across the phylogenetic and ecological landscapes were compared. Protein families discriminating between the two groups were identified and quantified, demonstrating that differences in the proteomes of predatory and non-predatory bacteria are large and significant. This analysis allows predictions to be made, as we show by confirming from genome data an overlooked bacterial predator. The predatome exhibits deficiencies in riboflavin and amino acids biosynthesis, suggesting that predators obtain them from their prey. In contrast, these genomes are highly enriched in adhesins, proteases and particular metabolic proteins, used for binding to, processing and consuming prey, respectively. Strikingly, predators and non-predators differ in isoprenoid biosynthesis: predators use the mevalonate pathway, whereas non-predators, like almost all bacteria, use the DOXP pathway. By defining predatory signatures in bacterial genomes, the predatory potential they encode can be uncovered, filling an essential gap for measuring bacterial predation in nature. Moreover, we suggest that full-genome proteomic comparisons are applicable to other ecological interactions between microbes, and provide a convenient and rational tool for the functional classification of bacteria.

16.
The PEDANT genome database (http://pedant.gsf.de) provides exhaustive automatic analysis of genomic sequences by a large variety of established bioinformatics tools through a comprehensive Web-based user interface. One hundred and seventy seven completely sequenced and unfinished genomes have been processed so far, including large eukaryotic genomes (mouse, human) published recently. In this contribution, we describe the current status of the PEDANT database and novel analytical features added to the PEDANT server in 2002. Those include: (i) integration with the BioRS data retrieval system which allows fast text queries, (ii) pre-computed sequence clusters in each complete genome, (iii) a comprehensive set of tools for genome comparison, including genome comparison tables and protein function prediction based on genomic context, and (iv) computation and visualization of protein-protein interaction (PPI) networks based on experimental data. The availability of functional and structural predictions for 650 000 genomic proteins in well organized form makes PEDANT a useful resource for both functional and structural genomics.

17.
InterPro (http://www.ebi.ac.uk/interpro/) is an integrated documentation resource for protein families, domains and sites, developed initially as a means of rationalizing the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. It is a useful resource that aids the functional classification of proteins. Almost 90% of the actinopterygii protein sequences from SWISS-PROT and TrEMBL can be classified using InterPro. Over 30% of the actinopterygii protein sequences currently in SWISS-PROT and TrEMBL are of mitochondrial origin, the majority of which belong to the cytochrome b/b6 family. InterPro also gives insights into the domain composition of the classified proteins and has applications in the functional classification of newly determined sequences lacking biochemical characterization, and in comparative genome analysis. A comparison of the actinopterygii protein sequences against the sequences of other eukaryotes confirms the high representation of eukaryotic protein kinase in the organisms studied. The comparisons also show that, based on InterPro families, the trans-species evolution of MHC class I and II molecules in mammals and teleost fish can be recognized.

18.
19.
Increasingly complex bioinformatic analysis is necessitated by the plethora of sequence information currently available. A total of 21 poxvirus genomes have now been completely sequenced and annotated, and many more genomes will be available in the next few years. First, we describe the creation of a database of continuously corrected and updated genome sequences and an easy-to-use and extremely powerful suite of software tools for the analysis of genomes, genes, and proteins. These tools are available free to all researchers and, in most cases, alleviate the need for using multiple Internet sites for analysis. Further, we describe the use of these programs to identify conserved families of genes (poxvirus orthologous clusters) and have named the software suite POCs, which is available at www.poxvirus.org. Using POCs, we have identified a set of 49 absolutely conserved gene families, that is, those which are conserved between the highly diverged families of insect-infecting entomopoxviruses and vertebrate-infecting chordopoxviruses. An additional set of 41 gene families conserved in chordopoxviruses was also identified. Thus, 90 genes are completely conserved in chordopoxviruses and comprise the minimum essential genome, and these will make excellent drug, antibody, vaccine, and detection targets. Finally, we describe the use of these tools to identify necessary annotation and sequencing updates in poxvirus genomes. For example, using POCs, we identified 19 genes that were widely conserved in poxviruses but missing from the vaccinia virus strain Tian Tan 1998 GenBank file. We have reannotated and resequenced fragments of this genome and verified that these genes are conserved in Tian Tan. The results for poxvirus genes and genomes are discussed in light of evolutionary processes.
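
The counting of conserved gene families described in the abstract above (49 families shared by all poxviruses plus 41 more shared by all chordopoxviruses, giving 90) amounts to set intersections over per-genome family lists. The sketch below shows that arithmetic only; it is not the POCs software, and the genome labels and family identifiers are hypothetical.

```python
def conserved_families(genomes):
    """Return the families present in every genome.

    genomes: dict mapping a genome name to the set of gene-family identifiers
    assigned to that genome.
    """
    family_sets = list(genomes.values())
    return set.intersection(*family_sets) if family_sets else set()


# Hypothetical family assignments for one entomopoxvirus and two chordopoxviruses.
entomopox = {"entomopox_1": {"F1", "F2", "F5"}}
chordopox = {
    "chordopox_1": {"F1", "F2", "F3", "F4"},
    "chordopox_2": {"F1", "F2", "F3"},
}

core_all = conserved_families({**entomopox, **chordopox})  # conserved in every genome
core_chordo = conserved_families(chordopox)                # conserved in chordopoxviruses
chordo_only = core_chordo - core_all                       # the additional chordopoxvirus set

print(sorted(core_all))     # ['F1', 'F2']
print(sorted(core_chordo))  # ['F1', 'F2', 'F3']
print(sorted(chordo_only))  # ['F3']
```

With real data, a family that is conserved but missing from one genome's annotation would drop out of the intersection, which is the kind of gap the abstract reports finding in the Tian Tan GenBank file.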

20.
The Toll/interleukin-1 receptor (TIR) domain is found in one of the two large families of homologues of plant disease resistance proteins (R proteins) in Arabidopsis and other dicotyledonous plants. In addition to these TIR-NBS-LRR (TNL) R proteins, we identified two families of TIR-containing proteins encoded in the Arabidopsis Col-0 genome. The TIR-X (TX) family of proteins lacks both the nucleotide-binding site (NBS) and the leucine rich repeats (LRRs) that are characteristic of the R proteins, while the TIR-NBS (TN) proteins contain much of the NBS, but lack the LRR. In Col-0, the TX family is encoded by 27 genes and three pseudogenes; the TN family is encoded by 20 genes and one pseudogene. Using massively parallel signature sequencing (MPSS), expression was detected at low levels for approximately 85% of the TN-encoding genes. Expression was detected for only approximately 40% of the TX-encoding genes, again at low levels. Physical map data and phylogenetic analysis indicated that multiple genomic duplication events have increased the numbers of TX and TN genes in Arabidopsis. Genes encoding TX, TN and TNL proteins were demonstrated in conifers; TX and TN genes are present in very low numbers in grass genomes. The expression, prevalence, and diversity of TX and TN genes suggests that these genes encode functional proteins rather than resulting from degradation or deletions of TNL genes. These TX and TN proteins could be plant analogues of small TIR-adapter proteins that function in mammalian innate immune responses such as MyD88 and Mal.
