Similar articles
Found 20 similar articles; search took 31 ms
1.
Histone and histone fold sequences and structures: a database.   Total citations: 4 (self-citations: 3, citations by others: 1)
A database of aligned histone protein sequences has been constructed based on the results of homology searches of the major public sequence databases. In addition, sequences of proteins identified as containing the histone fold motif and structures of all known histone and histone fold proteins have been included in the current release. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, and links to the Entrez integrated information retrieval system at the National Center for Biotechnology Information (NCBI). The database currently contains over 1000 protein sequences. All sequences and alignments in this database are available through the World Wide Web at http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES/.

2.
Restriction endonucleases and other nucleic acid cleaving enzymes form a large and extremely diverse superfamily that displays little sequence similarity despite retaining a common core fold responsible for cleavage. The lack of significant sequence similarity between protein families makes homology inference a challenging task and hinders new family identification with traditional sequence-based approaches. Using the consensus fold recognition method Meta-BASIC, which combines sequence profiles with predicted protein secondary structure, we identify nine new restriction endonuclease-like fold families among previously uncharacterized proteins and predict these proteins to cleave nucleic acid substrates. Application of transitive searches combined with gene neighborhood analysis allows us to confidently link these unknown families to a number of known restriction endonuclease-like structures and thus assign folds to the uncharacterized proteins. Finally, our method identifies a novel restriction endonuclease-like domain in the C-terminus of RecC that is not detected with structure-based searches of the existing PDB database.

3.
Many proteins involved in key biological processes are modular in nature. A group of these, the beta-propeller proteins, fold by packing 4-stranded beta-sheets in a circular array. The members of this group are increasingly numerous and, although their modular building blocks all preserve the same basic conformation, they do not have similar sequences. These proteins have extreme functional and phylogenetic diversity. Here, features of the beta-propeller fold are reviewed through comparisons of available structural coordinates. Structure-based sequence alignments combined with analyses of superpositions of individual modular units reveal conserved general features such as hydrogen bonds, beta-turns and positions of hydrophobic contacts. The lack of significant sequence identity is compensated by sets of interactions which stabilise the fold differently in distinct structures. Recurring aspartates make contacts to exposed backbone amides in turns or peptide connections within the same sheet. The sole factor responsible for the number of sheets that assemble in the array is the size of the hydrophobic residues that pack into the cores between the sheets. Whilst there is no overall sequence conservation, it may be possible to detect new members of this fold through sequence searches that take into account the repeated nature of the modular assembly as well as the positions of hydrophobic residues and H-bonding side chains.

4.
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like "linker" sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another one additionally containing designed sequences. Our results showed that augmented database searches gave up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be "plugged into" routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our richly statistically supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.

5.
Histone Sequence Database: new histone fold family members.   Total citations: 2 (self-citations: 0, citations by others: 2)
Searches of the major public protein databases with core and linker chicken and human histone sequences have resulted in the compilation of an annotated set of histone protein sequences. In addition, new database searches with two distinct motif search algorithms have identified several members of the histone fold family, including human DRAP1 and yeast CSE4. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, links to the Entrez integrated information retrieval system, structures for histone and histone fold proteins, and the ability to visualize structural data through Cn3D. The database currently contains >1000 protein sequences, which are searchable by protein type, accession number, organism name, or any other free text appearing in the definition line of the entry. All sequences and alignments in this database are available through the World Wide Web at http://www.nhgri.nih.gov/DIR/GTB/HISTONES or http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES

6.
Certain prokaryotic transport proteins similar to the lactose permease of Escherichia coli (LacY) have been identified by BLAST searches from available genomic databanks. These proteins exhibit conservation of amino acid residues that participate in sugar binding and H(+) translocation in LacY. Homology threading of prokaryotic transporters based on the X-ray structure of LacY (PDB ID: 1PV7) and sequence similarities reveals a common overall fold for sugar transporters belonging to the Major Facilitator Superfamily (MFS) and suggests new targets for study. Evolution-based searches for sequence similarities also identify eukaryotic proteins bearing striking resemblance to MFS sugar transporters. Like LacY, the eukaryotic proteins are predicted to have 12 transmembrane domains (TMDs), and many of the irreplaceable residues for sugar binding and H(+) translocation in LacY appear to be largely conserved. The overall size of the eukaryotic homologs is about twice that of prokaryotic permeases with longer N and C termini and loops between TMDs III-IV and VI-VII. The human gene encoding protein FLJ20160 consists of six exons located on more than 60,000 bp of DNA sequence and requires splicing to produce mature mRNA. Cellular localization predictions suggest membrane insertion with possible proteolysis at the N terminus, and expression studies with the human protein FLJ20160 demonstrate membrane insertion in both E. coli and Pichia pastoris. Widespread expression of the eukaryotic sugar transport candidates suggests an important role in cellular metabolism, particularly in brain and tumors. Homology is observed in the TMDs of both the eukaryotic and prokaryotic proteins that contain residues involved in sugar binding and H(+) translocation in LacY.

7.
A substantial fraction of protein sequences derived from genomic analyses is currently classified as representing 'hypothetical proteins of unknown function'. In part, this reflects the limitations of methods for comparison of sequences with very low identity. We evaluated the effectiveness of a Psi-BLAST search strategy to identify proteins of similar fold at low sequence identity. Psi-BLAST searches for structurally characterized low-sequence-identity matches were carried out on a set of over 300 proteins of known structure. Searches were conducted in NCBI's non-redundant database and were limited to three rounds. Some 614 potential homologs with 25% or lower sequence identity to 166 members of the search set were obtained. Disregarding the expect value, level of sequence identity and span of alignment, correspondence of fold between the target and potential homolog was found in more than 95% of the Psi-BLAST matches. Restrictions on expect value or span of alignment improved the false positive rate at the expense of eliminating many true homologs. Approximately three-quarters of the putative homologs obtained by three rounds of Psi-BLAST revealed no significant sequence similarity to the target protein upon direct sequence comparison by BLAST, and therefore could not be found by a conventional search. Although three rounds of Psi-BLAST identified many more homologs than a standard BLAST search, most homologs were undetected. It appears that more than 80% of all homologs to a target protein may be characterized by a lack of significant sequence similarity. We suggest that conservative use of Psi-BLAST has the potential to propose experimentally testable functions for the majority of proteins currently annotated as 'hypothetical proteins of unknown function'.

8.
GGDEF domain is homologous to adenylyl cyclase   Total citations: 21 (self-citations: 0, citations by others: 21)
Pei J, Grishin NV. Proteins, 2001, 42(2): 210-216
The GGDEF domain is detected in many prokaryotic proteins, most of which are of unknown function. Several bacteria carry 12-22 different GGDEF homologues in their genomes. Conducting extensive profile-based searches, we detect statistically supported sequence similarity between the GGDEF domain and the adenylyl cyclase catalytic domain. From this homology, we deduce that the prokaryotic GGDEF domain is a regulatory enzyme involved in nucleotide cyclization, with a fold similar to that of the eukaryotic cyclase catalytic domain. This prediction correlates with the functional information available on two GGDEF-containing proteins, namely diguanylate cyclase and phosphodiesterase A of Acetobacter xylinum, both of which regulate the turnover of cyclic diguanosine monophosphate. Domain architecture analysis shows that GGDEF is typically present in multidomain proteins containing regulatory domains of signaling pathways or protein-protein interaction modules. Evolutionary tree analysis indicates that the GGDEF/cyclase superfamily forms a large diversified cluster of orthologous proteins present in bacteria, archaea, and eukaryotes. Proteins 2001;42:210-216.

9.
Profile-based sequence search procedures are commonly employed to detect remote relationships between proteins. We provide an assessment of a Cascade PSI-BLAST protocol that rigorously employs intermediate sequences in detecting remote relationships between proteins. In this approach, a 'first generation' search queries a database with PSI-BLAST, which itself involves multiple rounds of iteration, to detect an initial set of homologues for a protein. We then propagate a 'second generation' search in the database, involving multiple runs of PSI-BLAST using each of the homologues identified in the previous generation as queries to recognize homologues not detected earlier. This non-directed search process can be viewed as an iteration of iterations that is continued to detect further homologues until no new hits are detectable. We present an assessment of the coverage of this 'cascaded' intermediate sequence search on diverse folds and find that searches for up to three generations detect most known homologues of a query. Our assessments show that this approach appears to perform better than the traditional use of PSI-BLAST by detecting 15% more relationships within a family and 35% more relationships within a superfamily. We show that such searches can be performed on generalized sequence databases and non-trivial relationships between proteins can be detected effectively. Such a propagation of searches maximizes the chances of detecting distant homologies by effectively scanning protein "fold space".
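The generation-by-generation propagation described in this abstract can be sketched as a breadth-first expansion. This is only an illustration under stated assumptions: `cascade_search` and its `search` argument are hypothetical names, and `search` stands in for one complete PSI-BLAST run (no BLAST program is invoked here); the default of three generations follows the figure quoted in the abstract.

```python
from typing import Callable, Iterable, List, Set


def cascade_search(
    query: str,
    search: Callable[[str], Iterable[str]],
    max_generations: int = 3,
) -> Set[str]:
    """Collect homologues by re-querying with every new hit (hypothetical sketch).

    `search` is an assumed stand-in for one full PSI-BLAST run: it takes a
    sequence identifier and returns the identifiers of its hits.
    """
    found: Set[str] = {query}      # everything seen so far, including the query
    frontier: List[str] = [query]  # queries for the current generation
    for _ in range(max_generations):
        next_frontier: List[str] = []
        for current in frontier:
            for hit in search(current):
                if hit not in found:
                    found.add(hit)
                    next_frontier.append(hit)
        if not next_frontier:      # stop when a generation finds nothing new
            break
        frontier = next_frontier
    found.discard(query)
    return found


# Toy relationship chain used only to exercise the sketch.
toy_hits = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["E"]}
three_generations = cascade_search("A", lambda q: toy_hits.get(q, []))
```

In the toy chain, a single generation reaches only `B`, while three generations also reach `C` and `D` through intermediate sequences.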

10.
Using sensitive structure similarity searches, we identify a shared alpha+beta fold, RAGNYA, principally involved in nucleic acid, nucleotide or peptide interactions in a diverse group of proteins. These include the Ribosomal proteins L3 and L1, ATP-grasp modules, the GYF domain, DNA-recombination proteins of the NinB family from caudate bacteriophages, the C-terminal DNA-interacting domain of the Y-family DNA polymerases, the uncharacterized enzyme AMMECR1, the siRNA silencing repressor of tombusviruses, tRNA Wybutosine biosynthesis enzyme Tyw3p, DNA/RNA ligases and related nucleotidyltransferases and the Enhancer of rudimentary proteins. This fold exhibits three distinct circularly permuted versions and is composed of an internal repeat of a unit with two strands and a helix. We show that despite considerable structural diversity in the fold, its representatives show a common mode of nucleic acid or nucleotide interaction via the exposed face of the sheet. Using this information and sensitive profile-based sequence searches: (1) we predict the active site, and mode of substrate interaction of the Wybutosine biosynthesis enzyme, Tyw3p, and a potential catalytic role for AMMECR1. (2) We provide insights regarding the mode of nucleic acid interaction of the NinB proteins, and the evolution of the active site of classical ATP-grasp enzymes and DNA/RNA ligases. (3) We also present evidence for a bacterial origin of the GYF domain and propose how this version of the fold might have been utilized in peptide interactions in the context of nucleoprotein complexes.

11.
In the fold recognition approach to structure prediction, a sequence is tested for compatibility with an already known fold. For membrane proteins, however, few folds have been determined experimentally. Here the feasibility of computing the vast majority of likely membrane protein folds is tested. The results indicate that conformation space can be effectively sampled for small numbers of helices. The vast majority of potential monomeric membrane protein structures can be represented by about 30 folds for three helices, but this number increases exponentially to about 1,500,000 folds for seven helices. The generated folds could serve as templates for fold recognition or as starting points for conformational searches that are well distributed throughout conformation space.
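To make the exponential growth concrete, the short calculation below derives the per-helix growth factor implied by the two figures quoted in this abstract (about 30 folds for three helices, about 1,500,000 for seven). It assumes a constant growth factor between those endpoints, which the abstract does not state; the interpolated values for four to six helices are therefore illustrative only.

```python
# Endpoints quoted in the abstract.
folds_three_helices = 30
folds_seven_helices = 1_500_000
added_helices = 7 - 3

# Assumption: the fold count grows by the same factor for each added helix.
growth_per_helix = (folds_seven_helices / folds_three_helices) ** (1 / added_helices)


def interpolated_folds(n_helices: int) -> float:
    """Fold count under the constant-growth assumption (illustrative only)."""
    return folds_three_helices * growth_per_helix ** (n_helices - 3)


for n in range(3, 8):
    print(n, round(interpolated_folds(n)))
```

Under this assumption each additional helix multiplies the number of folds by a factor of roughly 15.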

13.
Improving fold recognition without folds   Total citations: 4 (self-citations: 0, citations by others: 4)
The most reliable way to align two proteins of unknown structure is through sequence-profile and profile-profile alignment methods. If the structure for one of the two is known, fold recognition methods outperform purely sequence-based alignments. Here, we introduced a novel method that aligns generalised sequence and predicted structure profiles. Using predicted 1D structure (secondary structure and solvent accessibility) significantly improved over sequence-only methods, both in terms of correctly recognising pairs of proteins with different sequences and similar structures and in terms of correctly aligning the pairs. The scores obtained by our generalised scoring matrix followed an extreme value distribution; this yielded accurate estimates of the statistical significance of our alignments. We found that mistakes in 1D structure predictions correlated between proteins from different sequence-structure families. The impact of this surprising result was that our method succeeded in significantly out-performing sequence-only methods even without explicitly using structural information from any of the two. Since AGAPE also outperformed established methods that rely on 3D information, we made it available through. If we solved the problem of CPU-time required to apply AGAPE on millions of proteins, our results could also impact everyday database searches.

14.
Modeling the inherent flexibility of the protein backbone as part of computational protein design is necessary to capture the behavior of real proteins and is a prerequisite for the accurate exploration of protein sequence space. We present the results of a broad exploration of sequence space, with backbone flexibility, through a novel approach: large-scale protein design to structural ensembles. A distributed computing architecture has allowed us to generate hundreds of thousands of diverse sequences for a set of 253 naturally occurring proteins, allowing exciting insights into the nature of protein sequence space. Designing to a structural ensemble produces a much greater diversity of sequences than previous studies have reported, and homology searches using profiles derived from the designed sequences against the Protein Data Bank show that the relevance and quality of the sequences is not diminished. The designed sequences have greater overall diversity than corresponding natural sequence alignments, and no direct correlations are seen between the diversity of natural sequence alignments and the diversity of the corresponding designed sequences. For structures in the same fold, the sequence entropies of the designed sequences cluster together tightly. This tight clustering of sequence entropies within a fold and the separation of sequence entropy distributions for different folds suggest that the diversity of designed sequences is primarily determined by a structure's overall fold, and that the designability principle postulated from studies of simple models holds in real proteins. This has important implications for experimental protein design and engineering, as well as providing insight into protein evolution.

15.
We have performed computer searches in the database of known protein sequences for proteins similar in sequence to bacteriophage regulatory proteins of known 3-D structure. The searches are more selective than other methods due to the use of a length-dependent threshold in sequence similarity, above which structural homology is implied with high certainty. Two probable DNA binding proteins were identified which are predicted to have a three-dimensional structure very similar to bacteriophage cro and repressor proteins. Approximate three-dimensional model coordinates are available from the authors. Both proteins contain the helix-turn-helix sequence motif typical of a wide class of DNA binding proteins and their function is deduced by analogy to sequence-similar proteins of known function. We predict that the Y.Smal protein in the restriction-modification enzyme gene locus of the enterobacterium Serratia marcescens is a regulator of endonuclease expression, and that the vegetative specific gene VSH7 of the slime mold Dictyostelium discoideum codes for a regulator of gene expression specific for the slime mold growth phase before the onset of the developmental program. Point mutations that would have a strong effect on growth regulation phenotype are suggested. The VSH7 protein would be the first eukaryotic representative of the cro/phage repressor class.

16.
17.
The accelerating growth in the number of protein sequences taxes both the computational and manual resources needed to analyze them. One approach to dealing with this problem is to minimize the number of proteins subjected to such analysis in a way that minimizes loss of information. To this end we have developed a set of Representative Proteomes (RPs), each selected from a Representative Proteome Group (RPG) containing similar proteomes calculated based on co-membership in UniRef50 clusters. A Representative Proteome is the proteome that can best represent all the proteomes in its group in terms of the majority of the sequence space and information. RPs at 75%, 55%, 35% and 15% co-membership thresholds (CMT) are provided to allow users to decrease or increase the granularity of the sequence space based on their requirements. We find that a CMT of 55% (RP55) most closely follows standard taxonomic classifications. Further analysis of this set reveals that sequence space is reduced by more than 80% relative to UniProtKB, while retaining both sequence diversity (over 95% of InterPro domains) and annotation information (93% of experimentally characterized proteins). All sets can be browsed and are available for sequence similarity searches and download at http://www.proteininformationresource.org/rps, while the set of 637 RPs determined using a 55% CMT is also available for text searches. Potential applications include sequence similarity searches, protein classification and targeted protein annotation and characterization.

18.
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is to rationally select proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we obtain a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational-statistical method to determine novel superfamilies for structural genomics projects is also discussed.

19.
Peptidase family U34 consists of enzymes with unclear catalytic mechanism, for instance, dipeptidase A from Lactobacillus helveticus. Using extensive sequence similarity searches, we infer that U34 family members are homologous to penicillin V acylases (PVA) and thus potentially adopt the N-terminal nucleophile (Ntn) hydrolase fold. Comparative sequence and structural analysis reveals a cysteine as the catalytic nucleophile as well as other conserved residues important for catalysis. The PVA/U34 family is variable in sequence and exhibits great diversity in substrate specificity, including enzymes such as choloylglycine hydrolases, acid ceramidases, isopenicillin N acyltransferases, and a subgroup of eukaryotic proteins with unclear function.

20.
Over the past two decades, many ingenious efforts have been made in protein remote homology detection. Because homologous proteins often diversify extensively in sequence, it is challenging to demonstrate such relatedness through entirely sequence-driven searches. Here, we describe a computational method for the generation of 'protein-like' sequences that serves to bridge gaps in protein sequence space. Sequence profile information, as embodied in a position-specific scoring matrix of multiply aligned sequences of bona fide family members, serves as the starting point in this algorithm. The observed amino acid propensity and the selection of a random number dictate the selection of a residue for each position in the sequence. In a systematic manner, and by applying a 'roulette-wheel' selection approach at each position, we generate parent family-like sequences and thus facilitate an enlargement of sequence space around the family. When generated for a large number of families, we demonstrate that they expand the utility of natural intermediately related sequences in linking distant proteins. In 91% of the assessed examples, inclusion of designed sequences improved fold coverage by 5-10% over searches made in their absence. Furthermore, with several examples from proteins adopting folds such as TIM, globin, lipocalin and others, we demonstrate that the success of including designed sequences in a database positively sensitized methods such as PSI-BLAST and Cascade PSI-BLAST and is a promising opportunity for enormously improved remote homology recognition using sequence information alone.
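The 'roulette-wheel' step in this abstract, in which the observed amino acid propensity and a random number together pick the residue at each position, can be sketched as weighted sampling over a profile. This is a minimal illustration, not the authors' implementation: the function names, the toy profile, and the representation of a profile as one propensity dictionary per position are all assumptions.

```python
import random
from typing import Dict, List


def roulette_pick(propensities: Dict[str, float], rng: random.Random) -> str:
    """Pick one residue with probability proportional to its propensity."""
    total = sum(propensities.values())
    if total <= 0:
        raise ValueError("position has no positive propensity")
    spin = rng.random() * total        # where the wheel stops, in [0, total)
    cumulative = 0.0
    chosen = ""
    for residue, weight in propensities.items():
        cumulative += weight
        chosen = residue
        if spin < cumulative:          # the spin landed in this residue's slice
            break
    return chosen                      # falls back to the last residue on round-off


def design_sequence(profile: List[Dict[str, float]], rng: random.Random) -> str:
    """Generate one 'protein-like' sequence, one roulette spin per position."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Hypothetical three-position profile (propensities need not sum to 1).
toy_profile = [{"A": 1.0}, {"G": 0.5, "V": 0.5}, {"L": 3.0, "I": 1.0}]
designed = design_sequence(toy_profile, random.Random(0))
```

Calling `design_sequence` repeatedly with the same profile yields different family-like sequences, with frequently observed residues appearing more often at each position.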
