Similar articles
1.
Structural genomics strives to represent the entire protein space. The first step towards achieving this goal is to rationally select proteins whose structures have not been determined, but that represent an as yet unknown structural superfamily or fold. Once such a structure is solved, it can be used as a template for modelling homologous proteins. This will aid in unveiling the structural diversity of the protein space. Currently, no reliable method for accurate 3D structural prediction is available when a sequence or a structure homologue is not available. Here we present a systematic methodology for selecting target proteins whose structure is likely to adopt a new, as yet unknown superfamily or fold. Our method takes advantage of a global classification of the sequence space as presented by ProtoNet-3D, which is a hierarchical agglomerative clustering of the proteins of interest (the proteins in Swiss-Prot) along with all solved structures (taken from the PDB). By navigating in the scaffold of ProtoNet-3D, we obtain a prioritized list of proteins that are not yet structurally solved, along with the probability of each of the proteins belonging to a new superfamily or fold. The sorted list has been self-validated against real structural data that was not available when the predictions were made. The practical application of using our computational-statistical method to determine novel superfamilies for structural genomics projects is also discussed.

2.
MOTIVATION: Protein families can be defined based on structure or sequence similarity. We wanted to compare two protein family databases, one based on structural and one on sequence similarity, to investigate to what extent they overlap, the similarity in definition of corresponding families, and to create a list of large protein families with unknown structure as a resource for structural genomics. We also wanted to increase the sensitivity of fold assignment by exploiting protein family HMMs. RESULTS: We compared Pfam, a protein family database based on sequence similarity, to Scop, which is based on structural similarity. We found that 70% of the Scop families exist in Pfam while 57% of the Pfam families exist in Scop. Most families that occur in both databases correspond well to each other, but in some cases they are different. Such cases highlight situations in which structure and sequence approaches differ significantly. The comparison enabled us to compile a list of the largest families that do not occur in Scop; these are suitable targets for structure prediction and determination, and may be useful to guide projects in structural genomics. It can be noted that 13 out of the 20 largest protein families without a known structure are likely transmembrane proteins. We also exploited Pfam to increase the sensitivity of detecting homologs of proteins with known structure, by comparing query sequences to Pfam HMMs that correspond to Scop families. For SWISSPROT+TREMBL, this yielded an increase in fold assignment from 31% to 42% compared to using FASTA only. This method assigned a structure to 22% of the proteins in Saccharomyces cerevisiae, 24% in Escherichia coli, and 16% in Methanococcus jannaschii.
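
The two percentages in the abstract above (70% of Scop families found in Pfam, 57% of Pfam families found in Scop) differ because overlap is measured separately in each direction. The following is only a minimal sketch of that bookkeeping, not the authors' procedure: the function name `directional_overlap`, the family identifiers, and the matched pairs are all hypothetical.

```python
def directional_overlap(families_a, families_b, matches):
    """Return (fraction of A with a counterpart in B,
               fraction of B with a counterpart in A).

    families_a, families_b: sets of family identifiers (hypothetical).
    matches: iterable of (a, b) pairs judged to correspond (hypothetical).
    """
    # Ignore pairs that mention a family outside either database.
    valid = [(a, b) for a, b in matches if a in families_a and b in families_b]
    matched_a = {a for a, _ in valid}
    matched_b = {b for _, b in valid}
    return len(matched_a) / len(families_a), len(matched_b) / len(families_b)


# Toy data: two structural families map onto the same sequence family,
# so the two directions give different fractions.
structural = {"s1", "s2", "s3", "s4"}
sequence_based = {"p1", "p2", "p3", "p4", "p5"}
pairs = {("s1", "p1"), ("s2", "p1"), ("s3", "p2")}

frac_structural, frac_sequence = directional_overlap(structural, sequence_based, pairs)
```

In the toy data, three of four structural families have a counterpart but only two of five sequence families do, which shows how many-to-one correspondences and different database sizes produce asymmetric percentages.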

3.
McGuffin LJ, Jones DT. Proteins 2002, 48(1): 44-52
The ultimate goal of structural genomics is to obtain the structure of each protein coded by each gene within a genome to determine gene function. Because of cost and time limitations, it remains impractical to solve the structure for every gene product experimentally. Up to a point, reasonably accurate three-dimensional structures can be deduced for proteins with homologous sequences by using comparative modeling. Beyond this, fold recognition or threading methods can be used for proteins showing little homology to any known fold, although this is relatively time-consuming and limited by the library of template folds currently available. Therefore, it is appropriate to develop methods that can increase our knowledge base, expanding our fold libraries by earmarking potentially “novel” folds for experimental structure determination. How can we sift through proteomic data rapidly and yet reliably identify novel folds as targets for structural genomics? We have analyzed a number of simple methods that discriminate between “novel” and “known” folds. We propose that simple alignments of secondary structure elements using predicted secondary structure could potentially be a more selective method than both a simple fold recognition method (GenTHREADER) and standard sequence alignment at finding novel folds when sequences show no detectable homology to proteins with known structures. Proteins 2002;48:44–52. © 2002 Wiley-Liss, Inc.

4.
The rapid growth in protein structural data and the emergence of structural genomics projects have increased the need for automatic structure analysis and tools for function prediction. Small molecule recognition is critical to the function of many proteins; therefore, determination of ligand binding site similarity is important for understanding ligand interactions and may allow their functional classification. Here, we present a binding sites database (SitesBase) that, given a known protein-ligand binding site, allows rapid retrieval of other binding sites with similar structure independent of overall sequence or fold similarity. However, each match is also annotated with sequence similarity and fold information to aid interpretation of structure and functional similarity. Similarity in ligand binding sites can indicate common binding modes and recognition of similar molecules, allowing potential inference of function for an uncharacterised protein or providing additional evidence of common function where sequence or fold similarity is already known. Alternatively, the resource can provide valuable information for detailed studies of molecular recognition including structure-based ligand design and in understanding ligand cross-reactivity. Here, we show examples of atomic similarity between superfamily or more distant fold relatives as well as between seemingly unrelated proteins. Assignment of unclassified proteins to structural superfamilies is also undertaken and in most cases substantiates assignments made using sequence similarity. Correct assignment is also possible where sequence similarity fails to find significant matches, illustrating the potential use of binding site comparisons for newly determined proteins.

5.
Structural genomics and its importance for gene function analysis   Cited by: 8 (self-citations: 0, citations by others: 8)
Structural genomics projects aim to solve the experimental structures of all possible protein folds. Such projects entail a conceptual shift from traditional structural biology in which structural information is obtained on known proteins to one in which the structure of a protein is determined first and the function assigned only later. Whereas the goal of converting protein structure into function can be accomplished by traditional sequence motif-based approaches, recent studies have shown that assignment of a protein's biochemical function can also be achieved by scanning its structure for a match to the geometry and chemical identity of a known active site. Importantly, this approach can use low-resolution structures provided by contemporary structure prediction methods. When applied to genomes, structural information (either experimental or predicted) is likely to play an important role in high-throughput function assignment.

6.
One of the major goals of structural genomics projects is to determine the three-dimensional structure of representative members of as many different fold families as possible. Comparative modeling is expected to fill the remaining gaps by providing structural models of homologs of the experimentally determined proteins. However, for such an approach to be successful it is essential that the quality of the experimentally determined structures is adequate. In an attempt to build a homology model for the protein dynein light chain 2A (DLC2A) we found two potential templates, both experimentally determined nuclear magnetic resonance (NMR) structures originating from structural genomics efforts. Despite their high sequence identity (96%), the folds of the two structures are markedly different. This urged us to perform in-depth analyses of both structure ensembles and the deposited experimental data, the results of which clearly identify one of the two models as largely incorrect. Next, we analyzed the quality of a large set of recent NMR-derived structure ensembles originating from both structural genomics projects and individual structure determination groups. Unfortunately, a visual inspection of structures exhibiting lower quality scores than DLC2A reveals that the seriously flawed DLC2A structure is not an isolated incident. Overall, our results illustrate that the quality of NMR structures cannot be reliably evaluated using only traditional experimental input data and overall quality indicators as a reference and clearly demonstrate the urgent need for a tight integration of more sophisticated structure validation tools in NMR structure determination projects. In contrast to common methodologies where structures are typically evaluated as a whole, such tools should preferentially operate on a per-residue basis.

7.
Kinch LN, Grishin NV. Proteins 2002, 48(1): 75-84
Nitrogen regulatory (PII) proteins are signal transduction molecules involved in controlling nitrogen metabolism in prokaryotes. PII proteins integrate the signals of intracellular nitrogen and carbon status into the control of enzymes involved in nitrogen assimilation. Using elaborate sequence similarity detection schemes, we show that five clusters of orthologs (COGs) and several small divergent protein groups belong to the PII superfamily and predict their structure to be a (βαβ)₂ ferredoxin-like fold. Proteins from the newly emerged PII superfamily are present in all major phylogenetic lineages. The PII homologs are quite diverse, with below random (as low as 1%) pairwise sequence identities between some members of distant groups. Despite this sequence diversity, evidence suggests that the different subfamilies retain the PII trimeric structure important for ligand-binding site formation and maintain a conservation of conservations at residue positions important for PII function. Because most of the orthologous groups within the PII superfamily are composed entirely of hypothetical proteins, our remote homology-based structure prediction provides the only information about them. Analogous to structural genomics efforts, such prediction gives clues to the biological roles of these proteins and allows us to hypothesize about locations of functional sites on model structures or rationalize about available experimental information. For instance, conserved residues in one of the families map in close proximity to each other on PII structure, allowing for a possible metal-binding site in the proteins coded by the locus known to affect sensitivity to divalent metal ions. The presented analysis pushes the limits of sequence similarity searches and exemplifies one of the extreme cases of reliable sequence-based structure prediction. In conjunction with structural genomics efforts to shed light on protein function, our strategies make it possible to detect homology between highly diverse sequences and are aimed at understanding the most remote evolutionary connections in the protein world.

8.
MOTIVATION: A method for recognizing the three-dimensional fold from the protein amino acid sequence based on a combination of hidden Markov models (HMMs) and secondary structure prediction was recently developed for proteins in the Mainly-Alpha structural class. Here, this methodology is extended to Mainly-Beta and Alpha-Beta class proteins. Compared to other fold recognition methods based on HMMs, this approach is novel in that only secondary structure information is used. Each HMM is trained from known secondary structure sequences of proteins having a similar fold. Secondary structure prediction is performed for the amino acid sequence of a query protein. The predicted fold of a query protein is the fold described by the model fitting the predicted sequence the best. RESULTS: After model cross-validation, the success rate on 44 test proteins covering the three structural classes was found to be 59%. On seven fold predictions performed prior to the publication of experimental structure, the success rate was 71%. In conclusion, this approach manages to capture important information about the fold of a protein embedded in the length and arrangement of the predicted helices, strands and coils along the polypeptide chain. When a more extensive library of HMMs representing the universe of known structural families is available (work in progress), the program will allow rapid screening of genomic databases and sequence annotation when fold similarity is not detectable from the amino acid sequence. AVAILABILITY: FORESST web server at http://absalpha.dcrt.nih.gov:8008/ for the library of HMMs of structural families used in this paper. FORESST web server at http://www.tigr.org/ for a more extensive library of HMMs (work in progress). CONTACT: valedf@tigr.org; munson@helix.nih.gov; garnier@helix.nih.gov

9.
Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single-linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families. Proteins 2006. © 2006 Wiley-Liss, Inc.
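
Single-linkage clustering, as named in the abstract above, places two sequences in the same cluster whenever a chain of accepted pairwise alignments connects them. The following is a minimal sketch of that idea using a union-find structure, not the CAMPS pipeline itself: the function names, the `(a, b, coverage)` tuple layout, and the toy data are assumptions made for illustration, and only the 40% coverage cutoff comes from the abstract.

```python
from collections import defaultdict


def accepted_links(alignments, min_tmh_coverage=0.40):
    """Keep pairs whose alignment covers at least the given fraction of TMHs.

    alignments: iterable of (protein_a, protein_b, coverage) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [(a, b) for a, b, coverage in alignments if coverage >= min_tmh_coverage]


def single_linkage_clusters(items, links):
    """Group items into connected components of the link graph."""
    parent = {item: item for item in items}

    def find(x):
        # Walk up to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for item in items:
        clusters[find(item)].add(item)
    return list(clusters.values())


# Toy data: p1-p2 and p2-p3 pass the cutoff, p3-p4 does not.
proteins = ["p1", "p2", "p3", "p4"]
toy_alignments = [("p1", "p2", 0.90), ("p2", "p3", 0.50), ("p3", "p4", 0.20)]
clusters = single_linkage_clusters(proteins, accepted_links(toy_alignments))
```

Here p1 and p3 end up in one cluster although they are never aligned directly, which is the defining property of single linkage. The coverage estimate in the abstract is then plain arithmetic: 266 clusters minus the 24 with known structures leaves 242 clusters without one.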

10.
Large-scale sequencing projects are widening the gap between the known protein universe and the fraction for which structural information has been experimentally obtained. Through the application of homology (comparative) modeling and more general structure prediction techniques, this gap can, however, be narrowed, providing indirect structural information for a considerable number of proteins. Moreover, the estimated number of existing protein folds seems to be limited and many of these yet unknown folds should be discovered by dedicated large-scale structural genomics projects. Within this perspective, homology (comparative) modeling will gain in importance, as will the use of models derived by this technique. Here we discuss how well a sequence alignment, the most common starting point for generating a model, reflects the structural conservation between homologous proteins and we show that sequence information is able to direct construction of acceptable models as far as the structural core is concerned. We also show here that the regions surrounding insertions and deletions are much less conserved than the core and discuss the implications of this observation for loop modeling.

11.
A number of recent advances have been made in deriving function information from protein structure. A fold relationship to an already characterized protein will often allow general information about function to be deduced. More detailed information can be obtained using sequence relationships to already studied proteins. Methods of deducing function directly from structure, without the use of evolutionary relationships, are developing rapidly. All such methods may be used with models of protein structure, rather than with experimentally determined ones, but model accuracy imposes limitations. The rapid expansion of the structural genomics field has created a new urgency for improved methods of structure-based annotation of function.

12.
Structural genomics (or proteomics) activities are critically dependent on the availability of high-throughput structure determination methodology. Development of such methodology has been a particular challenge for NMR-based structure determination because of the demands for isotopic labeling of proteins and the requirements for very long data acquisition times. We present here a methodology that gains efficiency from a focus on determination of backbone structures of proteins as opposed to full structures with all sidechains in place. This focus is appropriate given the presumption that many protein structures in the future will be built using computational methods that start from representative fold family structures and replace as many as 70% of the sidechains in the course of structure determination. The methodology we present is based primarily on residual dipolar couplings (RDCs), readily accessible NMR observables that constrain the orientation of backbone fragments irrespective of separation in space. A new software tool is described for the assembly of backbone fragments under RDC constraints and an application to a structural genomics target is presented. The target is an 8.7 kDa protein from Pyrococcus furiosus, PF1061, that was previously not well annotated, and had a nearest structurally characterized neighbor with only 33% sequence identity. The structure produced shows structural similarity to this sequence homologue, but also shows similarity to other proteins, which suggests a functional role in sulfur transfer. Given the backbone structure and a possible functional link, this should be an ideal target for development of modeling methods.

13.
Structural genomics projects require strategies for rapidly recognizing protein sequences appropriate for routine structure determination. For large proteins, this strategy includes the dissection of proteins into structural domains that form stable native structures. However, protein dissection essentially remains an empirical and often a tedious process. Here, we describe a simple strategy for rapidly identifying structural domains and assessing their structures. This approach combines the computational prediction of sequence regions corresponding to putative domains with an experimental assessment of their structures and stabilities by NMR and biochemical methods. We tested this approach with nine putative domains predicted from a set of 108 Thermus thermophilus HB8 sequences using PASS, a domain prediction program we previously reported. To facilitate the experimental assessment of the domain structures, we developed a generic 6-hour His-tag-based purification protocol, which enables the sample quality evaluation of a putative structural domain in a single day. As a result, we observed that half of the predicted structural domains were indeed natively folded, as judged by their HSQC spectra. Furthermore, two of the natively folded domains were novel, without related sequences classified in the Pfam and SMART databases, which is a significant result with regard to the ability of structural genomics projects to uniformly cover the protein fold space.

14.
Structural genomics projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular-shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide an efficient support to structural genomics projects.
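
The abstract above describes plotting the number of similar sequences at each residue of a query and then looking for rectangular elevations in that plot. The sketch below illustrates the general idea with a deliberately simple stand-in: it counts how many hit ranges cover each residue and reports runs where the count stays at or above a threshold. The threshold rule, the function names, and the toy hit ranges are assumptions for illustration and are not the authors' algorithm or parameters.

```python
def coverage_profile(query_length, hit_ranges):
    """Count, for each residue of the query, how many hits cover it.

    hit_ranges: 1-based inclusive (start, end) positions on the query,
    with 1 <= start <= end <= query_length (hypothetical input layout).
    """
    # Difference array: +1 where a hit starts, -1 just past where it ends.
    diff = [0] * (query_length + 2)
    for start, end in hit_ranges:
        diff[start] += 1
        diff[end + 1] -= 1

    profile = []
    running = 0
    for position in range(1, query_length + 1):
        running += diff[position]
        profile.append(running)
    return profile


def elevated_segments(profile, min_hits, min_length):
    """Return 1-based inclusive (start, end) runs with count >= min_hits
    that span at least min_length residues."""
    segments = []
    run_start = None
    for position, count in enumerate(profile, start=1):
        if count >= min_hits:
            if run_start is None:
                run_start = position
        elif run_start is not None:
            if position - run_start >= min_length:
                segments.append((run_start, position - 1))
            run_start = None
    # Close a run that extends to the last residue.
    if run_start is not None and len(profile) - run_start + 1 >= min_length:
        segments.append((run_start, len(profile)))
    return segments


# Toy query of 10 residues with four hit ranges.
toy_profile = coverage_profile(10, [(2, 5), (3, 6), (3, 5), (9, 10)])
toy_segments = elevated_segments(toy_profile, min_hits=2, min_length=3)
```

For the toy data the profile is `[0, 1, 3, 3, 3, 1, 0, 0, 1, 1]`, and the only run with at least two hits over at least three residues is positions 3 to 5. A real implementation would also need to handle noisy edges and overlapping plateaus, which this sketch ignores.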

15.
This paper evaluates the results of a protein structure prediction contest. The predictions were made using threading procedures, which employ techniques for aligning sequences with 3D structures to select the correct fold of a given sequence from a set of alternatives. Nine different teams submitted 86 predictions, on a total of 21 target proteins with little or no sequence homology to proteins of known structure. The 3D structures of these proteins were newly determined by experimental methods, but not yet published or otherwise available to the predictors. The predictions, made from the amino acid sequence alone, thus represent a genuine test of the current performance of threading methods. Only a subset of all the predictions is evaluated here. It corresponds to the 44 predictions submitted for the 11 target proteins seen to adopt known folds. The predictions for the remaining 10 proteins were not analyzed, although weak similarities with known folds may also exist in these proteins. We find that threading methods are capable of identifying the correct fold in many cases, but not reliably enough as yet. Every team predicts correctly a different set of targets, with virtually all targets predicted correctly by at least one team. Also, common folds such as TIM barrels are recognized more readily than folds with only a few known examples. However, quite surprisingly, the quality of the sequence-structure alignments, corresponding to correctly recognized folds, is generally very poor, as judged by comparison with the corresponding 3D structure alignments. Thus, threading can presently not be relied upon to derive a detailed 3D model from the amino acid sequence. This raises a very intriguing question: how is fold recognition achieved? Our analysis suggests that it may be achieved because threading procedures maximize hydrophobic interactions in the protein core, and are reasonably good at recognizing local secondary structure. © 1995 Wiley-Liss, Inc.

16.
Assigning function to structures is an important aspect of structural genomics projects, since they frequently provide structures for uncharacterized proteins. Similarities uncovered by structure alignment can suggest a similar function, even in the absence of sequence similarity. For proteins adopting novel folds or those with many functions, this strategy can fail, but functional clues can still come from comparison of local functional sites involving a few key residues. Here we assess the general applicability of functional site comparison through the study of 157 proteins solved by structural genomics initiatives. For 17, the method bolsters confidence in predictions made based on overall fold similarity. For another 12 with new folds, it suggests functions, including a putative phosphotyrosine binding site in the Archaeal protein Mth1187 and an active site for a ribose isomerase. The approach is applied weekly to all new structures, providing a resource for those interested in using structure to infer function.

17.
Structural genomics (also known as structural proteomics) aims to generate accurate three-dimensional models for all folded, globular proteins and domains in the protein universe to understand the relationship between protein sequence, structure and function. NMR spectroscopy of small (<20 kDa) proteins has been used successfully within several large-scale structural genomics projects for more than six years now. Recent advances coming from traditional NMR structural biology laboratories as well as large-scale centers and consortia using NMR for structural genomics promise to facilitate NMR analysis, making it an even more efficient and increasingly automated procedure.

18.
Over the next few years, various genome projects will sequence many new genes and yield many new gene products. Many of these products will have no known function and little, if any, sequence homology to existing proteins. There is reason to believe that a rapid determination of a protein fold, even at low resolution, can aid in the identification of function and expedite the determination of structure at higher resolution. Recently devised NMR methods of measuring residual dipolar couplings provide one route to the determination of a fold. They do this by allowing the alignment of previously identified secondary structural elements with respect to each other. When combined with constraints involving loops connecting elements or other short-range experimental distance information, a fold is produced. We illustrate this approach to protein fold determination on ¹⁵N-labeled Escherichia coli acyl carrier protein using a limited set of ¹⁵N-¹H and ¹H-¹H dipolar couplings. We also illustrate an approach using a more extended set of heteronuclear couplings on a related protein, ¹³C,¹⁵N-labeled NodF protein from Rhizobium leguminosarum.

19.
There is currently a gap in knowledge between complexes of known three-dimensional structure and those known from other experimental methods such as affinity purifications or the two-hybrid system. This gap can sometimes be bridged by methods that extrapolate interaction information from one complex structure to homologues of the interacting proteins. To do this, it is important to know if and when proteins of the same type (e.g. family, superfamily or fold) interact in the same way. Here, we study interactions of known structure to address this question. We found all instances within the structural classification of proteins database of the same domain pairs interacting in different complexes, and then compared them with a simple measure (interaction RMSD). When plotted against sequence similarity we find that close homologues (30-40% or higher sequence identity) almost invariably interact the same way. Conversely, similarity only in fold (i.e. without additional evidence for a common ancestor) is only rarely associated with a similarity in interaction. The results suggest that there is a twilight zone of sequence similarity where it is not possible to say whether or not domains will interact similarly. We also discuss the rare instances of fold similarities interacting the same way, and those where obviously homologous proteins interact differently.

20.
Structural genomics (or proteomics) activities are critically dependent on the availability of high-throughput structure determination methodology. Development of such methodology has been a particular challenge for NMR-based structure determination because of the demands for isotopic labeling of proteins and the requirements for very long data acquisition times. We present here a methodology that gains efficiency from a focus on determination of backbone structures of proteins as opposed to full structures with all sidechains in place. This focus is appropriate given the presumption that many protein structures in the future will be built using computational methods that start from representative fold family structures and replace as many as 70% of the sidechains in the course of structure determination. The methodology we present is based primarily on residual dipolar couplings (RDCs), readily accessible NMR observables that constrain the orientation of backbone fragments irrespective of separation in space. A new software tool is described for the assembly of backbone fragments under RDC constraints and an application to a structural genomics target is presented. The target is an 8.7 kDa protein from Pyrococcus furiosus, PF1061, that was previously not well annotated, and had a nearest structurally characterized neighbor with only 33% sequence identity. The structure produced shows structural similarity to this sequence homologue, but also shows similarity to other proteins, which suggests a functional role in sulfur transfer. Given the backbone structure and a possible functional link, this should be an ideal target for development of modeling methods. This revised version was published online in March 2005 with corrections to the references.
