Similar articles
Found 20 similar articles; search took 31 ms.
1.
Given the massive increase in the number of new sequences and structures, a critical problem is how to integrate these raw data into meaningful biological information. One approach, the Evolutionary Trace, or ET, uses phylogenetic information to rank the residues in a protein sequence by evolutionary importance and then maps those ranked at the top onto a representative structure. If these residues form structural clusters, they can identify functional surfaces such as those involved in molecular recognition. Now that a number of examples have shown that ET can identify binding sites and focus mutational studies on their relevant functional determinants, we ask whether the method can be improved so as to be applicable on a large scale. To address this question, we introduce a new treatment of gaps resulting from insertions and deletions, which streamlines the selection of sequences used as input. We also introduce objective statistics to assess the significance of the total number of clusters and of the size of the largest one. As a result of the novel treatment of gaps, ET performance improves measurably. We find evolutionarily privileged clusters that are significant at the 5% level in 45 out of 46 (98%) proteins drawn from a variety of structural classes and biological functions. In 37 of the 38 proteins for which a protein-ligand complex is available, the dominant cluster contacts the ligand. We conclude that spatial clustering of evolutionarily important residues is a general phenomenon, consistent with the cooperative nature of residues that determine structure and function. In practice, these results suggest that ET can be applied on a large scale to identify functional sites in a significant fraction of the structures in the Protein Data Bank (PDB). This approach to combining raw sequences and structure to obtain detailed insights into the molecular basis of function should prove valuable in the context of the Structural Genomics Initiative.

2.
We describe a novel approach for inferring functional relationships between proteins by detecting sequence and spatial patterns of protein surfaces. Well-formed concave surface regions in the form of pockets and voids are examined to identify similarity relationships that might be directly related to protein function. We first exhaustively identify and analytically measure all 910,379 surface pockets and interior voids on 12,177 protein structures from the Protein Data Bank. The similarity of the patterns of residues forming pockets and voids is then assessed in sequence, in spatial arrangement, and in orientational arrangement. Statistical significance in the form of E- and p-values is then estimated for each of the three types of similarity measurement. Our method is fully automated, requires no human intervention, and can be used without input of query patterns. It does not assume any prior knowledge of the functional residues of a protein, and can detect similarity based on surface patterns both small and large. It also tolerates, to some extent, conformational flexibility of functional sites. We show with examples that this method can detect functional relationships with specificity for members of the same protein family and superfamily, as well as remotely related functional surfaces from proteins of different folds. We envision that this method can be used for discovering novel functional relationships between protein surfaces, for functional annotation of protein structures with unknown biological roles, and for further inquiries into the evolutionary origins of structural elements important for protein function.

3.

Background  

Proteins that evolve from a common ancestor can change functionality over time, and it is important to be able to identify the residues that cause this change. In this paper we show how a supervised multivariate statistical method, Between Group Analysis (BGA), can be used to identify these residues from families of proteins with different substrate specificities using multiple sequence alignments.

4.
Wang C  Ye M  Han G  Chen R  Zhang M  Jiang X  Cheng K  Wang F  Zou H 《Proteomics》2011,11(17):3578-3581
Consensus sequences spanning multiple residues (motifs) in proteins are closely related to protein function. However, there is no effective method for the targeted analysis of such proteins. The challenge in analyzing these classes of proteins by MS is how to selectively enrich peptides containing the consensus sequence from a protein digest. Although enrichment of peptides containing one type of amino acid residue has been achieved by chemical labeling followed by chromatographic isolation, it is almost impossible to label and isolate signature peptides containing multiple residues with a consensus sequence by a chemical approach. Herein, we developed an enzymatic approach, based on the specific recognition between an enzyme and its substrates, to enrich such peptides. The approach modifies a residue in the consensus sequence using an enzyme that recognizes the sequence and then isolates the modified peptides. cAMP-dependent protein kinase was used to validate this approach, and 168 peptides containing the consensus motif were identified with a selectivity of 67.2%. Those peptides resulted in the identification of 88 proteins with the consensus sequence from a serum sample. As this motif-oriented peptide enrichment approach allows targeted analysis of a subset of proteins with a consensus sequence, it will have broad application in biological studies.

5.
DNA-binding proteins are crucial for various cellular processes and hence have become an important target for both basic research and drug development. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to establish an automated method for rapidly and accurately identifying DNA-binding proteins based on their sequence information alone. Because all biological species have developed from a very limited number of ancestral species, it is important to take evolutionary information into account when developing such a high-throughput tool. In view of this, a new predictor was proposed by incorporating the evolutionary information into the general form of pseudo amino acid composition via the top-n-gram approach. Comparison of the new predictor with existing methods, via both a jackknife test and an independent data-set test, showed that the new predictor outperformed its counterparts. It is anticipated that the new predictor may become a useful vehicle for identifying DNA-binding proteins. It has not escaped our notice that the novel approach of extracting evolutionary information into the formulation of statistical samples can be used to identify many other protein attributes as well.

6.
This study describes the further extension of the resonant recognition model for the analysis and prediction of protein-protein and protein-DNA structure/function dependencies. The model is based on the significant correlation between spectra of numerical presentations of the amino acid or nucleotide sequences of proteins and their coded biological activity. According to this physico-mathematical method, it is possible to define amino acids in the sequence which are predicted to be the most critical for protein function. Using sperm whale myoglobin, human hemoglobin and hen egg white lysozyme as model protein examples, sets of predicted amino acids, or so-called 'hot spots', have been identified within the tertiary structure. It was found for each protein that the predicted 'hot spots', which are distributed along the primary sequence, are spatially grouped in a dome-like arrangement over the active site. The identified amino acids did not correspond to the amino acid residues which are involved in the chemical reaction site of these proteins. It is thus proposed that the resonant recognition model helps to identify amino acid residues which are important for the creation of the molecular structure around the catalytic active site and also the associated physical field conditions required for biorecognition, docking of the specific substrate and full biological activity.

7.
8.
Many proteins involved in key biological processes are modular in nature. A group of these, the beta-propeller proteins, fold by packing 4-stranded beta-sheets in a circular array. The members of this group are increasingly numerous and, although their modular building blocks all preserve the same basic conformation, they do not have similar sequences. These proteins have extreme functional and phylogenetic diversity. Here, features of the beta-propeller fold are reviewed through comparisons of available structural coordinates. Structure-based sequence alignments combined with analyses of superpositions of individual modular units reveal conserved general features such as hydrogen bonds, beta-turns and positions of hydrophobic contacts. The lack of significant sequence identity is compensated by sets of interactions which stabilise the fold differently in distinct structures. Recurring aspartates make contacts to exposed backbone amides in turns or peptide connections within the same sheet. The sole factor responsible for the number of sheets that assemble in the array is the size of the hydrophobic residues that pack into the cores between the sheets. Whilst there is no overall sequence conservation, it may be possible to detect new members of this fold through sequence searches that take into account the repeated nature of the modular assembly as well as the positions of hydrophobic residues and H-bonding side chains.

9.

Background

Diacylglycerol acyltransferase families (DGATs) catalyze the final and rate-limiting step of triacylglycerol (TAG) biosynthesis in eukaryotic organisms. Understanding the roles of DGATs will help to create transgenic plants with value-added properties and provide clues for therapeutic intervention for obesity and related diseases. The objective of this analysis was to identify conserved sequence motifs and amino acid residues for better understanding of the structure-function relationship of these important enzymes.

Results

117 DGAT sequences from 70 organisms including plants, animals, fungi and human are obtained from database search using tung tree DGATs. Phylogenetic analysis separates these proteins into DGAT1 and DGAT2 subfamilies. These DGATs are integral membrane proteins with more than 40% of the total amino acid residues being hydrophobic. They have similar properties and amino acid composition except that DGAT1s are approximately 20 kDa larger than DGAT2s. DGAT1s and DGAT2s have 41 and 16 completely conserved amino acid residues, respectively, although only two of them are shared by all DGATs. These residues are distributed in 7 and 6 sequence blocks for DGAT1s and DGAT2s, respectively, and located at the carboxyl termini, suggesting the location of the catalytic domains. These conserved sequence blocks do not contain the putative neutral lipid-binding domain, mitochondrial targeting signal, or ER retrieval motif. The importance of conserved residues has been demonstrated by site-directed and natural mutants.

Conclusions

This study has identified conserved sequence motifs and amino acid residues in all 117 DGATs and in the two subfamilies. None of the completely conserved residues in DGAT1s and DGAT2s is present in recently reported isoforms in the multiple sequence alignment, raising the important question of how proteins with completely different amino acid sequences could perform the same biochemical reaction. The sequence analysis should facilitate study of the structure-function relationship of DGATs, with the ultimate goal of identifying critical amino acid residues for engineering superior enzymes in metabolic engineering and for selecting enzyme inhibitors in therapeutic applications for obesity and related diseases.
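
As a rough illustration of what "completely conserved residues" means in a multiple sequence alignment, the sketch below scans a toy alignment column by column and reports the positions where every sequence carries the same non-gap residue. The function name `conserved_columns` and the three short sequences are invented for illustration; they are not the DGAT data, and this is not the authors' pipeline.

```python
def conserved_columns(alignment, gap="-"):
    """Return (column index, residue) for every fully conserved, gap-free column."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    conserved = []
    for i in range(length):
        # Collect the distinct characters seen in this alignment column.
        column = {seq[i] for seq in alignment}
        if len(column) == 1 and gap not in column:
            conserved.append((i, alignment[0][i]))
    return conserved


# Hypothetical toy alignment (not real DGAT sequences).
toy_alignment = [
    "MKH-LFG",
    "MRHALFG",
    "MKH-LYG",
]
print(conserved_columns(toy_alignment))
```

Counting such columns separately within each subfamily and then across the pooled alignment is one simple way to see how a family can have many subfamily-specific conserved residues yet only a few shared by all members.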

10.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of objective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins: serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information, and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure. Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship rather than the similarities having arisen from the structural requirements of a given fold.

11.
Lipocalins are functionally diverse proteins that are composed of 120–180 amino acid residues. Members of this family have several important biological functions including ligand transport, cryptic coloration, sensory transduction, endonuclease activity, stress response activity in plants, odorant binding, prostaglandin biosynthesis, cellular homeostasis regulation, immunity, immunotherapy and so on. Identification of lipocalins from protein sequence is challenging because of their poor sequence identity, which often falls below the twilight zone. So far, no specific method has been reported to identify lipocalins from primary sequence. In this paper, we report a support vector machine (SVM) approach to predict lipocalins from protein sequence using sequence-derived properties. LipoPred was trained using a dataset consisting of 325 lipocalin proteins and 325 non-lipocalin proteins, and evaluated by an independent set of 140 lipocalin proteins and 21,447 non-lipocalin proteins. LipoPred achieved 88.61% accuracy with 89.26% sensitivity, 85.27% specificity and a Matthews correlation coefficient (MCC) of 0.74. When applied to the test dataset, LipoPred achieved 84.25% accuracy with 88.57% sensitivity, 84.22% specificity and an MCC of 0.16. LipoPred achieved better performance than the PSI-BLAST, HMM and SVM-Prot methods. Out of 218 lipocalins, LipoPred correctly predicted 194 proteins, including 39 lipocalins that are non-homologous to any protein in the SWISSPROT database. This result shows that LipoPred is potentially useful for predicting lipocalin proteins that have no sequence homologs in the sequence databases. Further, the successful prediction of nine hypothetical lipocalin proteins and five new members of the lipocalin family shows that LipoPred can be used efficiently to identify and annotate new lipocalin proteins from sequence databases. The LipoPred software and dataset are available at .
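
The gap between the test-set accuracy (84.25%) and the test-set MCC (0.16) can look surprising at first. The sketch below rebuilds an approximate confusion matrix from the figures quoted in the abstract (140 lipocalins, 21,447 non-lipocalins, 88.57% sensitivity, 84.22% specificity) and recomputes both metrics. The rounding of counts to whole proteins is an assumption, and the helper names are made up for this example.

```python
import math


def confusion_from_rates(n_pos, n_neg, sensitivity, specificity):
    """Approximate (TP, FN, TN, FP) by rounding the reported rates to whole counts."""
    tp = round(n_pos * sensitivity)
    tn = round(n_neg * specificity)
    return tp, n_pos - tp, tn, n_neg - tn


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient; returns 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Figures taken from the abstract's independent test set.
tp, fn, tn, fp = confusion_from_rates(140, 21447, 0.8857, 0.8422)
accuracy = (tp + tn) / (tp + fn + tn + fp)
precision = tp / (tp + fp)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"accuracy={accuracy:.4f} precision={precision:.4f} MCC={mcc(tp, fn, tn, fp):.2f}")
```

Under these assumptions the reconstructed counts reproduce the reported accuracy and MCC. The low MCC then follows from class imbalance: with roughly 150 negatives for every positive, even a specificity above 84% leaves false positives far outnumbering true positives.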

12.
The twin arginine translocation (TAT) system ferries folded proteins across the bacterial membrane. Proteins are directed into this system by the TAT signal peptide present at the amino terminus of the precursor protein, which contains the twin arginine residues that give the system its name. There are currently only two computational methods for the prediction of TAT-translocated proteins from sequence. Both methods have limitations that make the creation of a new algorithm for TAT-translocated protein prediction desirable. We have developed TATPred, a new sequence-model method, based on a Naive Bayesian network, for the prediction of TAT signal peptides. In this approach, a comprehensive range of models was tested to identify the most reliable and robust predictor. The best model comprised 12 residues: three residues prior to the twin arginines and the seven residues that follow them. We found a prediction sensitivity of 0.979 and a specificity of 0.942.
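
The 12-residue window described above (three residues before the twin arginines, the two arginines themselves, and seven residues after) can be sketched as a simple extraction step. This only shows how such a window might be cut out of a sequence; it is not TATPred, it contains no Bayesian model, and the function name and toy sequence are hypothetical.

```python
def twin_arginine_windows(sequence, before=3, after=7):
    """Return (position of 'RR', window) for each 'RR' with enough flanking residues."""
    windows = []
    start = sequence.find("RR")
    while start != -1:
        lo = start - before
        hi = start + 2 + after
        # Skip occurrences too close to either end to yield a full window.
        if lo >= 0 and hi <= len(sequence):
            windows.append((start, sequence[lo:hi]))
        start = sequence.find("RR", start + 1)
    return windows


# Made-up N-terminal sequence, used only to exercise the function.
toy_precursor = "MSLSRRQFIQASGIALCAGAVPLKASA"
for position, window in twin_arginine_windows(toy_precursor):
    print(position, window, len(window))
```

Each extracted window has length 3 + 2 + 7 = 12, matching the model size quoted in the abstract; a classifier would then score these fixed-length windows.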

13.
A new approach is introduced for analyzing and ultimately predicting protein structures, defined at the level of Cα coordinates. We analyze hexamers (oligopeptides of six amino acid residues) and show that their structure tends to concentrate in specific clusters rather than vary continuously. Thus, we can use a limited set of standard structural building blocks taken from these clusters as representatives of the repertoire of observed hexamers. We demonstrate that protein structures can be approximated by concatenating such building blocks. We have identified about 100 building blocks by applying clustering algorithms, and have shown that they can "replace" about 76% of all hexamers in well-refined known proteins with an error of less than 1 Å, and can be joined together to cover 99% of the residues. After replacing each hexamer by a standard building block with similar conformation, we can approximately reconstruct the actual structure by smoothly joining the overlapping building blocks into a full protein. The reconstructed structures show, in most cases, high resemblance to the original structure, although using a limited number of building blocks and local criteria of concatenating them is not likely to produce a very precise global match. Since these building blocks reflect, in many cases, some sequence dependency, it may be possible to use the results of this study as a basis for a protein structure prediction procedure.
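
To make the hexamer idea concrete, the sketch below cuts a Cα trace into overlapping six-residue fragments and compares two fragments by distance RMSD (the root-mean-square difference of their internal Cα-Cα distances). Distance RMSD is used here only because it needs no superposition; the abstract does not say which error measure the authors used, so treat this as a stand-in. The coordinates are synthetic.

```python
import math
from itertools import combinations


def hexamers(ca_coords, size=6):
    """Return all overlapping fragments of `size` consecutive C-alpha positions."""
    return [ca_coords[i:i + size] for i in range(len(ca_coords) - size + 1)]


def drmsd(frag_a, frag_b):
    """Distance RMSD between two equal-length fragments (no superposition needed)."""
    if len(frag_a) != len(frag_b):
        raise ValueError("fragments must have the same length")
    pairs = list(combinations(range(len(frag_a)), 2))
    squared = sum(
        (math.dist(frag_a[i], frag_a[j]) - math.dist(frag_b[i], frag_b[j])) ** 2
        for i, j in pairs
    )
    return math.sqrt(squared / len(pairs))


# Synthetic traces: a straight chain and a zig-zag chain, 8 residues each.
straight = [(3.8 * i, 0.0, 0.0) for i in range(8)]
zigzag = [(3.0 * i, 2.0 * (i % 2), 0.0) for i in range(8)]

fragments = hexamers(straight)
print(len(fragments))                           # number of hexamers in an 8-residue chain
print(drmsd(fragments[0], fragments[1]))        # identical shapes, shifted in space
print(drmsd(fragments[0], hexamers(zigzag)[0])) # different shapes
```

A chain of N residues yields N - 5 overlapping hexamers, which is why a library of about 100 building blocks can be tested against a very large number of observed fragments.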

14.
Certain residues have no known function yet are co-conserved across distantly related protein families and diverse organisms, suggesting that they perform critical roles associated with as-yet-unidentified molecular properties and mechanisms. This raises the question of how to obtain additional clues regarding these mysterious biochemical phenomena with a view to formulating experimentally testable hypotheses. One approach is to access the implicit biochemical information encoded within the vast amount of genomic sequence data now becoming available. Here, a new Gibbs sampling strategy is formulated and implemented that can partition hundreds of thousands of sequences within a major protein class into multiple, functionally-divergent categories based on those pattern residues that best discriminate between categories. The sampler precisely defines the partition and pattern for each category by explicitly modeling unrelated, non-functional and related-yet-divergent proteins that would otherwise obscure the analysis. To aid biological interpretation, auxiliary routines can characterize pattern residues within available crystal structures and identify those structures most likely to shed light on the roles of pattern residues. This approach can be used to define and annotate automatically subgroup-specific conserved domain profiles based on statistically-rigorous empirical criteria rather than on the subjective and labor-intensive process of manual curation. Incorporating such profiles into domain database search sites (such as the NCBI BLAST site) will provide biologists with previously inaccessible molecular information useful for hypothesis generation and experimental design. Analyses of P-loop GTPases and of AAA+ ATPases illustrate the sampler's ability to obtain such information.

15.
Protein-protein interactions govern almost all biological processes and the underlying functions of proteins. The interaction sites of a protein depend on its 3D structure, which in turn depends on the amino acid sequence. Hence, prediction of protein function from primary sequence is an important and challenging task in bioinformatics. Identifying the amino acids (hot spots) that give rise to the characteristic frequency signifying a particular biological function is a tedious task in proteomic signal processing. In this paper, we propose a new technique for identifying hot spots in proteins using an efficient time-frequency filtering approach known as S-transform filtering. The S-transform is a powerful linear time-frequency representation and is especially useful for filtering in the time-frequency domain. The potential of the new technique for identifying hot spots in proteins is analyzed, and the results are compared with those of existing methods. The results demonstrate that the proposed method is superior to its counterparts and is consistent with results based on biological methods for identification of hot spots. The proposed method also reveals some new hot spots, which need further investigation and validation by the biological community.

16.
Lin YS 《Proteins》2008,73(1):53-62
Factors that are related to the thermostability of proteins have been extensively studied in recent years, especially by comparing thermophiles and mesophiles. However, most of them are global characters. It is still not clear how to identify specific residues or fragments which may be more relevant to protein thermostability. Moreover, some of the differences between thermophiles and mesophiles may be due to phylogenetic differences instead of thermal adaptation. To resolve these problems, I adopted a strategy to identify residue substitutions that evolved convergently in thermophiles or mesophiles. These residues may therefore be responsible for thermal adaptation. Four classes of genomes were utilized in this study, including thermophilic archaea, mesophilic archaea, thermophilic bacteria, and mesophilic bacteria. For most clusters of orthologous groups (COGs) with sequences from all four of these classes of genomes, I can identify specific residues or fragments that may potentially be responsible for thermal adaptation. Functional or structural constraints (represented as sequence conservation) were suggested to have a higher impact on thermal adaptation than secondary structure or solvent accessibility does. I further compared thermophilic archaea and mesophilic bacteria, and found that the most diverged fragments may not necessarily correspond to the thermostability-determining ones. The usual approach of comparing thermophiles and mesophiles without considering phylogenetic relationships may roughly identify sequence features contributing to thermostability; however, to specifically identify residue substitutions responsible for thermal adaptation, one should take sequence evolution into consideration.

17.
MOTIVATION: We introduce a novel approach to multiple alignment that is based on an algorithm for rapidly checking whether single matches are consistent with a partial multiple alignment. This leads to a sequence annealing algorithm, which is an incremental method for building multiple sequence alignments one match at a time. Our approach improves significantly on the standard progressive alignment approach to multiple alignment. RESULTS: The sequence annealing algorithm performs well on benchmark test sets of protein sequences. It is not only sensitive, but also specific, drastically reducing the number of incorrectly aligned residues in comparison to other programs. The method allows for adjustment of the sensitivity/specificity tradeoff and can be used to reliably identify homologous regions among protein sequences. AVAILABILITY: An implementation of the sequence annealing algorithm is available at http://bio.math.berkeley.edu/amap/

18.
MOTIVATION: Discovery of regulatory motifs in unaligned DNA sequences remains a fundamental problem in computational biology. Two categories of algorithms have been developed to identify common motifs from a set of DNA sequences. The first can be called a 'multiple genes, single species' approach. It proposes that a degenerate motif is embedded in some or all of the otherwise unrelated input sequences and tries to describe a consensus motif and identify its occurrences. It is often used for co-regulated genes identified through experimental approaches. The second approach can be called 'single gene, multiple species'. It requires orthologous input sequences and tries to identify unusually well conserved regions by phylogenetic footprinting. Both approaches perform well, but each has some limitations. It is tempting to combine the knowledge of co-regulation among different genes and conservation among orthologous genes to improve our ability to identify motifs. RESULTS: Based on the Consensus algorithm previously established by our group, we introduce a new algorithm called PhyloCon (Phylogenetic Consensus) that takes into account both conservation among orthologous genes and co-regulation of genes within a species. This algorithm first aligns conserved regions of orthologous sequences into multiple sequence alignments, or profiles, then compares profiles representing non-orthologous sequences. Motifs emerge as common regions in these profiles. Here we present a novel statistic to compare profiles of DNA sequences and a greedy approach to search for common subprofiles. We demonstrate that PhyloCon performs well on both synthetic and biological data. AVAILABILITY: Software available upon request from the authors. http://ural.wustl.edu/softwares.html

19.
The characterization of protein interactions is essential for understanding biological systems. While genome-scale methods are available for identifying interacting proteins, they do not pinpoint the interacting motifs (e.g., a domain, sequence segments, a binding site, or a set of residues). Here, we develop and apply a method for delineating the interacting motifs of hub proteins (i.e., highly connected proteins). The method relies on the observation that proteins with common interaction partners tend to interact with these partners through a common interacting motif. The sole input for the method is a set of binary protein interactions; neither sequence nor structure information is needed. The approach is evaluated by comparing the inferred interacting motifs with domain families defined for 368 proteins in the Structural Classification of Proteins (SCOP). The positive predictive value of the method for detecting proteins with common SCOP families is 75% at a sensitivity of 10%. Most of the inferred interacting motifs were significantly associated with sequence patterns, which could be responsible for the common interactions. We find that yeast hubs with multiple interacting motifs are more likely to be essential than hubs with one or two interacting motifs, thus rationalizing the previously observed correlation between essentiality and the number of interacting partners of a protein. We also find that yeast hubs with multiple interacting motifs evolve more slowly than the average protein, in contrast to hubs with one or two interacting motifs. The proposed method will help us discover unknown interacting motifs and provide biological insights about protein hubs and their roles in interaction networks.
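
The observation the method relies on, that proteins sharing interaction partners tend to share an interacting motif, can be illustrated with nothing more than a list of binary interactions. The sketch below scores protein pairs by the Jaccard overlap of their partner sets. The scoring choice, the function names and the toy network are assumptions made for this example; the abstract does not describe the authors' actual algorithm.

```python
from collections import defaultdict
from itertools import combinations


def partner_sets(interactions):
    """Map each protein to the set of proteins it interacts with."""
    partners = defaultdict(set)
    for a, b in interactions:
        partners[a].add(b)
        partners[b].add(a)
    return partners


def shared_partner_scores(interactions):
    """Jaccard overlap of partner sets for every protein pair with a shared partner."""
    partners = partner_sets(interactions)
    scores = {}
    for p, q in combinations(sorted(partners), 2):
        shared = partners[p] & partners[q]
        if shared:
            scores[(p, q)] = len(shared) / len(partners[p] | partners[q])
    return scores


# Hypothetical binary interactions (protein names are placeholders).
toy_interactions = [
    ("A", "X"), ("A", "Y"),
    ("B", "X"), ("B", "Y"),
    ("C", "Y"), ("C", "Z"),
]
scores = shared_partner_scores(toy_interactions)
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

In this toy network A and B have identical partner sets, so they would be the first candidates for sharing an interacting motif, while C overlaps with them only through Y.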

20.
In nature, 1 out of every 10 proteins has an (α/β)₈ (TIM)-barrel fold, and in most cases, pairwise comparisons show no sequence similarity between them. Hence, delineating the key residues that induce very different sequences to share a common fold is important for understanding the folding and stability of TIM-barrel domains. In this work, we propose a new consensus approach for locating these stabilizing residues based on long-range interactions, hydrophobicity, and conservation of amino acid residues. We have identified 957 stabilizing residues in 63 proteins from a nonredundant set of 71 TIM-barrel domains. Most of these residues are located in the 8-stranded beta-sheet, with nearly one half of them oriented toward the interior of the barrel and the other half oriented toward the surrounding alpha-helices. Several stabilizing residues are found in the N- and C-terminal loops, whereas very few appear in the alpha-helices that surround the internal beta-sheet. Further, these 957 residues are placed in 434 stabilizing segments of various sizes, and each domain contains 1-10 of these segments. We found that 8 segments per domain is the most common count, and two thirds of the proteins have 7-9 stabilizing segments. Finally, we verified the identified residues with experimental temperature factors and found that these residues are among the ones with less mobility in the considered proteins. We suggest that our new protocol serves as a powerful tool to identify the stabilizing residues in TIM-barrel domains, which can be used as potential candidates for studying protein folding and stability by means of protein engineering experiments.
