Recognizing structural similarity without significant sequence identity has proved to be a challenging task. Sequence-based and structure-based methods as well as their combinations have been developed. Here, we propose a fold-recognition method that incorporates structural information without the need for sequence-to-structure threading. This is accomplished by generating sequence profiles from protein structural fragments. The structure-derived sequence profiles allow a simple integration with evolution-derived sequence profiles and secondary-structural information for an optimized alignment by efficient dynamic programming. The resulting method (called SP(3)) is found to make a statistically significant improvement in both sensitivity of fold recognition and accuracy of alignment over the method based on evolution-derived sequence profiles alone (SP) and the method based on evolution-derived sequence profile and secondary structure profile (SP(2)). SP(3) was tested on the SALIGN benchmark for alignment accuracy and on the Lindahl, PROSPECTOR 3.0, and LiveBench 8.0 benchmarks for remote-homology detection and model accuracy. SP(3) is found to be the most sensitive and accurate single-method server in all benchmarks tested where other methods are available for comparison (although its results are statistically indistinguishable from the next best in some cases, and the comparison is subject to the limitation of the time-dependent sequence and/or structural libraries used by different methods). In LiveBench 8.0, its accuracy rivals some of the consensus methods such as ShotGun-INBGU, Pmodeller3, Pcons4, and ROBETTA. The SP(3) fold-recognition server is available at http://theory.med.buffalo.edu.

Using a recently developed protein folding algorithm, a prediction of the tertiary structure of the KIX domain of the CREB binding protein is described. The method incorporates predicted secondary and tertiary restraints derived from multiple sequence alignments in a reduced protein model whose conformational space is explored by Monte Carlo dynamics. Secondary structure restraints are provided by the PHD secondary structure prediction algorithm that was modified for the presence of predicted U-turns, i.e., regions where the chain reverses global direction. Tertiary restraints are obtained via a two-step process: First, seed side-chain contacts are identified from a correlated mutation analysis, and then, a threading-based algorithm expands the number of these seed contacts. Blind predictions indicate that the KIX domain is a putative three-helix bundle, although the chirality of the bundle could not be uniquely determined. The expected root-mean-square deviation for the correct chirality of the KIX domain is between 5.0 and 6.2 Å. This is to be compared with the estimate of 12.9 Å that would be expected by a random prediction, using the model of F. Cohen and M. Sternberg (J. Mol. Biol. 138:321–333, 1980). Proteins 30:287–294, 1998. © 1998 Wiley-Liss, Inc.

Cozzetto D  Tramontano A 《Proteins》2005,58(1):151-157
Comparative modeling is the method of choice, whenever applicable, for protein structure prediction, not only because of its higher accuracy compared to alternative methods, but also because it is possible to estimate a priori the quality of the models that it can produce, thereby allowing the usefulness of a model for a given application to be assessed beforehand. By and large, the quality of a comparative model depends on two factors: the extent of structural divergence between the target and the template and the quality of the sequence alignment between the two protein sequences. The latter is usually derived from a multiple sequence alignment (MSA) of as many proteins of the family as possible, and its accuracy depends on the number and similarity distribution of the sequences of the protein family. Here we describe a method to evaluate the expected difficulty, and by extension accuracy, of a comparative model on the basis of the MSA used to build it. The parameter that we derive is used to compare the results obtained in the last two editions of the Critical Assessment of Methods for Structure Prediction (CASP) experiment as a function of the difficulty of the modeling exercise. Our analysis demonstrates that the improvement in the scope and quality of comparative models between the two experiments is largely due to the increased number of available protein sequences and to the consequent increased chance that a large and appropriately spaced set of protein sequences homologous to the proteins of interest is available.

The function of a protein molecule is greatly influenced by its three-dimensional (3D) structure and therefore structure prediction will help identify its biological function. We have updated Sequence, Motif and Structure (SMS), the database of structurally rigid peptide fragments, by combining amino acid sequences and the corresponding 3D atomic coordinates of non-redundant (25%) and redundant (90%) protein chains available in the Protein Data Bank (PDB). SMS 2.0 provides information pertaining to the peptide fragments of length 5-14 residues. The entire dataset is divided into three categories, namely, same sequence motifs having similar, intermediate or dissimilar 3D structures. Further, options are provided to facilitate structural superposition using the program Structural Alignment of Multiple Proteins (STAMP), and the popular Java plug-in Jmol is deployed for visualization. In addition, functionalities are provided to search for the occurrences of the sequence motifs in other structural and sequence databases like PDB, Genome Database (GDB), Protein Information Resource (PIR) and Swiss-Prot. The updated database along with the search engine is available over the World Wide Web through the following URL: http://cluster.physics.iisc.ernet.in/sms/.

We present a novel method for the comparison of multiple protein alignments with assessment of statistical significance (COMPASS). The method derives numerical profiles from alignments, constructs optimal local profile-profile alignments and analytically estimates E-values for the detected similarities. The scoring system and E-value calculation are based on a generalization of the PSI-BLAST approach to profile-sequence comparison, which is adapted for the profile-profile case. Tested along with existing methods for profile-sequence (PSI-BLAST) and profile-profile (prof_sim) comparison, COMPASS shows increased abilities for sensitive and selective detection of remote sequence similarities, as well as improved quality of local alignments. The method allows prediction of relationships between protein families in the PFAM database beyond the range of conventional methods. Two predicted relations with high significance are similarities between various Rossmann-type folds and between various helix-turn-helix-containing families. The potential value of COMPASS for structure/function predictions is illustrated by the detection of an intricate homology between the DNA-binding domain of the CTF/NFI family and the MH1 domain of the Smad family.

We describe a new strategy for utilizing multiple sequence alignment information to detect distant relationships in searches of sequence databases. A single sequence representing a protein family is enriched by replacing conserved regions with position-specific scoring matrices (PSSMs) or consensus residues derived from multiple alignments of family members. In comprehensive tests of these and other family representations, PSSM-embedded queries produced the best results overall when used with a special version of the Smith-Waterman searching algorithm. Moreover, embedding consensus residues instead of PSSMs improved performance with readily available single sequence query searching programs, such as BLAST and FASTA. Embedding PSSMs or consensus residues into a representative sequence improves searching performance by extracting multiple alignment information from motif regions while retaining single sequence information where alignment is uncertain.
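
The consensus-embedding idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the per-column conservation threshold (`min_fraction`), the function names, and the toy alignment are assumptions made for illustration, and the abstract speaks of conserved regions rather than single columns.

```python
from collections import Counter


def consensus_column(column, min_fraction=0.6):
    """Return the most common residue of a column if it is frequent enough, else None.

    min_fraction is a hypothetical conservation threshold, not a value from the abstract.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return None
    residue, count = Counter(residues).most_common(1)[0]
    return residue if count / len(column) >= min_fraction else None


def embed_consensus(alignment, representative_index=0, min_fraction=0.6):
    """Replace residues of the representative sequence by consensus residues
    in conserved columns; keep its own residues where the alignment is uncertain."""
    representative = alignment[representative_index]
    embedded = []
    for i, own_residue in enumerate(representative):
        if own_residue == "-":
            continue  # skip columns where the representative has a gap
        column = [seq[i] for seq in alignment]
        consensus = consensus_column(column, min_fraction)
        embedded.append(consensus if consensus is not None else own_residue)
    return "".join(embedded)


# Toy alignment (hypothetical data): four aligned family members.
toy_alignment = ["MKTAY", "MRTAW", "MKSAF", "MKTA-"]
query = embed_consensus(toy_alignment, representative_index=1)
print(query)
```

The resulting string is an ordinary sequence, which is why such a query can be passed to single-sequence search programs such as BLAST or FASTA.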

Information is often encoded as an aperiodic chain of building blocks. Modern digital computers use bits as the building blocks, but in general the choice of building blocks depends on the nature of the information to be encoded. What are the optimal building blocks to encode structural information? This can be analysed by substituting the operations of addition and multiplication of conventional arithmetic with translation and rotation. It is argued that at the molecular level, the best component for encoding discretized structural information is carbon. Living organisms discovered this billions of years ago, and used carbon as the backbone for constructing proteins that function according to their structure. Structural analysis of polypeptide chains shows that an efficient and versatile structural language of 20 building blocks is needed to implement all the tasks carried out by proteins. Properties of amino acids indicate that the present triplet genetic code was preceded by a more primitive one, coding for 10 amino acids using two nucleotide bases.

Wu S  Zhang Y 《Proteins》2008,72(2):547-556
We develop a new threading algorithm, MUSTER, by extending the previous sequence profile-profile alignment method, PPA. It combines various sequence and structure information into single-body terms which can be conveniently used in dynamic programming search: (1) sequence profiles; (2) secondary structures; (3) structure fragment profiles; (4) solvent accessibility; (5) dihedral torsion angles; (6) hydrophobic scoring matrix. The balance of the weighting parameters is optimized by a grading search based on the average TM-score of 111 training proteins, which shows a better performance than using the conventional optimization methods based on the PROSUP database. The algorithm is tested on 500 nonhomologous proteins independent of the training sets. After removing the homologous templates with a sequence identity to the target >30%, in 224 cases, the first template alignment has the correct topology with a TM-score >0.5. Even with a more stringent cutoff by removing the templates with a sequence identity >20% or detectable by PSI-BLAST with an E-value <0.05, MUSTER is able to identify correct folds in 137 cases with the first model of TM-score >0.5. Depending on the homology cutoffs, the average TM-score of the first threading alignments by MUSTER is 5.1-6.3% higher than that by PPA. This improvement is statistically significant by the Wilcoxon signed rank test with a P-value < 1.0 × 10^-13, which demonstrates the effect of additional structural information on protein fold recognition. The MUSTER server is freely available to the academic community at http://zhang.bioinformatics.ku.edu/MUSTER.
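
As a rough illustration of how single-body terms combine into a score that dynamic programming can optimize, the following Python sketch sums two weighted terms per aligned position and runs a global alignment. It is a simplified sketch, not the MUSTER scoring function: only two of the six terms are included, and the weights, the linear gap penalty, the data layout, and all function names are assumptions.

```python
def position_score(q, t, weights):
    """Weighted sum of single-body terms for aligning query position q with template position t."""
    # Profile term: dot product of two per-position frequency vectors.
    profile_term = sum(a * b for a, b in zip(q["profile"], t["profile"]))
    # Secondary-structure term: reward a match, penalize a mismatch.
    ss_term = 1.0 if q["ss"] == t["ss"] else -1.0
    return weights["profile"] * profile_term + weights["ss"] * ss_term


def global_align_score(query, template, weights, gap=-1.0):
    """Needleman-Wunsch global alignment score with a linear gap penalty (an assumption)."""
    n, m = len(query), len(template)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + position_score(query[i - 1], template[j - 1], weights)
            dp[i][j] = max(match, dp[i - 1][j] + gap, dp[i][j - 1] + gap)
    return dp[n][m]


# Toy two-residue proteins over a two-letter alphabet (hypothetical data).
helix = {"profile": [1.0, 0.0], "ss": "H"}
strand = {"profile": [0.0, 1.0], "ss": "E"}
toy_weights = {"profile": 1.0, "ss": 0.5}  # illustrative weights, not trained values
print(global_align_score([helix, strand], [helix, strand], toy_weights))
```

Because every term depends only on one query position and one template position, the total score is additive over aligned pairs, which is what makes the dynamic programming recursion applicable.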

The dispositions of 39 alpha helices of greater than 2.5 turns and four beta sheets in the major capsid protein (VP5, 149 kDa) of herpes simplex virus type 1 were identified by computational and visualization analysis from the 8.5 Å electron cryomicroscopy structure of the whole capsid. The assignment of helices in the VP5 upper domain was validated by comparison with the recently determined crystal structure of this region. Analysis of the spatial arrangement of helices in the middle domain of VP5 revealed that the organization of a tightly associated bundle of ten helices closely resembled that of a domain fold found in the annexin family of proteins. Structure-based sequence searches suggested that sequences in both the N and C-terminal portions of the VP5 sequence contribute to this domain. The long helices seen in the floor domain of VP5 form an interconnected network within and across capsomeres. The combined structural and sequence-based informatics has led to an architectural model of VP5. This model placed in the context of the capsid provides insights into the strategies used to achieve viral capsid stability.

Nair R  Rost B 《Proteins》2003,53(4):917-930
The native sub-cellular compartment of a protein is one aspect of its function. Thus, predicting localization is an important step toward predicting function. Short zip code-like sequence fragments regulate some of the shuttling between compartments. Cataloguing and predicting such motifs is the most accurate means of determining localization in silico. However, only a few motifs are currently known, and not all trafficking appears to be regulated in this way. The amino acid composition of a protein correlates with its localization. All general prediction methods have employed this observation. Here, we explored the evolutionary information contained in multiple alignments and aspects of protein structure to predict localization in the absence of homology and targeting motifs. Our final system combined statistical rules and a variety of neural networks to achieve an overall four-state accuracy above 65%, a significant improvement over systems using only composition. The system was at its best for extra-cellular and nuclear proteins; it was significantly less accurate than TargetP for mitochondrial proteins. Interestingly, all methods that were developed on SWISS-PROT sequences failed grossly when fed with sequences from proteins of known structures taken from PDB. We therefore developed two separate systems: one for proteins of known structure and one for proteins of unknown structure. Finally, we applied the PDB-based system along with homology-based inferences and automatic text analysis to annotate all eukaryotic proteins in the PDB (http://cubic.bioc.columbia.edu/db/LOC3D). We imagine that this pilot method, certainly in combination with similar tools, may be valuable for target selection in structural genomics.

A subset of eukaryotic aminoacyl-tRNA synthetases (a-RS) are contained in a multienzyme complex for which little structural detail is known. Three reversible chemical crosslinking reagents have been used to investigate the arrangement of polypeptides within this particle as isolated from rabbit reticulocytes. Identification of the crosslinked protein pairs was accomplished by two-dimensional SDS diagonal gel electrophoresis. Seventeen neighboring protein pairs have been identified. Eight are seen with at least two reagents: K-RS:p38, D-RS:K-RS, R-RS dimer, K-RS dimer, K-RS:Q-RS, E/P-RS:K-RS, E/P-RS:I-RS, and Q-RS with one of the nonsynthetase proteins. Nine more are observed with one reagent: D-RS dimer, R-RS:p43, D-RS:Q-RS, D-RS:M-RS, K-RS:L-RS, I-RS:R-RS, D-RS:E/P-RS, I-RS:Q-RS, I-RS:L-RS. One trimeric association is seen: E/P-RS:I-RS:L-RS. The observed neighboring protein pairs suggest that the polypeptides within the aminoacyl-tRNA synthetase complex are distributed in three structural domains of similar mass. These can be arranged in a U-shaped particle in which each "arm" is considered a domain and the third forms the "base" of the structure. The arms have been termed domain I (D-RS, M-RS, Q-RS) and domain II (K-RS, R-RS), with domain III (E/P-RS, I-RS, L-RS) assigned to the base. The smaller proteins (p38, p43) may bridge the domains. The proposed spatial relationship of these domains, as well as their compositions, is consistent with earlier studies. Thus, this study provides an initial three-dimensional working model of the arrangement of polypeptides within the multienzyme aminoacyl-tRNA synthetase complex.

Fitzkee NC  Fleming PJ  Rose GD 《Proteins》2005,58(4):852-854
Approximately half the structure of folded proteins is either alpha-helix or beta-strand. We have developed a convenient repository of all remaining structure after these two regular secondary structure elements are removed. The Protein Coil Library (http://roselab.jhu.edu/coil/) allows rapid and comprehensive access to non-alpha-helix and non-beta-strand fragments contained in the Protein Data Bank (PDB). The library contains both sequence and structure information together with calculated torsion angles for both the backbone and side chains. Several search options are implemented, including a query function that uses output from popular PDB-culling servers directly. Additionally, several popular searches are stored and updated for immediate access. The library is a useful tool for exploring conformational propensities, turn motifs, and a recent model of the unfolded state.
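
For readers unfamiliar with torsion angles, the following Python sketch shows one standard way to compute the dihedral angle defined by four consecutive atoms. It is a generic geometric calculation, not code from the Protein Coil Library; the function names and the toy coordinates are assumptions.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range [-180, 180]."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    length = math.sqrt(_dot(b1, b1))
    axis = tuple(x / length for x in b1)  # unit vector along the central bond
    # Project the two outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, axis) * x for x in axis))
    w = _sub(b2, tuple(_dot(b2, axis) * x for x in axis))
    x = _dot(v, w)
    y = _dot(_cross(axis, v), w)
    return math.degrees(math.atan2(y, x))


# Toy coordinates (hypothetical): the two outer bonds are perpendicular when
# viewed along the central bond.
print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
```

For a backbone phi angle the four points would be the C, N, CA, and C atoms of consecutive residues; for psi, the N, CA, C, and N atoms.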

One of the major bottlenecks in many ab initio protein structure prediction methods is currently the selection of a small number of candidate structures for high‐resolution refinement from large sets of low‐resolution decoys. This step often includes a scoring by low‐resolution energy functions and a clustering of conformations by their pairwise root mean square deviations (RMSDs). As an efficient selection is crucial to reduce the overall computational cost of the predictions, any improvement in this direction can increase the overall performance of the predictions and the range of protein structures that can be predicted. We show here that the use of structural profiles, which can be predicted with good accuracy from the amino acid sequences of proteins, provides an efficient means to identify good candidate structures. Proteins 2010. © 2009 Wiley‐Liss, Inc.
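
The conventional RMSD-based clustering step mentioned above can be sketched in Python as follows. This illustrates the baseline step only, not the structural-profile method of the abstract. It assumes the decoys are already superposed and have atoms in the same order; the cutoff value, function names, and toy coordinates are hypothetical.

```python
import math


def rmsd(a, b):
    """Root mean square deviation between two equally sized, pre-superposed coordinate lists."""
    squared = sum((x - y) ** 2 for p, q in zip(a, b) for x, y in zip(p, q))
    return math.sqrt(squared / len(a))


def largest_cluster_center(decoys, cutoff):
    """Index of the decoy with the most neighbors (itself included) within the RMSD cutoff."""
    counts = [sum(1 for other in decoys if rmsd(d, other) <= cutoff) for d in decoys]
    return max(range(len(decoys)), key=counts.__getitem__)


# Three toy two-atom "decoys" (hypothetical data); the first two are near-identical.
toy_decoys = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)],
]
print(largest_cluster_center(toy_decoys, cutoff=0.5))
```

Computing all pairwise RMSDs scales quadratically with the number of decoys, which is one reason this selection step can become a bottleneck for large decoy sets.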

The methods of clustering and ordination were compared with polynomial and segmented regression methods by application to the pollen and diatom profiles from two sediment cores obtained from lakes susceptible to acid precipitation. Clustering and ordination methods have previously been used to determine zones in sediment profiles, but regression methods, which summarize the changes with depth in terms of one or more smooth curves, explicitly use the depth information. Plots of running means were also used to characterize profile shapes. The latter two methods provided a clearer understanding of the changes in diatom and pollen levels in the cores. For low diatom concentrations, non-parametric methods were used to test for a change in concentration with depth. Changes in dry weight of sediment, different bases for concentration and depth, and the effect of poor representation of an extreme group on the pH spectrum were also considered. The regression procedures were shown to provide summaries useful for comparison of different species or of the same species in different cores. Finally, a summary is given of the similarity of the patterns in the depth profiles of eight pollen types and the non-rare diatom species in one core from each of Kejimkujik and Beaverskin Lakes.
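
A running mean of a depth profile, as used above to characterize profile shapes, can be sketched in Python as follows. The window size, the function name, and the toy counts are assumptions; the abstract does not state which window the authors used.

```python
def running_mean(values, window):
    """Means of each run of `window` consecutive values, ordered by depth."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


# Toy counts at successive depths (hypothetical data).
toy_counts = [1, 2, 3, 4, 5]
print(running_mean(toy_counts, window=3))
```

A wider window smooths more strongly but shortens the output, since positions near the top and bottom of the core lack a full window of neighbors.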

Arginine decarboxylase (ADC) and ornithine decarboxylase (ODC) are involved in the biosynthesis of putrescine, which is the precursor of other polyamines in animals, plants, and bacteria. These pyridoxal-5'-phosphate-dependent decarboxylases belong to the alanine racemase (AR) structural family together with diaminopimelate decarboxylase (DapDC), which catalyzes the final step of lysine biosynthesis in bacteria. We have constructed a multiple-sequence alignment of decarboxylases in the AR structural family and, based on the alignment, inferred phylogenetic trees. The phylogenetic tree consists of 3 distinct clades formed by ADC, DapDC, and ODC that diverged from an ancestral decarboxylase. The ancestral decarboxylase probably was able to recognize several substrates, and in archaea and bacteria, ODC may have retained the ability to bind other amino acids. Previously, a paralogue of ODC has been proposed to account for ADC activity detected in mammalian cells. According to our results, this appears unlikely, emphasizing the need for more caution in functional assignment made using sequence data and illustrating the continuing value of phylogenetic analysis in clarifying relationships and putative functions.

Family 2 polysaccharide lyases (PL2s) preferentially catalyze the β-elimination of homogalacturonan using transition metals as catalytic cofactors. PL2 is divided into two subfamilies that have been generally associated with secretion, Mg2+ dependence, and endolysis (subfamily 1) and with intracellular localization, Mn2+ dependence, and exolysis (subfamily 2). When present within a genome, PL2 genes are typically found as tandem copies, which suggests that they provide complementary activities at different stages along a catabolic cascade. This relationship most likely evolved by gene duplication and functional divergence (i.e. neofunctionalization). Although the molecular basis of subfamily 1 endolytic activity is understood, the adaptations within the active site of subfamily 2 enzymes that contribute to exolysis have not been determined. In order to investigate this relationship, we have conducted a comparative enzymatic analysis of enzymes dispersed within the PL2 phylogenetic tree and elucidated the structure of VvPL2 from Vibrio vulnificus YJ016, which represents a transitional member between subfamilies 1 and 2. In addition, we have used ancestral sequence reconstruction to functionally investigate the segregated evolutionary history of PL2 progenitor enzymes and illuminate the molecular evolution of exolysis. This study highlights that ancestral sequence reconstruction in combination with the comparative analysis of contemporary and resurrected enzymes holds promise for elucidating the origins and activities of other carbohydrate active enzyme families and the biological significance of cryptic metabolic pathways, such as pectinolysis within the zoonotic marine pathogen V. vulnificus.

As the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for the annotation of increasingly higher volumes of sequence data, we wanted to assess the strength of the link between coarse grained structural data (i.e., homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we took advantage of 41 phylogenetically diverse (encompassing 11 distinct phyla) genomes recently sequenced within the GEBA initiative, for which we integrated structural information, as defined by CATH, with enzyme level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of domain sequences occurring in the analyzed genomes was found to be associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme versus non-enzyme biased structural classes or excluding highly prevalent folds from the analysis had only a modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated with catalytic activities indicates that, on its own, the power of coarse grained structural information to infer the general property of being an enzyme is rather limited.

Proteins that can interact with multiple partners play central roles in the network of protein-protein interactions. They are called hub proteins, and recently it was suggested that an abundance of intrinsically disordered regions on their surfaces facilitates their binding to multiple partners. However, in those studies, the hub proteins were identified as proteins with multiple partners, regardless of whether the interactions were transient or permanent. As a result, a certain number of hub proteins are subunits of stable multi-subunit proteins, such as supramolecules. It is well known that stable complexes and transient complexes have different structural features, and thus the statistics based on the current definition of hub proteins will hide the true nature of hub proteins. Therefore, in this paper, we first describe a new approach to identify proteins with multiple partners dynamically, using the Protein Data Bank, and then perform statistical analyses of the structural features of these proteins. We refer to the proteins as transient hub proteins or sociable proteins, to clarify the difference from hub proteins. As a result, we found that the main difference between sociable and nonsociable proteins is not the abundance of disordered regions, in contrast to the previous studies, but rather the structural flexibility of the entire protein. We also found a greater predominance of charged and polar residues in sociable proteins than previously reported.

