1.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.
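As a rough illustration of threshold-based redundancy reduction, the Python sketch below keeps a sequence only if it stays under a similarity threshold against every sequence already kept. The toy similarity measure `fraction_identical`, the greedy first-come order, and all names are assumptions made for this example; the procedure in the abstract relies on pairwise alignments and on a threshold derived from the correlation analysis, neither of which is reproduced here.

```python
def fraction_identical(a, b):
    """Toy similarity (assumption): share of identical positions over the
    shorter sequence, compared position by position without alignment."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def reduce_redundancy(sequences, threshold, similarity=fraction_identical):
    """Greedily keep each sequence whose similarity to every previously
    kept sequence is below `threshold`."""
    kept = []
    for seq in sequences:
        if all(similarity(seq, k) < threshold for k in kept):
            kept.append(seq)
    return kept


if __name__ == "__main__":
    seqs = ["MKTLLA", "MKTLLV", "GGSSPP"]
    print(reduce_redundancy(seqs, threshold=0.8))
```

Any alignment-based score can be passed in through the `similarity` argument, so the same loop can be reused when comparing different similarity measures.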
2.
Allosteric interactions between residues that are spatially apart and well separated in sequence are important in the function of multimeric proteins as well as single-domain proteins. This observation suggests that, among the residues that are involved in long-range communications, mutation at one site should affect interactions at a distant site. Adopting a sequence-based approach, we present an automated method that uses a generalization of the familiar sequence entropy in conjunction with a coupled two-way clustering algorithm to predict the network of interactions that trigger allosteric interactions in proteins. We use the method to identify the subset of dynamically important residues in three families, namely, the small PDZ family, G protein-coupled receptors (GPCR), and the Lectins, which are cell-adhesion receptors that mediate the tethering and rolling of leukocytes on inflamed endothelium. For the PDZ and GPCR families, our procedure predicts, in agreement with previous studies, a network containing a small number of residues that are involved in their function. Application to the Lectin family reveals a network of residues interspersed throughout the C-terminal end of the structure that are responsible for binding to ligands. Based on our results and previous studies, we propose that functional robustness requires that only a small subset of distantly connected residues be involved in transmitting allosteric signals in proteins.
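For orientation, the "familiar sequence entropy" mentioned above is usually computed per alignment column as the Shannon entropy of the residue frequencies. The sketch below shows only that basic quantity; the generalization and the coupled two-way clustering used in the abstract are not reproduced, and the function names and the choice to treat a gap as an ordinary symbol are assumptions of this example.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (in bits) of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def alignment_entropies(alignment):
    """Entropy of every column of an alignment given as equal-length strings."""
    return [column_entropy(col) for col in zip(*alignment)]


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "AGF", "AGH"]
    print(alignment_entropies(toy_alignment))
```

A fully conserved column has entropy 0, while a column with four equally frequent residues has entropy 2 bits.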
3.
4.
5.
Investigation of the molecular similarity in closely related protein systems: The PrP case study
Amyloid conversion is a massive, detrimental modification that affects several proteins upon specific physical or chemical stimuli and characterizes a plethora of diseases. In many cases, the amyloidogenic stimuli induce specific structural features in the protein that confer a propensity to misfold and form amyloid deposits. The investigation of mutants, structurally similar to their native isoform but inherently prone to amyloid conversion, may be a viable strategy to elucidate the structural features connected with amyloidogenesis. In this article, we present a computational protocol based on the combination of molecular dynamics (MD) and grid-based approaches suited for the pairwise comparison of closely related protein structures. This method was applied to the cellular prion protein (PrPC) as a case study, in particular to qualify and quantify the structural features conferred by either the E200K mutation or treatment with CaCl2, both of which can induce the scrapie conversion of PrP. Several comparison schemes were developed, applied to this case study, and made suitable for application to other protein systems. For this purpose, in-house Python code was implemented that, together with the parallelization of the GRID force fields program, will broaden the applicability of the proposed computational procedure. Proteins 2015; 83:1751–1765. © 2015 Wiley Periodicals, Inc.
6.
Attwood TK 《Briefings in bioinformatics》2002,3(3):252-263
The PRINTS database houses a collection of protein fingerprints, which may be used to assign family and functional attributes to uncharacterised sequences, such as those currently emanating from the various genome-sequencing projects. The April 2002 release includes 1,700 family fingerprints, encoding approximately 10,500 motifs, covering a range of globular and membrane proteins, modular polypeptides and so on. Fingerprints are groups of conserved motifs that, taken together, provide diagnostic protein family signatures. They derive much of their potency from the biological context afforded by matching motif neighbours; this makes them at once more flexible and powerful than single-motif approaches. The technique further departs from other pattern-matching methods by readily allowing the creation of fingerprints at superfamily-, family- and subfamily-specific levels, thereby allowing more fine-grained diagnoses. Here, we provide an overview of the method of protein fingerprinting and how the results of fingerprint analyses are used to build PRINTS and its relational cousin, PRINTS-S.
7.
The role of pattern databases in sequence analysis (total citations: 2; self-citations: 0; citations by others: 2)
Attwood TK 《Briefings in bioinformatics》2000,1(1):45-59
In the wake of the numerous now-fruitful genome projects, we are entering an era rich in biological data. The field of bioinformatics is poised to exploit this information in increasingly powerful ways, but the abundance and growing complexity both of the data and of the tools and resources required to analyse them are threatening to overwhelm us. Databases and their search tools are now an essential part of the research environment. However, the rate of sequence generation and the haphazard proliferation of databases have made it difficult to keep pace with developments. In an age of information overload, researchers want rapid, easy-to-use, reliable tools for functional characterisation of newly determined sequences. But what are those tools? How do we access them? Which should we use? This review focuses on a particular type of database that is increasingly used in the task of routine sequence analysis--the so-called pattern database. The paper aims to provide an overview of the current status of pattern databases in common use, outlining the methods behind them and giving pointers on their diagnostic strengths and weaknesses.
8.
It has been observed that the size of protein sequence families is unevenly distributed, with few super families with a large number of members and many "orphan" proteins that do not belong to any family. Here it is shown that the distribution of sizes of protein families in different databases and classifications (Protomap, Prodom, Cog) follows a power-law behavior with similar scaling exponents, which is characteristic of self-organizing systems. Since large databases are used in this study, a more detailed analysis of the data than in previous studies was possible. Hence, it is shown that the size distribution is governed by two exponents, different for the super families and the orphan proteins. A simple model of protein evolution is proposed, in which proteins are dynamically generated and clustered into families. The model yields a scaling behavior very similar to the distribution observed in the actual sequence databases, including the two distinct regimes for the large and small families, and thus suggests that the existence of "super families" of proteins and "orphan" proteins are two manifestations of the same evolutionary process.
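To make the notion of a scaling exponent concrete, the sketch below estimates alpha in N(s) ∝ s^(-alpha) by ordinary least squares on log-transformed data, where N(s) is the number of families of size s. This is a simple and fairly crude estimator chosen for illustration, not necessarily the fitting method used in the study; the function name and the input layout (a mapping from family size to family count) are assumptions.

```python
import math


def fit_power_law_exponent(size_counts):
    """Estimate alpha in N(s) ~ s**(-alpha) from a {size: count} mapping
    using a least-squares line through (log s, log N)."""
    points = [(math.log(s), math.log(n))
              for s, n in size_counts.items() if s > 0 and n > 0]
    if len(points) < 2:
        raise ValueError("need at least two (size, count) points")
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx  # the slope of the log-log line is -alpha


if __name__ == "__main__":
    synthetic = {s: 1000.0 * s ** -2.0 for s in (1, 2, 4, 8, 16)}
    print(fit_power_law_exponent(synthetic))
```

Two regimes with different exponents, as described above, could be examined by calling the function separately on the small-family and large-family size ranges.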
9.
In this work we examine how protein structural changes are coupled with sequence variation in the course of evolution of a family of homologs. The sequence-structure correlation analysis performed on 81 homologous protein families shows that the majority of them exhibit statistically significant linear correlation between the measures of sequence and structural similarity. We observed, however, that there are cases where structural variability cannot be mainly explained by sequence variation, such as protein families with a number of disulfide bonds. To understand whether structures from different families and/or folds evolve in the same manner, we compared the degrees of structural change per unit of sequence change ("the evolutionary plasticity of structure") between those families with a significant linear correlation. Using rigorous statistical procedures we find that, with a few exceptions, evolutionary plasticity does not show a statistically significant difference between protein families. Similar sequence-structure analysis performed for protein loop regions shows that evolutionary plasticity of loop regions is greater than for the protein core.
10.
Protein multiple sequence alignment is an important bioinformatics tool, with applications in the analysis of biological evolution and in protein structure prediction. A variety of alignment algorithms in this field have achieved great success. However, each algorithm has its own inherent deficiencies. In this paper, permutation similarity is proposed to evaluate several protein multiple sequence alignment algorithms that are currently in wide use. Because the permutation similarity method considers only the relative order of the evolutionary distances between proteins and ignores slight differences between those distances, it yields more robust evaluations. The longest common subsequence method is adopted to define the similarity between different permutations. Using these methods, we assessed Dialign, Tcoffee, ClustalW and Muscle and made comparisons among them.
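The idea of comparing two orderings through their longest common subsequence (LCS) can be sketched as follows. The dynamic-programming LCS is standard; dividing the LCS length by the permutation length to obtain a score between 0 and 1 is an assumption of this example, since the abstract does not state how the similarity is normalized.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences,
    computed row by row with O(len(b)) memory."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def permutation_similarity(perm_a, perm_b):
    """LCS length divided by permutation length (normalization is an assumption)."""
    if not perm_a or len(perm_a) != len(perm_b):
        raise ValueError("permutations must be non-empty and of equal length")
    return lcs_length(perm_a, perm_b) / len(perm_a)


if __name__ == "__main__":
    # Proteins ranked by evolutionary distance under two hypothetical alignments.
    print(permutation_similarity(["A", "B", "C", "D"], ["A", "C", "B", "D"]))
```

Because only the relative order enters the score, small numerical differences in the underlying distances leave the result unchanged as long as the ranking is the same.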
11.
Akinori Kidera Yasuo Konishi Tatsuo Ooi Harold A. Scheraga 《Journal of Protein Chemistry》1985,4(5):265-297
In a previous paper we obtained ten (orthogonal) factors, linear combinations of which can express the properties of the 20 naturally occurring amino acids. In this paper, we assume that the most important properties (linear combinations of these ten factors) that determine the three-dimensional structure of a protein are conserved properties, i.e., are those that have been conserved during evolution. Two definitions of a conserved property are presented: (1) a conserved property for an average protein is defined as that linear combination of the ten factors that optimally expresses the similarity of one amino acid to another (hence, little change during evolution), as given by the relatedness odds matrix of Dayhoff et al.; (2) a conserved property for each position in the amino acid sequence (locus) of a specific family of homologous proteins (the cytochrome c family or the globin family) is defined as that linear combination of the ten factors that is common among a set of amino acids at a given locus when the sequences are properly aligned. When the specificity at each locus is averaged over all loci, the same features are observed for three expressions of these two definitions, namely the conserved property for an average protein, the average conserved property for the cytochrome c family, and the average conserved property for the globin family; we find that bulk and hydrophobicity (information about packing and long-range interactions) are more important than other properties, such as the preference for adopting a specific backbone structure (information about short-range interactions). We also demonstrate that the sequence profile of a conserved property, defined for each locus of a protein family (definition 2), corresponds uniquely to the three-dimensional structure, while the conserved property for an average protein (definition 1) is not useful for the prediction of protein structure.
The amino acid sequences of numerous proteins are searched to find those that are similar, in terms of the conserved properties (definition 2), to sequences of the same size from one of the homologous families (cytochrome c and globin, respectively) for whose loci the conserved properties were defined. Many similar sequences are found, the number of similarities decreasing with increasing size of the segment. However, the segments must be rather long (15 residues) before the comparisons become meaningful. As an example, one sufficiently large sequence (20 residues) from a protein of known structure (apo-liver alcohol dehydrogenase, which is not a member of either family) is found to be similar in the conserved properties to a particular sequence of a member of the family of human hemoglobin chains, and the two sequences have similar structures. This means that, since conserved properties are expected to be structure determinants, we can use the conserved properties to predict an initial protein structure for subsequent energy minimization for a protein for which the conserved properties are similar to those of a family of proteins with a sufficiently large number of homologous amino acid sequences; such a large number of homologous sequences is required to define a conserved property for each locus of the homologous protein family.
12.
13.
Barak Y Handelsman T Nakar D Mechaly A Lamed R Shoham Y Bayer EA 《Journal of molecular recognition : JMR》2005,18(6):491-501
Cellulosomes are multi-enzyme complexes that orchestrate the efficient degradation of cellulose and related plant cell wall polysaccharides. The complex is maintained by the high-affinity protein-protein interaction between two complementary modules: the cohesin and the dockerin. In order to characterize the interaction between different cohesins and dockerins, we have developed matching fusion-protein systems, which harbor either the cohesin or the dockerin component. For this purpose, corresponding plasmid cassettes were designed, which encoded the following carrier proteins: (i) a thermostable xylanase with an appended His-tag; and (ii) a highly stable cellulose-binding module (CBM). The resultant xylanase-dockerin and CBM-cohesin fusion products exhibited high expression levels of soluble protein. The expressed, affinity-purified proteins were extremely stable, and the functionality of the cohesin or dockerin component was retained. The fusion protein system was used to establish a sensitive and reliable, semi-quantitative enzyme-linked affinity assay for determining multiple samples of cohesin-dockerin interactions in microtiter plates. A variety of cohesin-dockerin systems, which had been examined previously using other methodologies, were revisited applying the affinity-based enzyme assay, the results of which served to verify the validity of the approach.
14.
15.
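A knowledge-base-driven land cover change detection method using the spectral vector similarity of image segments (total citations: 1; self-citations: 0; citations by others: 1)
Land use/cover change detection is an important topic in global change research in China and abroad, and choosing an appropriate change detection method for studying land use/cover change in Northwest China is of great significance to the "Ecological Decade Project". The area covered by TM path/row 134033, which is typical of Northwest China, was selected as the test area for validating the change detection method. Using two Landsat TM images from 2005 and 2010 and the eCognition Developer 8.64 software, change detection was performed with a method based on the similarity of the spectral feature vectors of image segments. The 2010 land cover data were then used as a prior knowledge base to classify the changed areas, extract land use/cover change information, and analyze the change results quantitatively. The results show that the method based on the spectral feature vector similarity of image segments detects change quickly and with high accuracy when mapping land use/cover change in the test area, and that it is suitable for detecting land use/cover change in the test area and across Northwest China. Finally, this method, together with post-classification comparison, was used to obtain land use/cover classification maps of Northwest China for the decade from 2000 to 2010.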
16.
A new machine learning algorithm, LESTAT (LEngth and STructure-based sequence Alignment Tool), has been developed for detecting protein homologs having low sequence identity. LESTAT is an iterative profile-based method that runs without reliance on a predefined library and incorporates several novel features that enhance its ability to identify remote sequences. To overcome the inherent bias associated with a single starting model, LESTAT utilizes three structural homologs to create a profile consisting of structurally conserved positions and block separation distances. Subsequent profiles are refined iteratively using sequence information obtained from previous cycles. Additionally, the refinement process incorporates a "lock-in" feature to retain the high-scoring sequences involved in previous alignments for subsequent model building and an enhancement factor to complement the weighting scheme used to build the position specific scoring matrix. A comparison of the performance of LESTAT against PSI-BLAST for seven systems reveals that LESTAT exhibits increased sensitivity and specificity over PSI-BLAST in six of these systems, based on the number of true homologs detected and the number of families these homologs covered. Notably, many of the hits identified are unique to each method, presumably resulting from the distinct differences in the two approaches. Taken together, these findings suggest that LESTAT is a useful complementary method to PSI-BLAST in the detection of distant homologs.
17.
Definition of the tempo of sequence diversity across an alignment and automatic identification of sequence motifs: Application to protein homologous families and superfamilies
May AC 《Protein science : a publication of the Protein Society》2002,11(12):2825-2835
It is often possible to identify sequence motifs that characterize a protein family in terms of its fold and/or function from aligned protein sequences. Such motifs can be used to search for new family members. Partitioning of sequence alignments into regions of similar amino acid variability is usually done by hand. Here, I present a completely automatic method for this purpose: one that is guaranteed to produce globally optimal solutions at all levels of partition granularity. The method is used to compare the tempo of sequence diversity across reliable three-dimensional (3D) structure-based alignments of 209 protein families (HOMSTRAD) and that for 69 superfamilies (CAMPASS). (The mean alignment lengths for HOMSTRAD and CAMPASS are very similar.) Surprisingly, the optimal segmentation distributions for the closely related proteins and distantly related ones are found to be very similar. Also, optimal segmentation identifies an unusual protein superfamily. Finally, protein 3D structure clues from the tempo of sequence diversity across alignments are examined. The method is general, and could be applied to any area of comparative biological sequence and 3D structure analysis where the constraint of the inherent linear organization of the data imposes an ordering on the set of objects to be clustered.
18.
Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of the sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also find that a simple parameter, defined as the number of PSSMs of a family that are hit by a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple profiles search approach.
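The parameter described at the end of this abstract, the number of PSSMs of a family hit by a query divided by the total number of PSSMs in that family, can be computed with a few lines of Python. The input layout (a list of `(family, pssm_id)` hits and a dictionary of family sizes), the function name, and the example identifiers are hypothetical; no cutoff is suggested here because the abstract does not give one.

```python
def family_hit_fractions(hits, family_sizes):
    """For each family with at least one hit, return
    (distinct PSSMs of that family hit by the query) / (total PSSMs in the family).

    hits: iterable of (family, pssm_id) pairs reported for one query.
    family_sizes: dict mapping family name to its total number of PSSMs.
    """
    hit_ids = {}
    for family, pssm_id in hits:
        # A set ensures a PSSM hit more than once is counted only once.
        hit_ids.setdefault(family, set()).add(pssm_id)
    return {family: len(ids) / family_sizes[family]
            for family, ids in hit_ids.items()}


if __name__ == "__main__":
    query_hits = [("kinase", "k1"), ("kinase", "k2"),
                  ("kinase", "k2"), ("globin", "g1")]
    sizes = {"kinase": 4, "globin": 10}
    print(family_hit_fractions(query_hits, sizes))
```

A family whose PSSMs are hit in a large proportion would score close to 1, whereas an isolated hit in a large family gives a small fraction.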
19.
Comparative docking is based on experimentally determined structures of protein-protein complexes (templates), following the paradigm that proteins with similar sequences and/or structures form similar complexes. Modeling utilizing structure similarity of target monomers to template complexes significantly expands structural coverage of the interactome. Template-based docking by structure alignment can be performed for the entire structures or by aligning targets to the bound interfaces of the experimentally determined complexes. Systematic benchmarking of docking protocols based on full and interface structure alignment showed that both protocols perform similarly, with a top 1 docking success rate of 26%. However, in terms of the models' quality, the interface-based docking performed marginally better. The interface-based docking is preferable when one would suspect a significant conformational change in the full protein structure upon binding, for example, a rearrangement of the domains in multidomain proteins. Importantly, if the same structure is selected as the top template by both full and interface alignment, the docking success rate increases 2-fold for both top 1 and top 10 predictions. Matching structural annotations of the target and template proteins for template detection, as a computationally less expensive alternative to structural alignment, did not improve the docking performance. Sophisticated remote sequence homology detection added templates to the pool of those identified by structure-based alignment, suggesting that for practical docking, the combination of the structure alignment protocols and the remote sequence homology detection may be useful in order to avoid potential flaws in generation of the structural templates library.
20.
The amino-acid sequences of soluble, globular proteins must have hydrophobic residues to form a stable core, but excess sequence hydrophobicity can lead to loss of native state conformational specificity and aggregation. Previous studies of polar-to-hydrophobic mutations in the β-sheet of the Arc repressor dimer showed that a single substitution at position 11 (N11L) leads to population of an alternate dimeric fold in which the β-sheet is replaced by helix. Two additional hydrophobic mutations at positions 9 and 13 (Q9V and R13V) lead to population of a differently folded octamer along with both dimeric folds. Here we conduct a comprehensive study of the sequence determinants of this progressive loss of fold specificity. We find that the alternate dimer-fold specifically results from the N11L substitution and is not promoted by other hydrophobic substitutions in the β-sheet. We also find that three highly hydrophobic substitutions at positions 9, 11, and 13 are necessary and sufficient for oligomer formation, but the oligomer size depends on the identity of the hydrophobic residue in question. The hydrophobic substitutions increase thermal stability, illustrating how increased hydrophobicity can increase folding stability even as it degrades conformational specificity. The oligomeric variants are predicted to be aggregation-prone but may be hindered from doing so by proline residues that flank the β-sheet region. Loss of conformational specificity due to increased hydrophobicity can manifest itself at any level of structure, depending upon the specific mutations and the context in which they occur.