Found 20 similar articles (search time: 0 ms)
1.
Zakharia M. Frenkel, Zeev M. Frenkel, Edward N. Trifonov, Sagi Snir 《Journal of theoretical biology》2009,260(3):438-444
A novel approach for evaluation of sequence relatedness via a network over the sequence space is presented. This relatedness is quantified by graph theoretical techniques. The graph is perceived as a flow network, and flow algorithms are applied. The number of independent pathways between nodes in the network is shown to reflect structural similarity of corresponding protein fragments. These results provide an appropriate parameter for quantitative estimation of such relatedness, as well as reliability of the prediction. They also demonstrate a new potential for sequence analysis and comparison by means of the flow network in the sequence space.
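As an illustration of counting independent pathways with a flow algorithm, here is a minimal Python sketch. The abstract does not say which flow algorithm is used or whether the pathways are edge- or node-independent, so this assumes edge-disjoint paths in an undirected graph with unit capacities and uses a simple breadth-first augmenting-path search. The function name and the toy graph are hypothetical.

```python
from collections import defaultdict, deque


def count_independent_paths(edges, source, target):
    """Count edge-disjoint paths between two nodes via unit-capacity max flow.

    Assumption: "independent pathways" is read here as edge-disjoint paths
    in an undirected graph; the abstract does not fix this definition.
    """
    if source == target:
        raise ValueError("source and target must differ")

    # Each undirected edge contributes capacity 1 in both directions.
    capacity = defaultdict(int)
    neighbors = defaultdict(set)
    for u, v in edges:
        capacity[(u, v)] += 1
        capacity[(v, u)] += 1
        neighbors[u].add(v)
        neighbors[v].add(u)

    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = {source: None}
        queue = deque([source])
        while queue and target not in parent:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and capacity[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if target not in parent:
            return flow  # no augmenting path left: flow is maximal

        # Push one unit of flow back along the path that was found.
        v = target
        while parent[v] is not None:
            u = parent[v]
            capacity[(u, v)] -= 1
            capacity[(v, u)] += 1
            v = u
        flow += 1


# Hypothetical toy network: fragments A..D as nodes, similarity links as edges.
toy_edges = [("A", "B"), ("B", "D"), ("A", "C"), ("C", "D"), ("B", "C")]
paths_a_to_d = count_independent_paths(toy_edges, "A", "D")
```

In this toy graph, A and D are joined by two edge-disjoint routes (through B and through C), so the count is 2. A higher count between two fragments would correspond to the stronger relatedness the abstract describes.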
2.
Evolutionary transitions in protein fold space (total citations: 6; self-citations: 6; citations by others: 0)
Taylor WR 《Current opinion in structural biology》2007,17(3):354-361
With the number of known protein folds potentially approaching completion, the problems associated with their systematic classification are evaluated. It is argued that it will be difficult, if not impossible, to find a general metric based on pairwise comparison that will provide a satisfactory classification. It is suggested that some progress may be made through comparison against a library of idealised 'template' folds, but a proper solution can only be attained if this includes a model of the underlying evolutionary processes. These processes are considered with examples of some unexpected relationships among folds, including circular permutations. The problem is finally set in the wider context of the genetic environment, introducing complications relating to introns, gene fixation and population size.
3.
The process of protein engineering is currently evolving towards a heuristic understanding of the sequence-function relationship. Improved DNA sequencing capacity, efficient protein function characterization and improved quality of data points in conjunction with well-established statistical tools from other industries are changing the protein engineering field. Algorithms capturing the heuristic sequence-function relationships will have a drastic impact on the field of protein engineering. In this review, several alternative approaches to quantitatively assess sequence space are discussed and the relatively few examples of wet-lab validation of statistical sequence-function characterization/correlation are described.
4.
Konstantinos Blekas, Dimitrios I Fotiadis, Aristidis Likas 《Journal of computational biology》2005,12(1):64-82
We present a system for multi-class protein classification based on neural networks. The basic issue concerning the construction of neural network systems for protein classification is the sequence encoding scheme that must be used in order to feed the neural network. To deal with this problem we propose a method that maps a protein sequence into a numerical feature space using the matching scores of the sequence to groups of conserved patterns (called motifs) into protein families. We consider two alternative ways for identifying the motifs to be used for feature generation and provide a comparative evaluation of the two schemes. We also evaluate the impact of the incorporation of background features (2-grams) on the performance of the neural system. Experimental results on real datasets indicate that the proposed method is highly efficient and is superior to other well-known methods for protein classification.
5.
Following the original idea of Maynard Smith on evolution of the protein sequence space, a novel tool is developed that allows a "space walk" from one sequence to its likely evolutionary relative and further on. At a given threshold of identity between consecutive steps, walks of many steps are possible. The sequences at the ends of the walks may differ substantially from one another. In a sequence space of randomized (shuffled) sequences the walks are very short. The approach opens new perspectives for protein evolutionary studies and sequence annotation.
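The idea of a walk with an identity threshold between consecutive steps can be sketched in a few lines of Python. The abstract does not describe the actual tool, so this assumes equal-length fragments, identity measured as the fraction of matching positions, and a breadth-first expansion from a starting fragment. All names and the toy fragments are hypothetical.

```python
from collections import deque


def identity(a, b):
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def space_walk(start, fragments, threshold):
    """Map each reachable fragment to its number of steps from `start`.

    A step is allowed only between fragments whose identity is at least
    `threshold` (an assumed reading of the abstract's "threshold of identity").
    """
    steps = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for candidate in fragments:
            if candidate not in steps and identity(current, candidate) >= threshold:
                steps[candidate] = steps[current] + 1
                queue.append(candidate)
    return steps


# Hypothetical toy fragments; real input would be protein fragments.
toy_fragments = ["AAAA", "AAAT", "AATT", "ATTT", "GGGG"]
walk = space_walk("AAAA", toy_fragments, threshold=0.75)
```

Here the walk reaches "ATTT" in three steps even though it shares only one position in four with the starting fragment, which mirrors the observation that the ends of a walk may differ substantially. "GGGG" is never reached.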
6.
It is generally accepted that many different protein sequences have similar folded structures, and that there is a relatively high probability that a new sequence possesses a previously observed fold. An indirect consequence of this is that protein design should define the sequence space accessible to a given structure, rather than providing a single optimized sequence. We have recently developed a new approach for protein sequence design, which optimizes the complete sequence of a protein based on the knowledge of its backbone structure, its amino acid composition and a physical energy function including van der Waals interactions, electrostatics, and environment free energy. The specificity of the designed sequence for its template backbone is imposed by keeping the amino acid composition fixed. Here, we show that our procedure converges in sequence space, albeit not to the native sequence of the protein. We observe that while polar residues are well conserved in our designed sequences, non-polar amino acids at the surface of a protein are often replaced by polar residues. The designed sequences provide a multiple alignment of sequences that all adopt the same three-dimensional fold. This alignment is used to derive a profile matrix for chicken triose phosphate isomerase, TIM. The matrix is found to recognize significantly the native sequence for TIM, as well as closely related sequences. Possible application of this approach to protein fold recognition is discussed.
7.
J. Hanke, G. Beckmann, P. Bork, J. G. Reich 《Protein science : a publication of the Protein Society》1996,5(1):72-82
We present a method based on hierarchical self-organizing maps (SOMs) for recognizing patterns in protein sequences. The method is fully automatic, does not require prealigned sequences, is insensitive to redundancy in the training set, and works surprisingly well even with small learning sets. Because it uses unsupervised neural networks, it is able to extract patterns that are not present in all of the unaligned sequences of the learning set. The identification of these patterns in sequence databases is sensitive and efficient. The procedure comprises three main training stages. In the first stage, one SOM is trained to extract common features from the set of unaligned learning sequences. A feature is a number of ungapped sequence segments (usually 4-16 residues long) that are similar to segments in most of the sequences of the learning set according to an initial similarity matrix. In the second training stage, the recognition of each individual feature is refined by selecting an optimal weighting matrix out of a variety of existing amino acid similarity matrices. In a third stage of the SOM procedure, the position of the features in the individual sequences is learned. This allows for variants with feature repeats and feature shuffling. The procedure has been successfully applied to a number of notoriously difficult cases with distinct recognition problems: helix-turn-helix motifs in DNA-binding proteins, the CUB domain of developmentally regulated proteins, and the superfamily of ribokinases. A comparison with the established database search procedure PROFILE (and with several others) led to the conclusion that the new automatic method performs satisfactorily.
8.
The organization of protein structures in protein genotype space is well studied. The same does not hold for protein functions, whose organization is important to understand how novel protein functions can arise through blind evolutionary searches of sequence space. In systems other than proteins, two organizational features of genotype space facilitate phenotypic innovation. The first is that genotypes with the same phenotype form vast and connected genotype networks. The second is that different neighborhoods in this space contain different novel phenotypes. We here characterize the organization of enzymatic functions in protein genotype space, using a data set of more than 30,000 proteins with known structure and function. We show that different neighborhoods of genotype space contain proteins with very different functions. This property both facilitates evolutionary innovation through exploration of a genotype network, and it constrains the evolution of novel phenotypes. The phenotypic diversity of different neighborhoods is caused by the fact that some functions can be carried out by multiple structures. We show that the space of protein functions is not homogeneous, and different genotype neighborhoods tend to contain a different spectrum of functions, whose diversity increases with increasing distance of these neighborhoods in sequence space. Whether a protein with a given function can evolve specific new functions is thus determined by the protein's location in sequence space.
9.
From protein sequence space to elementary protein modules (total citations: 2; self-citations: 0; citations by others: 2)
The formatted protein sequence space is built from identical-size fragments of prokaryotic proteins (112 complete proteomes). Connecting sequence-wise similar fragments (points in the space) results in the formation of numerous networks, which sometimes combine different types of proteins that nevertheless share fragments with similar or distantly related sequences. The networks are mapped on individual protein sequences, revealing distinct regions (modules) associated with prominent networks with well-defined functional identities. The presence of multiple sites of sequence conservation (modules) in a given protein sequence suggests that the annotated protein function may be decomposed into "elementary" subfunctions of the respective modules. The modules correspond to previously discovered conserved closed loop structures and their sequence prototypes.
10.
《Journal of Fermentation and Bioengineering》1995,79(2):107-118
A landscape in protein sequence space shows the relationship between the primary structure and the level of a property of each protein. We developed methods for observing local landscapes experimentally using catalase I from Bacillus stearothermophilus with respect to its catalatic activity, peroxidatic activity, and thermostability. The enzyme gene was randomly mutated and a mutant library composed of 2648 transformants was obtained. Based on the activity and productivity of these transformants, 82 were selected as a sample group for measuring the altitude of catalase I. The altitude of the wild-type enzyme is close to the highest level in the mutant population for the thermostability landscape, but is at the average level for the peroxidatic activity. As for the catalatic activity, its altitude lies in between the two positions. A positive correlation was found between the altitudes of the catalatic and the peroxidatic activities, indicating that the locations of the hills and valleys in the landscapes of the two activities roughly correspond with each other. In contrast, the thermostability landscape appeared quite different. The smoothness of the landscape was examined via the number of mutations in the structural genes of the mutant enzymes of different properties. The correlation between the number of mutations and the level of each property showed that the thermostability landscape is smooth, but not the two activity landscapes. Thus, the results show that even from a rough sketch of the landscapes based on the experimental data, the characteristic features of catalase I can be elucidated. The sketch of a landscape, therefore, provides a new view in understanding enzymes.
11.
Naturally occurring proteins comprise a special subset of all plausible sequences and structures selected through evolution. Simulating protein evolution with simplified and all-atom models has shed light on the evolutionary dynamics of protein populations, the nature of evolved sequences and structures, and the extent to which today's proteins are shaped by selection pressures on folding, structure and function. Extensive mapping of the native structure, stability and folding rate in sequence space using lattice proteins has revealed organizational principles of the sequence/structure map important for evolutionary dynamics. Evolutionary simulations with lattice proteins have highlighted the importance of fitness landscapes, evolutionary mechanisms, population dynamics and sequence space entropy in shaping the generic properties of proteins. Finally, evolutionary-like simulations with all-atom models, in particular computational protein design, have helped identify the dominant selection pressures on naturally occurring protein sequences and structures.
12.
Babajide A, Farber R, Hofacker IL, Inman J, Lapedes AS, Stadler PF 《Journal of theoretical biology》2001,212(1):35-46
Knowledge-based potentials can be used to decide whether an amino acid sequence is likely to fold into a prescribed native protein structure. We use this idea to survey the sequence-structure relations in protein space. In particular, we test the following two propositions which were found to be important for efficient evolution: the sequences folding into a particular native fold form extensive neutral networks that percolate through sequence space. The neutral networks of any two native folds approach each other to within a few point mutations. Computer simulations using two very different potential functions, M. Sippl's PROSA pair potential and a neural network based potential, are used to verify these claims.
13.
Background
It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, an automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.
Results
In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% of its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) get assigned a biological function with a high confidence. The clustering and compression processes are unsupervised, and robust.
Conclusions
We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.
14.
Gu X 《Genetics》2007,175(4):1813-1822
In this article, we develop an evolutionary model for protein sequence evolution. Gene pleiotropy is characterized by K distinct but correlated components (molecular phenotypes) that affect the organismal fitness. These K molecular phenotypes are under stabilizing selection with microadaptation (SM) due to random optima shifts, the SM model. Random coding mutations generate a correlated distribution of K molecular phenotypes. Under this SM model, we further develop a statistical method to estimate the "effective" number of molecular phenotypes (K(e)) of the gene. Therefore, for the first time we can empirically evaluate gene pleiotropy from the protein sequence analysis. Case studies of vertebrate proteins indicate that K(e) is typically approximately 6-9. We demonstrate that the newly developed SM model of protein evolution may provide a basis for exploring genomic evolution and correlations.
15.
The N-degrons, a set of degradation signals recognized by the N-end rule pathway, comprise a protein's destabilizing N-terminal residue and an internal lysine residue. We show that the strength of an N-degron can be markedly increased, without loss of specificity, through the addition of lysine residues. A nearly exhaustive screen was carried out for N-degrons in the lysine (K)-asparagine (N) sequence space of the 14-residue peptides containing either K or N (16 384 different sequences). Of these sequences, 68 were found to function as N-degrons, and three of them were at least as active and specific as any of the previously known N-degrons. All 68 K/N-based N-degrons lacked the lysine at position 2, and all three of the strongest N-degrons contained lysines at positions 3 and 15. The results support a model of the targeting mechanism in which the binding of the E3-E2 complex to the substrate's destabilizing N-terminal residue is followed by a stochastic search for a sterically suitable lysine residue. Our strategy of screening a small library that encompasses the entire sequence space of two amino acids should be of use in many settings, including studies of protein targeting and folding.
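The library size quoted above follows from two choices (K or N) at each of 14 positions, that is 2^14 = 16 384 sequences. A minimal Python sketch of enumerating such a two-letter sequence space is shown below. The function name is hypothetical, and no position-based filtering is attempted because the abstract does not state how its position numbers map onto the 14 variable residues.

```python
from itertools import product


def kn_sequence_space(length=14, alphabet="KN"):
    """Yield every sequence of `length` residues drawn from `alphabet`."""
    for residues in product(alphabet, repeat=length):
        yield "".join(residues)


# Count the full two-letter space without storing it in memory.
library_size = sum(1 for _ in kn_sequence_space())
```

Enumerating the whole space is cheap at this size, which is what makes an exhaustive screen over a two-residue alphabet feasible in the first place.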
16.
Designating amino-acid sequences that fold into a common main-chain structure as "neutral sequences" for the structure, regardless of their function or stability, we investigated the distribution of neutral sequences in protein sequence space. For four distinct target structures (alpha, beta, alpha/beta and alpha+beta types) with the same chain length of 108, we generated the respective neutral sequences by using the inverse folding technique with a knowledge-based potential function. We assumed that neutral sequences for a protein structure have Z scores higher than or equal to fixed thresholds, where thresholds are defined as the Z score for the corresponding native sequence (case 1) or a much greater Z score (case 2). An exploring walk simulation suggested that the neutral sequences mapped into the sequence space were connected with each other through straight neutral paths and formed an inherent neutral network over the sequence space. Through another exploring walk simulation, we investigated contiguous regions between or among the neutral networks for the distinct protein structures and obtained the following results. The closest approach distance between the two neutral networks ranged from 5 to 29 on the Hamming distance scale, showing a linear increase against the threshold values. The sequences located at the "interchange" regions between the two neutral networks have intermediate sequence-profile-scores for both corresponding structures. Introducing a "ball" in the sequence space that contains at least one neutral sequence for each of the four structures, we found that the minimal radius of the ball that is centered at an arbitrary position ranged from 35 to 50, while the minimal radius of the ball that is centered at a certain special position ranged from 20 to 30, on the Hamming distance scale. The relatively small Hamming distances (5-30) may support an evolution mechanism by transferring from a network for a structure to another network for a more beneficial structure via the interchange regions.
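The closest approach distance between two neutral networks can be illustrated with a brute-force Python sketch: compute the Hamming distance for every pair of sequences, one from each network, and take the minimum. The abstract obtains its values from exploring walk simulations, which this does not reproduce. The function names and the toy sequences are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def closest_approach(network_a, network_b):
    """Smallest Hamming distance between any sequence pair across two sets."""
    return min(hamming(a, b) for a in network_a for b in network_b)


# Hypothetical toy "networks"; the study used chains of length 108.
network_1 = ["AAAAAA", "AAAAAT", "AAAATT"]
network_2 = ["TTTTTT", "ATTTTT", "AATTTT"]
distance = closest_approach(network_1, network_2)
```

For these toy sets the nearest pair is "AAAATT" and "AATTTT", which differ at two positions. The pairwise scan is quadratic in the number of sequences, so it only suits small samples.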
17.
We have mined the evolutionary record for the large family of intracellular lipid-binding proteins (iLBPs) by calculating the statistical coupling of residue variations in a multiple sequence alignment using methods developed by Ranganathan and coworkers (Lockless and Ranganathan, Science 1999:286;295-299). The 213 sequences analyzed have a wide range of ligand-binding functions as well as highly divergent phylogenetic origins, assuring broad sampling of sequence space. Emerging from this analysis were two major clusters of coupled residues, which when mapped onto the structure of a representative iLBP under study in our laboratory, cellular retinoic-acid binding protein I, are largely contiguous and provide useful points of comparison to available data for the folding of this protein. One cluster comprises a predominantly hydrophobic core away from the ligand-binding site and likely represents key structural information for the iLBP fold. The other cluster includes the portal region where ligand enters its binding site, regions of the ligand-binding cavity, and the region where the 10-stranded beta-barrel characteristic of this family closes (between strands 1' and 10). Linkages between these two clusters suggest that evolutionary pressures on this family constrain structural and functional sequence information in an interdependent fashion. The necessity of the structure to wrap around a hydrophobic ligand confounds the typical sequestration of hydrophobic side chains. Additionally, ligand entry and exit require these structures to have a capacity for specific conformational change during binding and release. We conclude that an essential and structurally apparent separation of local and global sequence information is conserved throughout the iLBP family.
18.
MOTIVATION: The study of sequence space, and the deciphering of the structure of protein families and subfamilies, has up to now been required for work in comparative genomics and for the prediction of protein function. With the emergence of structural proteomics projects, it is becoming increasingly important to be able to select protein targets for structural studies that will appropriately cover the space of protein sequences, functions and genomic distribution. These problems are the motivation for the development of methods for clustering protein sequences and building families of potentially orthologous sequences, such as those proposed here. RESULTS: First we developed a clustering strategy (Ncut algorithm) capable of forming groups of related sequences by assessing their pairwise relationships. The results presented for the ras super-family of proteins are similar to those produced by other clustering methods, but without the need for clustering the full sequence space. The Ncut clusters are then used as the input to a process of reconstruction of groups with equilibrated genomic composition formed by closely-related sequences. The results of applying this technique to the data set used in the construction of the COG database are very similar to those derived by the human experts responsible for this database. AVAILABILITY: The analysis of different systems, including the COG equivalent 21 genomes, is available at http://www.pdg.cnb.uam.es/GenoClustering.html.
19.
Evolutionary protein engineering has been dramatically successful, producing a wide variety of new proteins with altered stability, binding affinity, and enzymatic activity. However, the success of such procedures is often unreliable, and the impact of the choice of protein, engineering goal, and evolutionary procedure is not well understood. We have created a framework for understanding aspects of the protein engineering process by computationally mapping regions of feasible sequence space for three small proteins using structure-based design protocols. We then tested the ability of different evolutionary search strategies to explore these sequence spaces. The results point to a non-intuitive relationship between the error-prone PCR mutation rate and the number of rounds of replication. The evolutionary relationships among feasible sequences reveal hub-like sequences that serve as particularly fruitful starting sequences for evolutionary search. Moreover, genetic recombination procedures were examined, and tradeoffs relating sequence diversity and search efficiency were identified. This framework allows us to consider the impact of protein structure on the allowed sequence space and therefore on the challenges that each protein presents to error-prone PCR and genetic recombination procedures.
20.
MOTIVATION: A global view of the protein space is essential for functional and evolutionary analysis of proteins. In order to achieve this, a similarity network can be built using pairwise relationships among proteins. However, existing similarity networks employ a single similarity measure and therefore their utility depends highly on the quality of the selected measure. A more robust representation of the protein space can be realized if multiple sources of information are used. RESULTS: We propose a novel approach for analyzing multi-attribute similarity networks by combining random walks on graphs with Bayesian theory. A multi-attribute network is created by combining sequence and structure based similarity measures. For each attribute of the similarity network, one can compute a measure of affinity from a given protein to every other protein in the network using random walks. This process makes use of the implicit clustering information of the similarity network, and we show that it is superior to naive, local ranking methods. We then combine the computed affinities using a Bayesian framework. In particular, when we train a Bayesian model for automated classification of a novel protein, we achieve high classification accuracy and outperform single attribute networks. In addition, we demonstrate the effectiveness of our technique by comparison with a competing kernel-based information integration approach.
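One common way to compute an affinity from a given node to every other node with random walks is a random walk with restart, sketched below in plain Python. The abstract does not say which random-walk variant the authors use, so the restart formulation, the restart probability of 0.15, the fixed iteration count, and all names are assumptions made for illustration. The Bayesian combination step is not covered.

```python
def random_walk_affinity(weights, start, restart=0.15, iterations=100):
    """Approximate random-walk-with-restart affinities from `start`.

    `weights` maps each node to a dict of neighbor -> positive similarity.
    Every neighbor must itself appear as a key of `weights`.
    """
    nodes = list(weights)
    prob = {node: 0.0 for node in nodes}
    prob[start] = 1.0

    for _ in range(iterations):
        # With probability `restart` the walker jumps back to the start node.
        nxt = {node: 0.0 for node in nodes}
        nxt[start] += restart
        for u in nodes:
            total = sum(weights[u].values())
            if total == 0:
                # A node without links sends its remaining mass to the start.
                nxt[start] += (1 - restart) * prob[u]
                continue
            # Otherwise move to a neighbor in proportion to edge weight.
            for v, w in weights[u].items():
                nxt[v] += (1 - restart) * prob[u] * w / total
        prob = nxt
    return prob


# Hypothetical toy similarity network: a chain a - b - c with unit weights.
toy_network = {
    "a": {"b": 1.0},
    "b": {"a": 1.0, "c": 1.0},
    "c": {"b": 1.0},
}
affinity = random_walk_affinity(toy_network, "a")
```

Starting from "a", the directly linked node "b" receives a higher affinity than the more distant "c". Unlike a purely local ranking by edge weight, the score of "c" is nonzero even though it has no direct link to "a".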