Similar literature
 20 similar articles found (search time: 21 ms)
1.
A statistical reference for RNA secondary structures with minimum free energies is computed by folding large ensembles of random RNA sequences. Four nucleotide alphabets are used: two binary alphabets, AU and GC, the biophysical AUGC and the synthetic GCXK alphabet. RNA secondary structures are made of structural elements, such as stacks, loops, joints, and free ends. Statistical properties of these elements are computed for small RNA molecules of chain lengths up to 100. The results of RNA structure statistics depend strongly on the particular alphabet chosen. The statistical reference is compared with the data derived from natural RNA molecules with similar base frequencies. Secondary structures are represented as trees. Tree editing provides a quantitative measure for the distance, dt, between two structures. We compute a structure density surface as the conditional probability of two structures having distance t given that their sequences have distance h. This surface indicates that the vast majority of possible minimum free energy secondary structures occur within a fairly small neighborhood of any typical (random) sequence. Correlation lengths for secondary structures in their tree representations are computed from probability densities. They are appropriate measures for the complexity of the sequence-structure relation. The correlation length also provides a quantitative estimate for the mean sensitivity of structures to point mutations. © 1993 John Wiley & Sons, Inc.

2.
Armando D. Solis. Proteins 2015, 83(12):2198-2216
To reduce complexity, understand generalized rules of protein folding, and facilitate de novo protein design, the 20-letter amino acid alphabet is commonly reduced to a smaller alphabet by clustering amino acids based on some measure of similarity. In this work, we seek the optimal alphabet that preserves as much as possible of the structural information found in long-range (contact) interactions among amino acids in natively-folded proteins. We employ the Information Maximization Device, based on information theory, to partition the amino acids into well-defined clusters. Numbering from 2 to 19 groups, these optimal clusters of amino acids, while generated automatically, embody well-known properties of amino acids such as hydrophobicity/polarity, charge, size, and aromaticity, and are demonstrated to maintain the discriminative power of long-range interactions with minimal loss of mutual information. Our measurements suggest that reduced alphabets (of fewer than 10 letters) are able to capture virtually all of the information residing in native contacts and may be sufficient for fold recognition, as demonstrated by extensive threading tests. In an expansive survey of the literature, we observe that alphabets derived from various approaches (including those derived from physicochemical intuition, local structure considerations, and sequence alignments of remote homologs) fare consistently well in preserving contact interaction information, highlighting a convergence in the various factors thought to be relevant to the folding code. Moreover, we find that alphabets commonly used in experimental protein design are nearly optimal and are largely coherent with observations that have arisen in this work. Proteins 2015; 83:2198–2216. © 2015 Wiley Periodicals, Inc.

3.
What are the key building blocks that would have been needed to construct complex protein folds? This is an important issue for understanding protein folding mechanisms and guiding de novo protein design. Twenty naturally occurring amino acids and eight secondary structures constitute a 28-letter alphabet that determines folding kinetics and mechanism. Here we predict folding kinetic rates of proteins from many reduced alphabets. We find that a reduced alphabet of 10 letters achieves good correlation with folding rates, close to the one achieved by the full 28-letter alphabet. Many other reduced alphabets are not significantly correlated with folding rates. The finding suggests that not all amino acids and secondary structures are equally important for protein folding. The foldable sequence of a protein could be designed using at least 10 folding units, which can either promote or inhibit protein folding. Reducing alphabet cardinality without losing key folding kinetic information opens the door to potentially faster machine learning and data mining applications in protein structure prediction, sequence alignment and protein design. Proteins 2015; 83:631–639. © 2015 Wiley Periodicals, Inc.

4.
Comparisons within and between the human, mouse and rabbit immunoglobulin-kappa gene (J-C region) DNA sequences are carried out in terms of three two-letter nucleotide alphabets: (i) S-W alphabet (W = A or T; S = G or C); (ii) P-Q alphabet which distinguishes purines (P = A or G) from pyrimidines (Q = C or T); and (iii) a 'control' E-F alphabet (E = A or C; F = G or T). All statistically significant direct repeats within each of the three sequences and all significant block identities (a set of consecutive matching letters) shared by two or more sequences are determined for each alphabet. By contrast to the S-W and E-F alphabets, the P-Q alphabet comparisons reveal an abundance of statistically significant block identities not seen at the nucleotide level. Various interpretations of these P-Q structures with respect to control and functional roles are considered.
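The three two-letter recodings defined in this abstract can be sketched in a few lines of Python. The letter mappings come directly from the abstract; the function names and the block-identity helper are illustrative assumptions. The helper only scans aligned positions and does not assess statistical significance, so it is a simplification of whatever procedure the study actually used.

```python
# Two-letter recodings as defined in the abstract above.
ALPHABETS = {
    "S-W": {"A": "W", "T": "W", "G": "S", "C": "S"},
    "P-Q": {"A": "P", "G": "P", "C": "Q", "T": "Q"},  # purine / pyrimidine
    "E-F": {"A": "E", "C": "E", "G": "F", "T": "F"},  # 'control' alphabet
}


def recode(seq, alphabet):
    """Translate a DNA string into one of the two-letter alphabets."""
    table = ALPHABETS[alphabet]
    return "".join(table[base] for base in seq.upper())


def longest_block_identity(a, b):
    """Longest run of consecutive matching letters at aligned positions.

    Hypothetical helper: aligned positions only, no significance test.
    """
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best


if __name__ == "__main__":
    s1, s2 = "ACGTAC", "GTACGT"  # made-up sequences differing only by transitions
    print(longest_block_identity(s1, s2))
    print(longest_block_identity(recode(s1, "P-Q"), recode(s2, "P-Q")))
```

The two made-up sequences share no aligned nucleotide, yet they become identical under the P-Q recoding, which is the kind of similarity that is invisible at the nucleotide level.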

5.
Algorithms predicting RNA secondary structures based on different folding criteria – minimum free energies (mfe), kinetic folding (kin), maximum matching (mm) – and different parameter sets are studied systematically. Two base pairing alphabets were used: the binary GC and the natural four-letter AUGC alphabet. Computed structures and free energies depend strongly on both the algorithm and the parameter set. Statistical properties, such as mean number of base pairs, mean numbers of stacks, mean loop sizes, etc., are much less sensitive to the choice of parameter set and even of algorithm. Some features of RNA secondary structures, such as structure correlation functions, shape space covering and neutral networks, seem to depend only on the base pairing logic (GC or AUGC alphabet). Received: 16 May 1996 / Accepted: 10 July 1996

6.
7.
Piccirilli et al. (Nature, Lond. 343, 33-37 (1990)) have shown experimentally that the replicatable introduction of new base pairs into the genetic alphabet is chemically feasible. The fact that our current genetic alphabet uses only two base pairs can be explained provided that this basic feature of organisms became fixed in an RNA world utilizing ribozymes rather than protein enzymes. The fitness of such ribo-organisms is determined by two factors: replication fidelity and overall catalytic efficiency (basic metabolic or growth rate). Replication fidelity is shown to decrease roughly exponentially, and catalytic efficiency is shown to increase with diminishing returns, with the number of letters for a fixed genome length; hence their product, i.e. fitness, gives rise to a set of values with an optimum. Under a wide range of parameter values the optimum rests at two base pairs. The chemical identity of the particular choice in our genetic alphabet can also be rationalized. This optimum is considered frozen, as currently the dominant catalysts are proteins rather than RNAs.

8.
Reduced amino acid alphabets are useful to understand molecular evolution as they reveal basal, shared properties of amino acids, which the structures and functions of proteins rely on. Several previous studies derived such reduced alphabets and linked them to the origin of life and biotechnological applications. However, all this previous work presupposes that only direct contacts of amino acids in native protein structures are relevant. We show in this work, using information-theoretical measures, that an appropriate alphabet reduction scheme is in fact a function of the maximum distance at which amino acids interact. Although for small distances our results agree with previous ones, we show how long-range interactions change the overall picture and prompt a revised understanding of the protein design process. Proteins 2010. © 2010 Wiley-Liss, Inc.

9.
Similarities and differences between amino acids define the rates at which they substitute for one another within protein sequences and the patterns by which these sequences form protein structures. However, there exist many ways to measure similarity, whether one considers the molecular attributes of individual amino acids, the roles that they play within proteins, or some nuanced contribution of each. One popular approach to representing these relationships is to divide the 20 amino acids of the standard genetic code into groups, thereby forming a simplified amino acid alphabet. Here, we develop a method to compare or combine different simplified alphabets, and apply it to 34 simplified alphabets from the scientific literature. We use this method to show that while different suggestions vary and agree in non-intuitive ways, they combine to reveal a consensus view of amino acid similarity that is clearly rooted in physico-chemistry.

10.
A new computational approach optimizes searches for reduced protein folding alphabets that use fewer than 20 types of amino acids. The predicted optimal five-letter alphabet happens to be in agreement with the suggestive results of a recent experiment, but whether highly reduced alphabets are sufficient for truly protein-like properties remains an open experimental question.

11.
Ketoacyl synthases are enzymes involved in fatty acid synthesis and can be classified into five families based on primary sequence similarity. Different families have different catalytic mechanisms. Developing cost-effective computational models to identify the family of a ketoacyl synthase will be helpful for enzyme engineering and for understanding the catalytic mechanisms of individual enzymes. In this work, a support vector machine-based method was developed to predict ketoacyl synthase family using the n-peptide composition of reduced amino acid alphabets. In jackknife cross-validation, the model based on the 2-peptide composition of a reduced amino acid alphabet of size 13 yielded the best overall accuracy of 96.44% with an average accuracy of 93.36%, which is superior to other state-of-the-art methods. This result suggests that the n-peptide compositions of reduced amino acid alphabets provide an efficient means for enzyme family classification and that the proposed model can be efficiently used for ketoacyl synthase family annotation.
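As a rough illustration of the feature described in this abstract, the following Python sketch maps a protein sequence onto a reduced alphabet and computes its n-peptide composition. The six-group reduction below is a hypothetical grouping chosen only for illustration; it is not the size-13 alphabet used in the study, which the abstract does not specify, and the function names are likewise assumptions.

```python
from collections import Counter
from itertools import product

# Hypothetical 6-group reduction (illustration only, NOT the study's alphabet).
# Each group is represented by its first letter.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "GP"]
REDUCE = {aa: group[0] for group in GROUPS for aa in group}
LETTERS = sorted(group[0] for group in GROUPS)


def reduce_sequence(seq):
    """Replace every amino acid by the representative of its group."""
    return "".join(REDUCE[aa] for aa in seq.upper())


def n_peptide_composition(seq, n=2):
    """Relative frequency of every possible n-peptide over the reduced alphabet."""
    reduced = reduce_sequence(seq)
    counts = Counter(reduced[i:i + n] for i in range(len(reduced) - n + 1))
    total = max(sum(counts.values()), 1)  # avoid division by zero for short input
    return {
        "".join(p): counts["".join(p)] / total
        for p in product(LETTERS, repeat=n)
    }


if __name__ == "__main__":
    features = n_peptide_composition("MKVLD", n=2)
    print(len(features))  # one feature per possible reduced dipeptide
```

With k groups the dipeptide vector has k² entries (36 here, 169 for an alphabet of size 13, against 400 for the full alphabet), and such a fixed-length vector is the kind of input a support vector machine expects.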

12.
In this study, we have calculated distances between genomes based on our previously developed compositional spectra (CS) analysis. The study was conducted using genomes of 39 species of Eukarya, Eubacteria, and Archaea. Based on CS distances, we produced two different consensus dendrograms for four- and two-letter (purine-pyrimidine) alphabets. A comparison of the structure obtained using the purine-pyrimidine alphabet with the standard three-kingdom (3K) scheme reveals substantial similarity. Surprisingly, this is not the case when the same procedure is based on the four-letter alphabet. In this situation, we also found three main clusters, but different from those in the 3K scheme. In particular, one of the clusters includes Eukarya, thermophilic bacteria and a part of the considered Archaea species. We speculate that the key factor in the last classification (based on the A-T-G-C alphabet) is related to ecology: two ecological parameters, temperature and oxygen, distinctly explain the clustering revealed by compositional spectra in the four-letter alphabet. Therefore, we assume that this result reflects two interdependent processes: evolutionary divergence and superimposed ecological convergence of the genomes, although another process, horizontal transfer, cannot be excluded as an important contributing factor.

13.

Background

Transfer RNA (tRNA) is the means by which the cell translates DNA sequence into protein according to the rules of the genetic code. A credible proposition is that tRNA was formed from the duplication of an RNA hairpin half the length of the contemporary tRNA molecule, with the point at which the hairpins were joined marked by the canonical intron insertion position found today within tRNA genes. If these hairpins possessed a 3'-CCA terminus with different combinations of stem nucleotides (the ancestral operational RNA code), specific aminoacylation and perhaps participation in some form of noncoded protein synthesis might have occurred. However, the identity of the first tRNA and the initial steps in the origin of the genetic code remain elusive.

Results

Here we show evidence that glycine tRNA was the first tRNA, as revealed by a vestigial imprint in the anticodon loop sequences of contemporary descendants. This provides a plausible mechanism for the missing first step in the origin of the genetic code. In 448 of 466 glycine tRNA gene sequences from bacteria, archaea and eukaryote cytoplasm analyzed, CCA occurs immediately upstream of the canonical intron insertion position, suggesting the first anticodon (NCC for glycine) has been captured from the 3'-terminal CCA of one of the interacting hairpins as a result of an ancestral ligation.

Conclusion

That this imprint (including the second and third nucleotides of the glycine tRNA anticodon) has been retained through billions of years of evolution suggests Crick's 'frozen accident' hypothesis has validity for at least this very first step at the dawn of the genetic code.

Reviewers

This article was reviewed by Dr Eugene V. Koonin, Dr Rob Knight and Dr David H Ardell.

14.
Li T, Fan K, Wang J, Wang W. Protein Engineering 2003, 16(5):323-330
It is well known that there are some similarities among various naturally occurring amino acids. Thus, the complexity in protein systems could be reduced by sorting similar amino acids into groups, and protein sequences can then be simplified by reduced alphabets. This paper discusses how to group similar amino acids and whether there is a minimal amino acid alphabet by which proteins can be folded. Various reduced alphabets are obtained by retaining the maximal information for the simplified protein sequence compared with the parent sequence using global sequence alignment. With these reduced alphabets and simplified similarity matrices, we achieve recognition of the protein fold based on the similarity score of the sequence alignment. The coverage in dataset SCOP40, defined as the ratio of the number of homologous pairs detected by the program BLAST to the number marked by SCOP40, is obtained for various levels of reduction of the amino acid types. For the reduced alphabets containing 10 types of amino acids, the ability to detect distantly related folds remains almost at the same level as that of the alphabet of 20 types of amino acids, which implies that 10 types of amino acids may be the degree of freedom for characterizing the complexity in proteins.

15.
Enzyme design and engineering strategies rely almost exclusively on nature's alphabet of twenty canonical amino acids. Recent years have seen the emergence of powerful genetic code expansion methods that allow hundreds of structurally diverse amino acids to be installed into proteins in a site-selective manner. Here, we will highlight how the availability of an expanded alphabet of amino acids has opened new avenues in enzyme engineering research. Genetically encoded noncanonical amino acids have provided new tools to probe complex enzyme mechanisms, improve biocatalyst activity and stability, and most ambitiously to design enzymes with new catalytic mechanisms that would be difficult to access within the constraints of the genetic code. We anticipate that the studies highlighted in this article, coupled with the continuing advancements in genetic code expansion technology, will promote the widespread use of noncanonical amino acids in biocatalysis research in the coming years.

16.

Background  

This paper proposes an optimization approach for producing reduced alphabets for peptide classification, using a Genetic Algorithm. The classification task is performed by a multi-classifier system where each classifier (Linear or Radial Basis Function Support Vector Machine) is trained using features extracted by different reduced alphabets. Each alphabet is constructed by a Genetic Algorithm whose objective function is the maximization of the area under the ROC curve obtained in several classification problems.

17.
18.
Karchin R, Cline M, Karplus K. Proteins 2004, 55(3):508-518
Residue burial, which describes a protein residue's exposure to solvent and neighboring atoms, is key to protein structure prediction, modeling, and analysis. We assessed 21 alphabets representing residue burial, according to their predictability from amino acid sequence, conservation in structural alignments, and utility in one fold-recognition scenario. This follows upon our previous work in assessing nine representations of backbone geometry (ref. 1). The alphabet found to be most effective overall has seven states and is based on a count of Cβ atoms within a 14 Å-radius sphere centered at the Cβ of a residue of interest. When incorporated into a hidden Markov model (HMM), this alphabet gave us a 38% performance boost in fold recognition and 23% in alignment quality.

19.

Background  

We investigate automated and generic alphabet reduction techniques for protein structure prediction datasets. Reducing alphabet cardinality without losing key biochemical information opens the door to potentially faster machine learning, data mining and optimization applications in structural bioinformatics. Furthermore, reduced but informative alphabets often result in, e.g., more compact and human-friendly classification/clustering rules. In this paper we propose a robust and sophisticated alphabet reduction protocol based on mutual information and state-of-the-art optimization techniques.
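The idea of scoring a candidate reduced alphabet by mutual information can be sketched as follows. This is a minimal sketch under stated assumptions: the residues, the buried/exposed labels and both groupings are made-up toy data, the function names are hypothetical, and the optimization step that searches over groupings (the core of the protocol described in the abstract) is not shown.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in bits, estimated from paired observations."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def apply_grouping(residues, grouping):
    """Map each residue to its group label (grouping: residue -> label)."""
    return [grouping[r] for r in residues]


# Toy data (hypothetical): residue letters and a buried/exposed label each.
residues = list("LLKKVVDD")
labels = list("BBEEBBEE")

# Two hypothetical 2-letter reductions of the 4 residue types seen above.
informative = {"L": "h", "V": "h", "K": "p", "D": "p"}
uninformative = {"L": "a", "K": "a", "V": "b", "D": "b"}

if __name__ == "__main__":
    print(mutual_information(residues, labels))
    print(mutual_information(apply_grouping(residues, informative), labels))
    print(mutual_information(apply_grouping(residues, uninformative), labels))
```

On this toy data the first grouping keeps all of the information about the label while the second keeps none, even though both have the same cardinality. Grouping letters can never increase the mutual information, so the score of the full alphabet serves as an upper bound when comparing reductions.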

20.
The question of whether the size and make-up of the natural nucleotide alphabet is a consequence of selection pressure, or simply a frozen accident, is one of the fundamental questions of biology. Nucleotide replication is essentially an information transmission phenomenon, and so it seems reasonable to explore the issue from the perspective of theoretical computer science, and of error-coding theory in particular. In this analysis it is shown that the essential recognition features of nucleotides may be naturally expressed as 4-digit binary numbers, capturing the hydrogen acceptor/donor patterns (3 bits) and the purine/pyrimidine feature (1 bit). Optimal alphabets consist of nucleotides in which the purine/pyrimidine feature is related to the acceptor/donor pattern as a parity bit. Numerically interpreted, such alphabets correspond to parity check codes, simple but effective error-resistant structures. The natural alphabet appears to be an adaptation of one of two optimal solutions, constrained to its present size and composition by a combination of chemical and coding-theory factors.
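The error-resistance of a parity check code can be illustrated with a short Python sketch. It treats the 3-bit patterns abstractly: which acceptor/donor pattern belongs to which nucleotide is not stated in the abstract and is not assumed here, and the choice of even parity and all function names are illustrative assumptions.

```python
from itertools import combinations, product


def with_parity(pattern):
    """Append one bit so the 4-bit codeword has an even number of 1s."""
    return pattern + (sum(pattern) % 2,)


def hamming(a, b):
    """Number of positions at which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))


# All 3-bit patterns, each extended by its parity bit.
codewords = [with_parity(p) for p in product((0, 1), repeat=3)]

# Smallest distance between any two distinct codewords.
min_distance = min(hamming(a, b) for a, b in combinations(codewords, 2))


def is_valid(word):
    """A 4-bit word belongs to the code only if its parity is even."""
    return sum(word) % 2 == 0


if __name__ == "__main__":
    print(len(codewords), min_distance)
```

Because any two codewords differ in at least two positions, flipping a single bit of a codeword always yields a word outside the code, so a single recognition error is detectable rather than silently producing another valid letter.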
