Similar literature
20 similar articles found.
1.
Rashin AA  Rashin AH 《Proteins》2007,66(2):321-341
Two-dimensional lattice protein models were studied in two approximations of the conformational equilibrium to elucidate the role of surface hydrophobic groups in their stabilities. We demonstrate that stability of any compactly folded sequence is determined by its ability to "flip-flop" (refold) into alternative compact structures. The degree of stability required for folded sequences determines the average numbers of surface hydrophobic groups in stable lattice structures which are in good agreement with ratios of core to surface hydrophobic groups in real proteins. However, the average destabilization of the native structure per surface hydrophobic group is small (0-0.25 kcal/mol), often disagrees with the free energies derived from the ratios of core to surface hydrophobic groups in the same structures, and has a combinatorial entropic nature independent of the strength of structure stabilizing interactions. This suggests that the free energies derived from the core to surface ratios of hydrophobic groups in real proteins have little to do with folding thermodynamics. On average, sequences with highly stable native structures are the least hydrophobic. The results suggest that in designing novel stable proteins hydrophobic groups on the surface should be avoided to reduce the possibility of flip-flopping. The average stability of highly designable structures is never higher than that of some low designability structures, contrary to the accepted view. In the equilibrium approximation with alternative compact and partially unfolded structures, the requirement of high stability selects a unique 5 x 5 structure formed by only a few sequences, suggesting much stronger sequence selectivity than commonly thought.

2.
The complexity of the global organization and internal structure of motifs in higher eukaryotic organisms raises significant challenges for motif detection techniques. To achieve successful de novo motif detection, it is necessary to model the complex dependencies within and among motifs and to incorporate biological prior knowledge. In this paper, we present LOGOS, an integrated LOcal and GlObal motif Sequence model for biopolymer sequences, which provides a principled framework for developing, modularizing, extending and computing expressive motif models for complex biopolymer sequence analysis. LOGOS consists of two interacting submodels: HMDM, a local alignment model capturing biological prior knowledge and positional dependency within the motif local structure; and HMM, a global motif distribution model modeling frequencies and dependencies of motif occurrences. Model parameters can be fit using training motifs within an empirical Bayesian framework. A variational EM algorithm is developed for de novo motif detection. LOGOS improves over existing models that ignore biological priors and dependencies in motif structures and motif occurrences, and demonstrates superior performance on both semi-realistic test data and cis-regulatory sequences from yeast and Drosophila genomes with regard to sensitivity, specificity, flexibility and extensibility.

3.
RNA binding proteins recognize RNA targets in a sequence specific manner. Apart from the sequence, the secondary structure context of the binding site also affects the binding affinity. Binding sites are often located in single-stranded RNA regions and it was shown that the sequestration of a binding motif in a double-strand abolishes protein binding. Thus, it is desirable to include knowledge about RNA secondary structures when searching for the binding motif of a protein. We present the approach MEMERIS for searching sequence motifs in a set of RNA sequences and simultaneously integrating information about secondary structures. To abstract from specific structural elements, we precompute position-specific values measuring the single-strandedness of all substrings of an RNA sequence. These values are used as prior knowledge about the motif starts to guide the motif search. Extensive tests with artificial and biological data demonstrate that MEMERIS is able to identify motifs in single-stranded regions even if a stronger motif located in double-strand parts exists. The discovered motif occurrences in biological datasets mostly coincide with known protein-binding sites. This algorithm can be used for finding the binding motif of single-stranded RNA-binding proteins in SELEX or other biological sequence data.

4.
Integral membrane proteins usually have a predominantly alpha-helical secondary structure in which transmembrane segments are connected by membrane-extrinsic loops. Although a number of membrane protein structures have been reported in recent years, in most cases transmembrane topologies are initially predicted using a variety of theoretical techniques, including hydropathy analyses and the "positive inside" rule. We have explored the use of plots of the distribution of sequence similarity within families of membrane proteins comprising homeomorphic domains as a new method for the prediction/verification of the orientation of transmembrane topology models within certain families of multimeric respiratory chain enzymes. Within such proteins, analyses of sequence similarity can: i) identify heme and/or quinol binding sites; ii) identify potential electron-transfer conduits to/from prosthetic groups; and iii) locate regions defining potential subunit-subunit interactions. We mined emerging bioinformatic data for sequences of 11 families of membrane-intrinsic proteins that are part of multimeric respiratory chain complexes that also have membrane-extrinsic subunits. The sequences of each family were then aligned and the resultant alignments converted into a graphical format recording an empirical measure of the sequence similarity plotted versus residue position. In each case, this plot was compared to the predicted transmembrane topology. With one exception, there is a strong correlation between the existence

5.
Das C  Frankel AD 《Biopolymers》2003,70(1):80-85
Studies of RNA-binding peptides, and recent combinatorial library experiments in particular, have demonstrated that diverse peptide sequences and structures can be used to recognize specific RNA sites. The identification of large numbers of sequences capable of binding to a particular site has provided extensive phylogenetic information used to deduce basic principles of recognition. The high frequency at which RNA-binding peptides are found in large sequence libraries suggests plausible routes to evolve sequence-specific binders, facilitating the design of new binding molecules and perhaps reflecting characteristics of natural evolution.

6.
We have used a "Perceptron" algorithm to find a weighting function which distinguishes E. coli translational initiation sites from all other sites in a library of over 78,000 nucleotides of mRNA sequence. The "Perceptron" examined sequences as linear representations. The "Perceptron" is more successful at finding gene beginnings than our previous searches using "rules" (see previous paper). We note that the weighting function can find translational initiation sites within sequences that were not included in the training set.
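
To illustrate the kind of weighting function described in this abstract, the following is a minimal sketch of the classic perceptron update rule applied to one-hot encoded RNA windows. The window length, the toy sequences, and the function names (`one_hot`, `train_perceptron`, `predict`) are assumptions made for illustration; the abstract does not give the authors' encoding or training data.

```python
BASES = "ACGU"

def one_hot(seq):
    """Encode an RNA window as a flat 0/1 vector, four slots per position."""
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec

def train_perceptron(windows, labels, epochs=1000):
    """Classic perceptron rule: update weights only on misclassified windows."""
    w = [0] * (4 * len(windows[0]))
    bias = 0
    for _ in range(epochs):
        errors = 0
        for seq, y in zip(windows, labels):  # y is +1 (site) or -1 (non-site)
            x = one_hot(seq)
            score = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * score <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                bias += y
                errors += 1
        if errors == 0:  # every training window is classified correctly
            break
    return w, bias

def predict(w, bias, seq):
    """Return +1 if the weighted sum is positive, otherwise -1."""
    x = one_hot(seq)
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else -1

# Hypothetical toy data: positives carry AUG at window positions 2-4.
windows = ["GGAUGA", "CAAUGC", "UUAUGG", "GGACGA", "CAAUCC", "UUGUGG", "AUGCCC"]
labels = [1, 1, 1, -1, -1, -1, -1]
w, bias = train_perceptron(windows, labels)
```

Because the toy data are linearly separable, the update rule is guaranteed to converge; real initiation-site data need not be separable, in which case training simply stops after `epochs` passes.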

7.
We present simulations of non-enzymatic template-directed RNA synthesis that incorporate primer extension, ligation, melting, and reannealing. Strand growth occurs over multiple heating/cooling cycles, producing strands of several hundred nucleotides in length, starting with random oligomers of 4 to 10 nucleotides. A strand typically grows by only 1 or 2 nucleotides in each cycle. Therefore, a strand is copied from many different templates, not from one specific complementary strand. A diverse sequence mixture is produced, and there is no exact copying of sequences, even if single base additions are fully accurate (no mutational errors). It has been proposed that RNA systems may contain a virtual circular genome, in which sequences partially overlap in a way that is mutually catalytic. We show that virtual circles do not emerge naturally in our simulations, and that a system initiated with a virtual circle can only maintain itself if there are no mutational errors and there is no input of new sequences formed by random polymerization. Furthermore, if a virtual sequence and its complement contain repeated short words, new sequences can be produced that were not on the original virtual circle. Therefore the virtual circle sequence cannot maintain itself. Functional sequences with secondary structures contain complementary words on opposite sides of stem regions. Both these words are repeated in the complementary sequence; hence, functional sequences cannot be encoded on a virtual circle. Additionally, we consider sequence replication in populations of protocells. We suppose that functional ribozymes benefit the cell which contains them. Nevertheless, scrambling of sequences occurs, and the functional sequence is not maintained, even when under positive selection.

8.
The most probable secondary structure of an RNA molecule, given the nucleotide sequence, can be computed efficiently if a stochastic context-free grammar (SCFG) is used as the prior distribution of the secondary structure. The structures of some RNA molecules contain so-called pseudoknots. Allowing all possible configurations of pseudoknots is not compatible with context-free grammar models and makes the search for an optimal secondary structure NP-complete. We suggest a probabilistic model for RNA secondary structures with pseudoknots and present a Markov-chain Monte-Carlo Method for sampling RNA structures according to their posterior distribution for a given sequence. We favor Bayesian sampling over optimization methods in this context, because it makes the uncertainty of RNA structure predictions assessable. We demonstrate the benefit of our method in examples with tmRNA and also with simulated data. McQFold, an implementation of our method, is freely available from http://www.cs.uni-frankfurt.de/~metzler/McQFold.

9.
Most phylogenetic models of protein evolution assume that sites are independent and identically distributed. Interactions between sites are ignored, and the likelihood can be conveniently calculated as the product of the individual site likelihoods. The calculation considers all possible transition paths (also called substitution histories or mappings) that are consistent with the observed states at the terminals, and the probability density of any particular reconstruction depends on the substitution model. The likelihood is the integral of the probability density of each substitution history taken over all possible histories that are consistent with the observed data. We investigated the extent to which transition paths that are incompatible with a protein's three-dimensional structure contribute to the likelihood. Several empirical amino acid models were tested for sequence pairs of different degrees of divergence. When simulating substitutional histories starting from a real sequence, the structural integrity of the simulated sequences quickly disintegrated. This result indicates that simple models are clearly unable to capture the constraints on sequence evolution. However, when we sampled transition paths between real sequences from the posterior probability distribution according to these same models, we found that the sampled histories were largely consistent with the tertiary structure. This suggests that simple empirical substitution models may be adequate for interpolating changes between observed sequences during phylogenetic inference despite the fact that the models cannot predict the effects of structural constraints from first principles. This study is significant because it provides a quantitative assessment of the biological realism of substitution models from the perspective of protein structure, and it provides insight on the prospects for improving models of protein sequence evolution.
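
The site-independence assumption and the integral over substitution histories described in this abstract can be written out explicitly. The notation here is an assumption chosen for illustration and is not taken from the paper: $D$ is the observed alignment with $N$ sites, $D_i$ the data at site $i$, $\theta$ the substitution model parameters, $T$ the tree, and $\mathcal{H}(D_i)$ the set of substitution histories consistent with $D_i$.

```latex
L(\theta, T) = P(D \mid \theta, T) = \prod_{i=1}^{N} P(D_i \mid \theta, T),
\qquad
P(D_i \mid \theta, T) = \int_{h \in \mathcal{H}(D_i)} p(h \mid \theta, T)\, \mathrm{d}h
```

Under this reading, the question studied in the abstract is how much of the mass of the inner integral comes from histories $h$ that are incompatible with the protein's three-dimensional structure.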

10.
11.
Structure-based prediction of DNA target sites by regulatory proteins   Total citations: 15 (self-citations: 0; citations by others: 15)
Kono H  Sarai A 《Proteins》1999,35(1):114-131
Regulatory proteins play a critical role in controlling complex spatial and temporal patterns of gene expression in higher organisms, by recognizing multiple DNA sequences and regulating multiple target genes. Increasing amounts of structural data on the protein-DNA complex provide clues for the mechanism of target recognition by regulatory proteins. The analyses of the propensities of base-amino acid interactions observed in those structural data show that there is no one-to-one correspondence in the interaction, but clear preferences exist. On the other hand, the analysis of spatial distribution of amino acids around bases shows that even those amino acids with strong base preference such as Arg with G are distributed in a wide space around bases. Thus, amino acids with many different geometries can form a similar type of interaction with bases. The redundancy and structural flexibility in the interaction suggest that there are no simple rules in the sequence recognition, and its prediction is not straightforward. However, the spatial distributions of amino acids around bases indicate a possibility that the structural data can be used to derive empirical interaction potentials between amino acids and bases. Such information extracted from structural databases has been successfully used to predict amino acid sequences that fold into particular protein structures. We surmised that the structures of protein-DNA complexes could be used to predict DNA target sites for regulatory proteins, because determining DNA sequences that bind to a particular protein structure should be similar to finding amino acid sequences that fold into a particular structure. Here we demonstrate that the structural data can be used to predict DNA target sequences for regulatory proteins. Pairwise potentials that determine the interaction between bases and amino acids were empirically derived from the structural data. These potentials were then used to examine the compatibility between DNA sequences and the protein-DNA complex structure in a combinatorial "threading" procedure. We applied this strategy to the structures of protein-DNA complexes to predict DNA binding sites recognized by regulatory proteins. To test the applicability of this method in target-site prediction, we examined the effects of cognate and noncognate binding, cooperative binding, and DNA deformation on the binding specificity, and predicted binding sites in real promoters and compared with experimental data. These results show that target binding sites for several regulatory proteins are successfully predicted, and our data suggest that this method can serve as a powerful tool for predicting multiple target sites and target genes for regulatory proteins.
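
The combinatorial "threading" idea in this abstract can be sketched as follows: every candidate DNA sequence is scored against a fixed set of amino acid contacts using pairwise potentials, and candidates are ranked by a Z-score. The potential values, the contact list, and the function names are all hypothetical placeholders; the paper's empirically derived potentials are distance-dependent and are not reproduced here.

```python
import itertools
import statistics

# Hypothetical pairwise potentials (arbitrary units); lower is more favourable.
POTENTIAL = {
    "Arg": {"A": -0.2, "C": 0.1, "G": -1.5, "T": 0.0},
    "Asn": {"A": -1.0, "C": 0.2, "G": -0.1, "T": 0.1},
}
# Hypothetical contacts taken from a complex structure: (DNA position, residue).
CONTACTS = [(0, "Arg"), (1, "Asn"), (2, "Arg")]

def threading_energy(dna):
    """Sum the pairwise potentials over all base-amino acid contacts."""
    return sum(POTENTIAL[aa][dna[pos]] for pos, aa in CONTACTS)

def rank_sequences(length):
    """Thread every DNA sequence of the given length; return (Z-score, sequence) pairs."""
    scored = {"".join(s): threading_energy(s)
              for s in itertools.product("ACGT", repeat=length)}
    energies = list(scored.values())
    mean = statistics.mean(energies)
    sd = statistics.pstdev(energies)
    return sorted(((e - mean) / sd, seq) for seq, e in scored.items())

ranking = rank_sequences(3)
best_z, best_seq = ranking[0]
print(best_seq, round(best_z, 2))
```

Exhaustive enumeration is only feasible for short sites (4^n sequences); it is used here to keep the sketch simple.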

12.
Mapping the landscape of possible macromolecular polymer sequences to their fitness in performing biological functions is a challenge across the biosciences. A paradigm is the case of aptamers, nucleic acids that can be selected to bind particular target molecules. We have characterized the sequence-fitness landscape for aptamers binding allophycocyanin (APC) protein via a novel Closed Loop Aptameric Directed Evolution (CLADE) approach. In contrast to the conventional SELEX methodology, selection and mutation of aptamer sequences was carried out in silico, with explicit fitness assays for 44 131 aptamers of known sequence using DNA microarrays in vitro. We capture the landscape using a predictive machine learning model linking sequence features and function and validate this model using 5500 entirely separate test sequences, which give a very high observed versus predicted correlation of 0.87. This approach reveals a complex sequence-fitness mapping, and hypotheses for the physical basis of aptameric binding; it also enables rapid design of novel aptamers with desired binding properties. We demonstrate an extension to the approach by incorporating prior knowledge into CLADE, resulting in some of the tightest binding sequences.

13.
We describe an algorithm to design the primary structures for peptides which must have the strongest binding to a given molecular surface. This problem cannot be solved by a direct combinatorial sorting, because of an enormous number of possible primary and spatial structures. The approach to solve this problem is to describe a state of each residue by two variables: (i) amino acid type and (ii) 3-D coordinate, and to minimize binding energy over all these variables simultaneously. For short chains which have no long-range interactions within themselves, this minimization can be done easily and efficiently by dynamic programming. We also discuss the problem of how to estimate specificity of binding and how to deduce a sequence with maximal specificity for a given surface. We show that this sequence can be deduced by the same algorithm after some modification of energetic parameters.
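
The dynamic-programming step described in this abstract can be sketched with a Viterbi-style recursion over residue states, where each state is a pair (amino acid type, position). The one-dimensional row of sites, the energy functions, and all names below are hypothetical simplifications introduced for illustration; the paper works with 3-D coordinates and its own energy terms.

```python
import itertools
import math

def design_peptide(length, types, sites, surface_energy, link_energy):
    """Return (energy, path) minimizing total energy over residue states.

    A state is a pair (amino acid type, site). Only consecutive residues
    interact, so a Viterbi-style recursion finds the global minimum.
    """
    states = list(itertools.product(types, sites))
    # best[s] holds the cheapest chain found so far that ends in state s.
    best = {s: (surface_energy(*s), [s]) for s in states}
    for _ in range(length - 1):
        nxt = {}
        for s in states:
            prev_energy, prev_path = min(
                ((e + link_energy(p, s), path) for p, (e, path) in best.items()),
                key=lambda item: item[0],
            )
            nxt[s] = (prev_energy + surface_energy(*s), prev_path + [s])
        best = nxt
    return min(best.values(), key=lambda item: item[0])

# Hypothetical toy surface: sites 1 and 2 favour Leu (L), site 3 favours Lys (K).
PREFERRED = {1: "L", 2: "L", 3: "K"}

def surface_energy(aa, site):
    """Reward a residue type that matches the preference of its site."""
    return -1.0 if PREFERRED.get(site) == aa else 0.0

def link_energy(prev, cur):
    """Chain connectivity: the next residue must sit on the next site."""
    return 0.0 if cur[1] == prev[1] + 1 else math.inf

energy, path = design_peptide(3, "LK", range(5), surface_energy, link_energy)
sequence = "".join(aa for aa, _ in path)
print(sequence, energy)
```

The cost is proportional to the chain length times the square of the number of states, which is what makes the simultaneous search over sequence and placement tractable for short chains.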

14.
Gene Structure Prediction by Linguistic Methods   Total citations: 1 (self-citations: 0; citations by others: 1)
The higher-order structure of genes and other features of biological sequences can be described by means of formal grammars. These grammars can then be used by general-purpose parsers to detect and to assemble such structures by means of syntactic pattern recognition. We describe a grammar and parser for eukaryotic protein-encoding genes, which by some measures is as effective as current connectionist and combinatorial algorithms in predicting gene structures for sequence database entries. Parameters of the grammar rules are optimized for several different species, and mixing experiments are performed to determine the degree of species specificity and the relative importance of compositional, signal-based, and syntactic components in gene prediction.

15.
The Arabidopsis basic/helix-loop-helix transcription factor family   Total citations: 25 (self-citations: 0; citations by others: 25)

16.
Kumaran D  Maguire EA 《Neuron》2006,49(4):617-629
Sequence disambiguation, the process by which overlapping sequences are kept separate, has been proposed to underlie a wide range of memory capacities supported by the hippocampus, including episodic memory and spatial navigation. We used functional magnetic resonance imaging (fMRI) to explore the dynamic pattern of hippocampal activation during the encoding of sequences of faces. Activation in right posterior hippocampus, only during the encoding of overlapping sequences but not nonoverlapping sequences, was found to correlate robustly with a subject-specific behavioral index of sequence learning. Moreover, our data indicate that hippocampal activation in response to elements common to both sequences in the overlapping sequence pair may be particularly important for accurate sequence encoding and retrieval. Together, these findings support the conclusion that the human hippocampus is involved in the earliest stage of sequence disambiguation, when memory representations are in the process of being created, and provide empirical support for contemporary computational models of hippocampal function.

17.

Background

Translating a known metabolic network into a dynamic model requires reasonable guesses of all enzyme parameters. In Bayesian parameter estimation, model parameters are described by a posterior probability distribution, which scores the potential parameter sets, showing how well each of them agrees with the data and with the prior assumptions made.

Results

We compute posterior distributions of kinetic parameters within a Bayesian framework, based on integration of kinetic, thermodynamic, metabolic, and proteomic data. The structure of the metabolic system (i.e., stoichiometries and enzyme regulation) needs to be known, and the reactions are modelled by convenience kinetics with thermodynamically independent parameters. The parameter posterior is computed in two separate steps: a first posterior summarises the available data on enzyme kinetic parameters; an improved second posterior is obtained by integrating metabolic fluxes, concentrations, and enzyme concentrations for one or more steady states. The data can be heterogenous, incomplete, and uncertain, and the posterior is approximated by a multivariate log-normal distribution. We apply the method to a model of the threonine synthesis pathway: the integration of metabolic data has little effect on the marginal posterior distributions of individual model parameters. Nevertheless, it leads to strong correlations between the parameters in the joint posterior distribution, which greatly improve the model predictions by the following Monte-Carlo simulations.

Conclusion

We present a standardised method to translate metabolic networks into dynamic models. To determine the model parameters, evidence from various experimental data is combined and weighted using Bayesian parameter estimation. The resulting posterior parameter distribution describes a statistical ensemble of parameter sets; the parameter variances and correlations can account for missing knowledge, measurement uncertainties, or biological variability. The posterior distribution can be used to sample model instances and to obtain probabilistic statements about the model's dynamic behaviour.

18.
Ever since reversible protein phosphorylation was discovered, it has been clear that it plays a key role in the regulation of cellular processes. Proteins often undergo double phosphorylation, which can occur through two possible mechanisms: distributive or processive. Which phosphorylation mechanism is chosen for a particular cellular regulation bears biological significance, and it is therefore in our interest to understand these mechanisms. In this paper we study dynamics of the MEK/ERK phosphorylation. We employ a model selection algorithm based on approximate Bayesian computation to elucidate phosphorylation dynamics from quantitative time course data obtained from PC12 cells in vivo. The algorithm infers the posterior distribution over four proposed models for phosphorylation and dephosphorylation dynamics, and this distribution indicates the amount of support given to each model. We evaluate the robustness of our inferential framework by systematically exploring different ways of parameterizing the models and for different prior specifications. The models with the highest inferred posterior probability are the two models employing distributive dephosphorylation, whereas we are unable to choose decisively between the processive and distributive phosphorylation mechanisms.
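
Model selection by approximate Bayesian computation, as mentioned in this abstract, can be illustrated with the simplest rejection variant: draw a model and a parameter from their priors, simulate, and keep the draw only if the simulated time course lies close to the data. The two toy decay models, the uniform priors, the tolerance, and all names below are assumptions for illustration; they are not the four MEK/ERK models of the paper, and the authors' algorithm may well be a more elaborate ABC scheme.

```python
import math
import random

TIMES = [0.0, 0.5, 1.0, 1.5, 2.0]

def simulate(model, k):
    """Deterministic toy time courses for two competing decay models."""
    if model == "exponential":
        return [math.exp(-k * t) for t in TIMES]
    return [max(0.0, 1.0 - k * t) for t in TIMES]  # "linear" model

def distance(a, b):
    """Euclidean distance between two time courses."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def abc_model_posterior(observed, models, n_draws, eps, rng):
    """ABC rejection: posterior model probability = share of accepted draws."""
    accepted = {m: 0 for m in models}
    for _ in range(n_draws):
        model = rng.choice(models)   # uniform prior over models
        k = rng.uniform(0.0, 3.0)    # uniform prior over the rate constant
        if distance(simulate(model, k), observed) < eps:
            accepted[model] += 1
    total = sum(accepted.values())
    if total == 0:
        raise ValueError("no draws accepted; increase n_draws or eps")
    return {m: count / total for m, count in accepted.items()}

# Synthetic "observed" data generated from the exponential model with k = 1.
observed = simulate("exponential", 1.0)
posterior = abc_model_posterior(
    observed, ["exponential", "linear"], n_draws=5000, eps=0.1, rng=random.Random(1)
)
print(posterior)
```

The returned dictionary plays the role of the posterior distribution over models: the larger a model's share of accepted draws, the more support the data give it.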

19.
20.

Background

Random biological sequences are a topic of great interest in genome analysis since, according to a powerful paradigm, they represent the background noise from which the actual biological information must differentiate. Accordingly, the generation of random sequences has been investigated for a long time. Similarly, random objects of a more complicated structure like RNA molecules or proteins are of interest.

Results

In this article, we present a new general framework for deriving algorithms for the non-uniform random generation of combinatorial objects according to the encoding and probability distribution implied by a stochastic context-free grammar. Briefly, the framework extends on the well-known recursive method for (uniform) random generation and uses the popular framework of admissible specifications of combinatorial classes, introducing weighted combinatorial classes to allow for the non-uniform generation by means of unranking. This framework is used to derive an algorithm for the generation of RNA secondary structures of a given fixed size. We address the random generation of these structures according to a realistic distribution obtained from real-life data by using a very detailed context-free grammar (that models the class of RNA secondary structures by distinguishing between all known motifs in RNA structure). Compared to well-known sampling approaches used in several structure prediction tools (such as SFold) ours has two major advantages: Firstly, after a preprocessing step in time for the computation of all weighted class sizes needed, with our approach a set of m random secondary structures of a given structure size n can be computed in worst-case time complexity while other algorithms typically have a runtime in . Secondly, our approach works with integer arithmetic only which is faster and saves us from all the discomforting details of using floating point arithmetic with logarithmized probabilities.
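
The recursive method with weighted class sizes mentioned in this section can be sketched for a very small grammar of dot-bracket structures, S → ε | . S | ( S ) S, using integer arithmetic only. This is a simplified illustration and not the authors' algorithm: the grammar, the integer weights `w_unpaired` and `w_pair`, the minimum loop length, and the function names are all assumptions, and the sketch draws a fresh random number at each step rather than performing the unranking described in the abstract.

```python
import random
from functools import lru_cache

def make_sampler(w_unpaired=1, w_pair=1, min_loop=1):
    """Build (size, sample) for the grammar S -> eps | . S | ( S ) S."""

    @lru_cache(maxsize=None)
    def size(n):
        """Weighted number of structures of length n (exact integer)."""
        if n == 0:
            return 1
        total = w_unpaired * size(n - 1)
        for k in range(min_loop, n - 1):  # k bases enclosed by the first pair
            total += w_pair * size(k) * size(n - 2 - k)
        return total

    def sample(n, rng):
        """Draw a structure of length n with probability weight / size(n)."""
        if n == 0:
            return ""
        r = rng.randrange(size(n))
        first = w_unpaired * size(n - 1)
        if r < first:
            return "." + sample(n - 1, rng)
        r -= first
        for k in range(min_loop, n - 1):
            block = w_pair * size(k) * size(n - 2 - k)
            if r < block:
                return "(" + sample(k, rng) + ")" + sample(n - 2 - k, rng)
            r -= block
        raise AssertionError("unreachable: r is always below size(n)")

    return size, sample

size, sample = make_sampler()
structure = sample(30, random.Random(0))
print(size(30), structure)
```

With both weights set to 1 the draw is uniform over all structures of the given length; raising `w_pair` relative to `w_unpaired` shifts the distribution toward more heavily paired structures, which is the basic mechanism behind non-uniform generation from a weighted grammar.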

Conclusion

A number of experimental results show that our random generation method produces realistic output, at least with respect to the appearance of the different structural motifs. The algorithm is available as a webservice at http://wwwagak.cs.uni-kl.de/NonUniRandGen and can be used for generating random secondary structures of any specified RNA type. A link to download an implementation of our method (in Wolfram Mathematica) can be found there, too.
