首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
We consider hidden Markov models as a versatile class of models for weakly dependent random phenomena. The topic of the present paper is likelihood-ratio testing for hidden Markov models, and we show that, under appropriate conditions, the standard asymptotic theory of likelihood-ratio tests is valid. Such tests are crucial in the specification of multivariate Gaussian hidden Markov models, which we use to illustrate the applicability of our general results. Finally, the methodology is illustrated by means of a real data set.  相似文献   

3.
Recent progress in structure determination techniques has led to a significant growth in the number of known membrane protein structures, and the first structural genomics projects focusing on membrane proteins have been initiated, warranting an investigation of appropriate bioinformatics strategies for optimal structural target selection for these molecules. What determines a membrane protein fold? How many membrane structures need to be solved to provide sufficient structural coverage of the membrane protein sequence space? We present the CAMPS database (Computational Analysis of the Membrane Protein Space) containing almost 45,000 proteins with three or more predicted transmembrane helices (TMH) from 120 bacterial species. This large set of membrane proteins was subjected to single‐linkage clustering using only sequence alignments covering at least 40% of the TMH present in a given family. This process yielded 266 sequence clusters with at least 15 members, roughly corresponding to membrane structural folds, sufficiently structurally homogeneous in terms of the variation of TMH number between individual sequences. These clusters were further subdivided into functionally homogeneous subclusters according to the COG (Clusters of Orthologous Groups) system as well as more stringently defined families sharing at least 30% identity. The CAMPS sequence clusters are thus designed to reflect three main levels of interest for structural genomics: fold, function, and modeling distance. We present a library of Hidden Markov Models (HMM) derived from sequence alignments of TMH at these three levels of sequence similarity. Given that 24 out of 266 clusters corresponding to membrane folds already have associated known structures, we estimate that 242 additional new structures, one for each remaining cluster, would provide structural coverage at the fold level of roughly 70% of prokaryotic membrane proteins belonging to the currently most populated families. Proteins 2006. © 2006 Wiley‐Liss, Inc.  相似文献   

4.
Paul Mach  Patrice Koehl 《Proteins》2013,81(9):1556-1570
It is well known that protein fold recognition can be greatly improved if models for the underlying evolution history of the folds are taken into account. The improvement, however, exists only if such evolutionary information is available. To circumvent this limitation for protein families that only have a small number of representatives in current sequence databases, we follow an alternate approach in which the benefits of including evolutionary information can be recreated by using sequences generated by computational protein design algorithms. We explore this strategy on a large database of protein templates with 1747 members from different protein families. An automated method is used to design sequences for these templates. We use the backbones from the experimental structures as fixed templates, thread sequences on these backbones using a self‐consistent mean field approach, and score the fitness of the corresponding models using a semi‐empirical physical potential. Sequences designed for one template are translated into a hidden Markov model‐based profile. We describe the implementation of this method, the optimization of its parameters, and its performance. When the native sequences of the protein templates were tested against the library of these profiles, the class, fold, and family memberships of a large majority (>90%) of these sequences were correctly recognized for an E‐value threshold of 1. In contrast, when homologous sequences were tested against the same library, a much smaller fraction (35%) of sequences were recognized; The structural classification of protein families corresponding to these sequences, however, are correctly recognized (with an accuracy of >88%). Proteins 2013; © 2013 Wiley Periodicals, Inc.  相似文献   

5.
6.
Over the years, there have been claims that evolution proceeds according to systematically different processes over different timescales and that protein evolution behaves in a non-Markovian manner. On the other hand, Markov models are fundamental to many applications in evolutionary studies. Apparent non-Markovian or time-dependent behavior has been attributed to influence of the genetic code at short timescales and dominance of physicochemical properties of the amino acids at long timescales. However, any long time period is simply the accumulation of many short time periods, and it remains unclear why evolution should appear to act systematically differently across the range of timescales studied. We show that the observed time-dependent behavior can be explained qualitatively by modeling protein sequence evolution as an aggregated Markov process (AMP): a time-homogeneous Markovian substitution model observed only at the level of the amino acids encoded by the protein-coding DNA sequence. The study of AMPs sheds new light on the relationship between amino acid-level and codon-level models of sequence evolution, and our results suggest that protein evolution should be modeled at the codon level rather than using amino acid substitution models.  相似文献   

7.
Yang Q  Sharp KA 《Proteins》2009,74(3):682-700
We describe a method for efficiently generating ensembles of alternate, all-atom protein structures that (a) differ significantly from the starting structure, (b) have good stereochemistry (bonded geometry), and (c) have good steric properties (absence of atomic overlap). The method uses reconstruction from a series of backbone framework structures that are obtained from a modified elastic network model (ENM) by perturbation along low-frequency normal modes. To ensure good quality backbone frameworks, the single force parameter ENM is modified by introducing two more force parameters to characterize the interaction between the consecutive carbon alphas and those within the same secondary structure domain. The relative stiffness of the three parameters is parameterized to reproduce B-factors, while maintaining good bonded geometry. After parameterization, violations of experimental Calpha-Calpha distances and Calpha-Calpha-Calpha pseudo angles along the backbone are reduced to less than 1%. Simultaneously, the average B-factor correlation coefficient improves to R = 0.77. Two applications illustrate the potential of the approach. (1) 102,051 protein backbones spanning a conformational space of 15 A root mean square deviation were generated from 148 nonredundant proteins in the PDB database, and all-atom models with minimal bonded and nonbonded violations were produced from this ensemble of backbone structures using the SCWRL side chain building program. (2) Improved backbone templates for homology modeling. Fifteen query sequences were each modeled on two targets. For each of the 30 target frameworks, dozens of improved templates could be produced In all cases, improved full atom homology models resulted, of which 50% could be identified blind using the D-Fire statistical potential.  相似文献   

8.
It is commonly believed that similarities between the sequences of two proteins infer similarities between their structures. Sequence alignments reliably recognize pairs of protein of similar structures provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30% and little is known then about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.  相似文献   

9.
Protein-protein interactions are governed by the change in free energy upon binding, ΔG = ΔH - TΔS. These interactions are often marginally stable, so one must examine the balance between the change in enthalpy, ΔH, and the change in entropy, ΔS, when investigating known complexes, characterizing the effects of mutations, or designing optimized variants. To perform a large-scale study into the contribution of conformational entropy to binding free energy, we developed a technique called GOBLIN (Graphical mOdel for BiomoLecular INteractions) that performs physics-based free energy calculations for protein-protein complexes under both side-chain and backbone flexibility. Goblin uses a probabilistic graphical model that exploits conditional independencies in the Boltzmann distribution and employs variational inference techniques that approximate the free energy of binding in only a few minutes. We examined the role of conformational entropy on a benchmark set of more than 700 mutants in eight large, well-studied complexes. Our findings suggest that conformational entropy is important in protein-protein interactions--the root mean square error (RMSE) between calculated and experimentally measured ΔΔGs decreases by 12% when explicit entropic contributions were incorporated. GOBLIN models all atoms of the protein complex and detects changes to the binding entropy along the interface as well as positions distal to the binding interface. Our results also suggest that a variational approach to entropy calculations may be quantitatively more accurate than the knowledge-based approaches used by the well-known programs FOLDX and Rosetta--GOBLIN's RMSEs are 10 and 36% lower than these programs, respectively.  相似文献   

10.
Prediction of transmembrane (TM) segments of amino acid sequences of membrane proteins is a well-known and very important problem. The accuracy of its solution can be improved for approaches that do not use a homology search in an additional data bank. There is a lack of tested data in this area of research, because information on the structure of membrane proteins is scarce. In this work we created a test sample of structural alignments for membrane proteins. The TM segments of these proteins were mapped according to aligned 3D structures resolved for these proteins. A method for predicting TM segments in an alignment was developed on the basis of the forward-backward algorithm from the HMM theory. This method allows a user not only to predict TM segments, but also to create a probabilistic membrane profile, which can be employed in multiple alignment procedures taking the secondary structure of proteins into account. The method was implemented in a computer program available at http://bioinf.fbb.msu.ru/fwdbck/. It provides better results than the MEMSAT method, which is nearly the only tool predicting TM segments in multiple alignments, without a homology search.  相似文献   

11.
Methods that predict the topology of helical membrane proteins are standard tools when analyzing any proteome. Therefore, it is important to improve the performance of such methods. Here we introduce a novel method, PRODIV-TMHMM, which is a profile-based hidden Markov model (HMM) that also incorporates the best features of earlier HMM methods. In our tests, PRODIV-TMHMM outperforms earlier methods both when evaluated on "low-resolution" topology data and on high-resolution 3D structures. The results presented here indicate that the topology could be correctly predicted for approximately two-thirds of all membrane proteins using PRODIV-TMHMM. The importance of evolutionary information for topology prediction is emphasized by the fact that compared with using single sequences, the performance of PRODIV-TMHMM (as well as two other methods) is increased by approximately 10 percentage units by the use of homologous sequences. On a more general level, we also show that HMM-based (or similar) methods perform superiorly to methods that focus mainly on identification of the membrane regions.  相似文献   

12.
How to characterize short protein sequences to make an effective connection to their functions is an unsolved problem. Here we propose to map the physicochemical properties of each amino acid onto unit spheres so that each protein sequence can be represented quantitatively. We demonstrate the usefulness of this representation by applying it to the prediction of cell penetrating peptides. We show that its combination with traditional composition features yields the best performance across different datasets, among several methods compared. For the convenience of users, a web server has been established for automatic calculations of the proposed features at http://biophy.dzu.edu.cn/SNumD/ .  相似文献   

13.
In nature, proteins partake in numerous protein– protein interactions that mediate their functions. Moreover, proteins have been shown to be physically stable in multiple structures, induced by cellular conditions, small ligands, or covalent modifications. Understanding how protein sequences achieve this structural promiscuity at the atomic level is a fundamental step in the drug design pipeline and a critical question in protein physics. One way to investigate this subject is to computationally predict protein sequences that are compatible with multiple states, i.e., multiple target structures or binding to distinct partners. The goal of engineering such proteins has been termed multispecific protein design. We develop a novel computational framework to efficiently and accurately perform multispecific protein design. This framework utilizes recent advances in probabilistic graphical modeling to predict sequences with low energies in multiple target states. Furthermore, it is also geared to specifically yield positional amino acid probability profiles compatible with these target states. Such profiles can be used as input to randomly bias high‐throughput experimental sequence screening techniques, such as phage display, thus providing an alternative avenue for elucidating the multispecificity of natural proteins and the synthesis of novel proteins with specific functionalities. We prove the utility of such multispecific design techniques in better recovering amino acid sequence diversities similar to those resulting from millions of years of evolution. We then compare the approaches of prediction of low energy ensembles and of amino acid profiles and demonstrate their complementarity in providing more robust predictions for protein design. Proteins 2010. © 2009 Wiley‐Liss, Inc.  相似文献   

14.
Zhao F  Li S  Sterner BW  Xu J 《Proteins》2008,73(1):228-240
Protein structure prediction without using templates (i.e., ab initio folding) is one of the most challenging problems in structural biology. In particular, conformation sampling poses as a major bottleneck of ab initio folding. This article presents CRFSampler, an extensible protein conformation sampler, built on a probabilistic graphical model Conditional Random Fields (CRFs). Using a discriminative learning method, CRFSampler can automatically learn more than ten thousand parameters quantifying the relationship among primary sequence, secondary structure, and (pseudo) backbone angles. Using only compactness and self-avoiding constraints, CRFSampler can efficiently generate protein-like conformations from primary sequence and predicted secondary structure. CRFSampler is also very flexible in that a variety of model topologies and feature sets can be defined to model the sequence-structure relationship without worrying about parameter estimation. Our experimental results demonstrate that using a simple set of features, CRFSampler can generate decoys with much higher quality than the most recent HMM model.  相似文献   

15.
Chaudhuri I  Söding J  Lupas AN 《Proteins》2008,71(2):795-803
beta-Propellers are toroidal folds, in which repeated, four-stranded beta-meanders are arranged in a circular and slightly tilted fashion, like the blades of a propeller. They are found in all domains of life, with a strong preponderance among eukaryotes. Propellers show considerable sequence diversity and are classified into six separate structural groups by the SCOP and CATH databases. Despite this diversity, they often show similarities across groups, not only in structure but also in sequence, raising the possibility of a common origin. In agreement with this hypothesis, most propellers group together in a cluster map of all-beta folds generated by sequence similarity, because of numerous pairwise matches, many of which are individually nonsignificant. In total, 45 of 60 propellers in the SCOP25 database, covering four SCOP folds, are clustered in this group and analysis with sensitive sequence comparison methods shows that they are similar at a level indicative of homology. Two mechanisms appear to contribute to the evolution of beta-propellers: amplification from single blades and subsequent functional differentiation. The observation of propellers with nearly identical blades in genomic sequences show that these mechanisms are still operating today.  相似文献   

16.
A method is described to construct sets of decoy models that can be used to generate a background score distribution for protein structure comparison. The models are derived directly from the two proteins being compared and retain all the essential properties of the structures, including length, density, shape and secondary structure composition but have different folds. As each comparison involves a pair of proteins of the same length, no explicit normalisation is required to adjust for the length of the proteins being compared. This allows substructure (or domain) matches to score almost equally to the comparison of isolated domains. A normalised probability measure was derived that allows joint family/family comparison. The method was applied to some of the CASP6 models for targets with new folds.  相似文献   

17.
In the past, 2 kinds of Markov models have been considered to describe protein sequence evolution. Codon-level models have been mechanistic with a small number of parameters designed to take into account features, such as transition-transversion bias, codon frequency bias, and synonymous-nonsynonymous amino acid substitution bias. Amino acid models have been empirical, attempting to summarize the replacement patterns observed in large quantities of data and not explicitly considering the distinct factors that shape protein evolution. We have estimated the first empirical codon model (ECM). Previous codon models assume that protein evolution proceeds only by successive single nucleotide substitutions, but our results indicate that model accuracy is significantly improved by incorporating instantaneous doublet and triplet changes. We also find that the affiliations between codons, the amino acid each encodes and the physicochemical properties of the amino acids are main factors driving the process of codon evolution. Neither multiple nucleotide changes nor the strong influence of the genetic code nor amino acids' physicochemical properties form a part of standard mechanistic models and their views of how codon evolution proceeds. We have implemented the ECM for likelihood-based phylogenetic analysis, and an assessment of its ability to describe protein evolution shows that it consistently outperforms comparable mechanistic codon models. We point out the biological interpretation of our ECM and possible consequences for studies of selection.  相似文献   

18.
19.
Nathoo FS  Dean CB 《Biometrics》2008,64(1):271-279
Summary .   Follow-up medical studies often collect longitudinal data on patients. Multistate transitional models are useful for analysis in such studies where at any point in time, individuals may be said to occupy one of a discrete set of states and interest centers on the transition process between states. For example, states may refer to the number of recurrences of an event, or the stage of a disease. We develop a hierarchical modeling framework for the analysis of such longitudinal data when the processes corresponding to different subjects may be correlated spatially over a region. Continuous-time Markov chains incorporating spatially correlated random effects are introduced. Here, joint modeling of both spatial dependence as well as dependence between different transition rates is required and a multivariate spatial approach is employed. A proportional intensities frailty model is developed where baseline intensity functions are modeled using parametric Weibull forms, piecewise-exponential formulations, and flexible representations based on cubic B-splines. The methodology is developed within the context of a study examining invasive cardiac procedures in Quebec. We consider patients admitted for acute coronary syndrome throughout the 139 local health units of the province and examine readmission and mortality rates over a 4-year period.  相似文献   

20.
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences. Proteins: 28:405–420, 1997. © 1997 Wiley-Liss, Inc.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号