Similar literature
1.
Prediction of protein secondary structure is an important step towards elucidating its three-dimensional structure and its function. This is a challenging problem in bioinformatics. Segmental semi-Markov models (SSMMs) are one of the best studied methods in this field. However, incorporating evolutionary information into these methods is somewhat difficult. On the other hand, systems of multiple neural networks (NNs) are powerful tools for multi-class pattern classification which can easily be applied to take these sorts of information into account. To overcome the weakness of SSMMs in prediction, in this work we consider an SSMM as a decision function on the outputs of three NNs that use multiple sequence alignment profiles. We consider four types of observations for the outputs of a neural network. The profile table related to each sequence is then reduced to a sequence of four observations. In order to predict the secondary structure of each amino acid we need to consider a decision function. We use an SSMM on the outputs of three neural networks. The proposed SSMM has discriminative power and weights over different dependency models for the outputs of the neural networks. The results show that the accuracy of our model in predictions, particularly for strands, is considerably increased.

2.
The profile hidden Markov model (PHMM) is widely used to assign protein sequences to their respective families. A major limitation of a PHMM is the assumption that, given the states, the observations (amino acids) are independent. To overcome this limitation, the dependency between amino acids in a multiple sequence alignment (MSA), which is the representative of a PHMM, can be appended to the PHMM. Because the sequences of amino acids in an MSA are biologically related, the one-by-one dependency between two amino acids can be considered. In other words, based on the MSA, the dependency between an amino acid and its corresponding amino acid located above can be combined with the PHMM. For this purpose, a new emission probability matrix which considers the one-by-one dependencies between amino acids is constructed. The parameters of a PHMM are of two types: transition and emission probabilities, which are usually estimated using an EM algorithm called the Baum-Welch algorithm. We have generalized the Baum-Welch algorithm using a similarity emission matrix constructed by integrating the new emission probability matrix with the common emission probability matrix. Then, the performance of similarity emission is discussed by applying it to the top twenty protein families in the Pfam database. We show that using the similarity emission in the Baum-Welch algorithm significantly outperforms the common Baum-Welch algorithm in the task of assigning protein sequences to protein families.

3.
Protein chemical shifts encode detailed structural information that is difficult and computationally costly to describe at a fundamental level. Statistical and machine learning approaches have been used to infer correlations between chemical shifts and secondary structure from experimental chemical shifts. These methods range from simple statistics such as the chemical shift index to complex methods using neural networks. Notwithstanding their higher accuracy, more complex approaches tend to obscure the relationship between secondary structure and chemical shift and often involve many parameters that need to be trained. We present hidden Markov models (HMMs) with Gaussian emission probabilities to model the dependence between protein chemical shifts and secondary structure. The continuous emission probabilities are modeled as conditional probabilities for a given amino acid and secondary structure type. Using these distributions as outputs of first‐ and second‐order HMMs, we achieve a prediction accuracy of 82.3%, which is competitive with existing methods for predicting secondary structure from protein chemical shifts. Incorporation of sequence‐based secondary structure prediction into our HMM improves the prediction accuracy to 84.0%. Our findings suggest that an HMM with correlated Gaussian distributions conditioned on the secondary structure provides an adequate generative model of chemical shifts. Proteins 2013; © 2012 Wiley Periodicals, Inc.

4.
Cook RJ, Ng ET, Meade MO. Biometrics 2000, 56(4):1109-1117
We describe a method for making inferences about the joint operating characteristics of multiple diagnostic tests applied longitudinally and in the absence of a definitive reference test. Log-linear models are adopted for the classification distributions conditional on the latent state, where inclusion of appropriate interaction terms accommodates conditional dependencies among the tests. A marginal likelihood is constructed by marginalizing over a latent two-state Markov process. Specific latent processes we consider include a first-order Markov model, a second-order Markov model, and a time-nonhomogeneous Markov model, although the method is described in full generality. Adaptations to handle missing data are described. Model diagnostics are considered based on the bootstrap distribution of conditional residuals. The methods are illustrated by application to a study of diffuse bilateral infiltrates among patients in intensive care wards in which the objective was to assess aspects of validity and clinical agreement.

5.
The amino-acid sequence of human glutathione reductase was measured according to two- and three-amino-acid sequences. The measured frequency and probability were compared with the predicted frequency and probability. Of 477 two-amino-acid sequences in human glutathione reductase, 176 (36.897%) and 90 (18.868%) sequences can be explained by the predicted frequency and the predicted probability, respectively, according to a purely random mechanism. Of 477 measured first-order Markov transition probabilities for the second amino acid in two-amino-acid sequences, 1 (0.210%) matches the predicted conditional probability and can therefore be explained by a purely random mechanism. No more than two-amino-acid sequences can be explained by a purely random mechanism.
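
The comparison described in this abstract can be illustrated with a minimal sketch. It assumes that the "predicted" values come from treating residues as independent draws from the amino acid composition; the abstract does not give the exact formulas, so the function name `dipeptide_stats` and the expected-count formula below are illustrative assumptions, not the authors' method.

```python
from collections import Counter


def dipeptide_stats(seq):
    """Compare observed two-residue counts with a composition-only expectation.

    Assumption (not from the source): under a purely random mechanism the
    expected count of pair "ab" is P(a) * P(b) * (number of pairs), and the
    expected probability of b following a is simply P(b).
    """
    n = len(seq)
    composition = Counter(seq)                           # residue counts
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))  # overlapping pairs
    starts = Counter(seq[:-1])                           # how often each residue starts a pair
    rows = {}
    for pair, observed in pairs.items():
        a, b = pair
        expected = (composition[a] / n) * (composition[b] / n) * (n - 1)
        transition = observed / starts[a]                # first-order Markov P(b | a)
        rows[pair] = (observed, expected, transition, composition[b] / n)
    return rows


if __name__ == "__main__":
    for pair, row in sorted(dipeptide_stats("AABA").items()):
        print(pair, row)
```

Each row holds the observed count, the composition-based expected count, the observed transition probability and the composition-based conditional probability, so a pair can be flagged when the observed and expected values agree.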

6.
A new method has been developed to compute the probability that each amino acid in a protein sequence is in a particular secondary structural element. Each of these probabilities is computed using the entire sequence and a set of predefined structural class models. This set of structural classes is patterned after Jane Richardson's taxonomy for the domains of globular proteins. For each structural class considered, a mathematical model is constructed to represent constraints on the pattern of secondary structural elements characteristic of that class. These are stochastic models having discrete state spaces (referred to as hidden Markov models by researchers in signal processing and automatic speech recognition). Each model is a mathematical generator of amino acid sequences; the sequence under consideration is modeled as having been generated by one model in the set of candidates. The probability that each model generated the given sequence is computed using a filtering algorithm. The protein is then classified as belonging to the structural class having the most probable model. The secondary structure of the sequence is then analyzed using a "smoothing" algorithm that is optimal for that structural class model. For each residue position in the sequence, the smoother computes the probability that the residue is contained within each of the defined secondary structural elements of the model. This method has two important advantages: (1) the probability of each residue being in each of the modeled secondary structural elements is computed using the totality of the amino acid sequence, and (2) these probabilities are consistent with prior knowledge of realizable domain folds as encoded in each model. As an example of the method's utility, we present its application to flavodoxin, a prototypical alpha/beta protein having a central beta-sheet, and to thioredoxin, which belongs to a similar structural class but shares no significant sequence similarity.
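
The filtering and smoothing steps mentioned in this abstract correspond, for a generic discrete hidden Markov model, to the scaled forward-backward recursion. The sketch below is a textbook version under assumed inputs (`init`, `trans` and `emit` as plain nested lists, observations as integer symbol indices); it is not the structural class models of the paper.

```python
import math


def forward_backward(obs, init, trans, emit):
    """Scaled forward-backward pass for a discrete HMM.

    obs   : list of observed symbol indices
    init  : init[i]     = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k | state i)
    Returns (posteriors, log_likelihood), where posteriors[t][i] is the
    smoothed probability of state i at position t given the whole sequence.
    """
    n = len(init)
    alphas, scales = [], []
    # Forward (filtering) pass, normalising at every step to avoid underflow.
    for t, symbol in enumerate(obs):
        if t == 0:
            a = [init[i] * emit[i][symbol] for i in range(n)]
        else:
            prev = alphas[-1]
            a = [emit[j][symbol] * sum(prev[i] * trans[i][j] for i in range(n))
                 for j in range(n)]
        c = sum(a)
        alphas.append([x / c for x in a])
        scales.append(c)
    # Backward pass, rescaled with the same per-step constants.
    betas = [[1.0] * n for _ in obs]
    for t in range(len(obs) - 2, -1, -1):
        nxt = obs[t + 1]
        b = [sum(trans[i][j] * emit[j][nxt] * betas[t + 1][j] for j in range(n))
             for i in range(n)]
        betas[t] = [x / scales[t + 1] for x in b]
    posteriors = [[alphas[t][i] * betas[t][i] for i in range(n)]
                  for t in range(len(obs))]
    log_likelihood = sum(math.log(c) for c in scales)
    return posteriors, log_likelihood
```

Running this once per candidate model and comparing the returned log-likelihoods gives a simple stand-in for the classification step; the posteriors of the winning model then play the role of the per-residue probabilities.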

7.
Longitudinal ordinal data are common in many scientific studies, including those of multiple sclerosis (MS), and are frequently modeled using Markov dependency. Several authors have proposed random-effects Markov models to account for heterogeneity in the population. In this paper, we go one step further and study prediction based on random-effects Markov models. In particular, we show how to calculate the probabilities of future events and confidence intervals for those probabilities, given observed data on the ordinal outcome and a set of covariates, and how to update them over time. We discuss the usefulness of depicting these probabilities for visualization and interpretation of model results and illustrate our method using data from a phase III clinical trial that evaluated the utility of interferon beta-1a (Avonex) for patients with relapsing-remitting MS.

8.
A sequence-coupled (Markov chain) model is proposed to predict the cleavage sites in proteins by proteases with extended specificity subsites. In addition to the probability of an amino acid occurring at each of these subsites, as observed from a training set of oligopeptides known to be cleavable by HIV protease, the conditional probabilities reflected by the neighbor-coupled effect along the subsite sequence are also taken into account. These conditional probabilities are derived from an expanded training set consisting of sufficiently large peptide sequences generated by the Monte Carlo sampling process. Very high accuracy was obtained in predicting protein cleavage sites by both HIV-1 and HIV-2 proteases. The new method provides a rapid and accurate means for analyzing the specificity of HIV protease, and hence can be used to help find effective inhibitors of HIV protease as potential drugs against AIDS. The principle of this method can also be used to study the specificity of any multisubsite enzyme.
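
A neighbor-coupled score of this kind can be sketched as a position-specific first-order Markov chain over the subsites. Note the swap: the abstract estimates the conditional probabilities from a Monte Carlo expanded training set, whereas this sketch simply smooths raw counts with a pseudocount. The names `train_chain` and `pseudocount` and the toy peptides are hypothetical.

```python
import math
from collections import Counter


def train_chain(peptides, alphabet, pseudocount=1.0):
    """Fit a position-specific first-order Markov chain to equal-length peptides.

    Returns a function giving the log-probability of a peptide as
    log P(r1) + sum_i log P_i(r_i | r_{i-1}).
    Assumption: additive (pseudocount) smoothing stands in for the
    Monte Carlo expansion of the training set used in the source.
    """
    length = len(peptides[0])
    k = len(alphabet)
    first = Counter(p[0] for p in peptides)
    # Counts for each neighbouring pair of subsites (i-1, i).
    pair_counts = [Counter((p[i - 1], p[i]) for p in peptides)
                   for i in range(1, length)]
    prev_counts = [Counter(p[i - 1] for p in peptides)
                   for i in range(1, length)]

    def log_prob(peptide):
        lp = math.log((first[peptide[0]] + pseudocount)
                      / (len(peptides) + pseudocount * k))
        for i in range(1, length):
            a, b = peptide[i - 1], peptide[i]
            lp += math.log((pair_counts[i - 1][(a, b)] + pseudocount)
                           / (prev_counts[i - 1][a] + pseudocount * k))
        return lp

    return log_prob


if __name__ == "__main__":
    score = train_chain(["AB", "AB", "BA"], alphabet="AB")
    print(score("AB"), score("AA"))
```

In practice one would train a second chain on non-cleavable peptides and rank candidates by the difference of the two log-probabilities; that step is left out here.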

9.
Most ab initio gene predictors use a probabilistic sequence model, typically a hidden Markov model, to combine separately trained models of genomic signals and content. By combining separate models of relevant genomic features, such gene predictors can exploit small training sets and incomplete annotations, and can be trained fairly efficiently. However, that type of piecewise training does not optimize prediction accuracy and has difficulty in accounting for statistical dependencies among different parts of the gene model. With genomic information being created at an ever-increasing rate, it is worth investigating alternative approaches in which many different types of genomic evidence, with complex statistical dependencies, can be integrated by discriminative learning to maximize annotation accuracy. Among discriminative learning methods, large-margin classifiers have become prominent because of the success of support vector machines (SVM) in many classification tasks. We describe CRAIG, a new program for ab initio gene prediction based on a conditional random field model with semi-Markov structure that is trained with an online large-margin algorithm related to multiclass SVMs. Our experiments on benchmark vertebrate datasets and on regions from the ENCODE project show significant improvements in prediction accuracy over published gene predictors that use intrinsic features only, particularly at the gene level and on genes with long introns.

10.
In this paper, we develop a segmental semi-Markov model (SSMM) for protein secondary structure prediction which incorporates multiple sequence alignment profiles with the purpose of improving the predictive performance. The segmental model is a generalization of the hidden Markov model where a hidden state generates segments of various length and secondary structure type. A novel parameterized model is proposed for the likelihood function that explicitly represents multiple sequence alignment profiles to capture the segmental conformation. Numerical results on benchmark data sets show that incorporating the profiles results in substantial improvements and the generalization performance is promising. By incorporating the information from long range interactions in β-sheets, this model is also capable of carrying out inference on contact maps. This is an important advantage of probabilistic generative models over the traditional discriminative approach to protein secondary structure prediction. The Web server of our algorithm and supplementary materials are available at http://public.kgi.edu/-wild/bsm.html.

11.
Simple hidden Markov models are proposed for predicting the secondary structure of a protein from its amino acid sequence. Since the length of protein conformation segments varies in a narrow range, we ignore the duration effect of the length distribution, and focus on the inclusion of short range correlations of residues and of conformation states in the models. Conformation-independent and -dependent amino acid coarse-graining schemes are designed for the models by means of proper mutual information. We compare models of different levels of complexity, and establish a practical model with a high prediction accuracy.

12.
Gene-gene dependency plays a very important role in systems biology, as it is crucial to understanding different biological mechanisms. Time-course microarray data provide a new platform for revealing the dynamic mechanisms of gene-gene dependencies. Existing interaction measures are mostly based on association measures, such as Pearson or Spearman correlations. However, it is well known that such measures can capture only linear or monotonic dependency relationships, not nonlinear combinatorial ones. Using hidden Markov models, we propose a new measure of pairwise dependency based on transition probabilities. The new dynamic interaction measure checks whether or not the joint transition kernel of the bivariate state variables is the product of two marginal transition kernels. This new measure enables us not only to evaluate the strength, but also to infer the details of gene dependencies. It reveals nonlinear combinatorial dependency structure in two aspects: between two genes and across adjacent time points. We conduct a bootstrap-based test for the presence or absence of dependency between every pair of genes. Simulation studies and real biological data analysis demonstrate the application of the proposed method. The software package is available upon request.
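
The factorization check at the heart of this measure can be illustrated on directly observed discrete state sequences. This is a simplification: the abstract works with hidden states inferred by an HMM and a bootstrap test, neither of which is reproduced here, and the names `transition_kernel` and `factorization_gap` are hypothetical.

```python
from collections import Counter


def transition_kernel(states):
    """Empirical first-order transition probabilities P(next | current)."""
    counts = Counter(zip(states[:-1], states[1:]))
    totals = Counter(states[:-1])
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}


def factorization_gap(x, y):
    """Largest deviation between the joint kernel and the product of marginals.

    Only joint transitions that were actually observed are compared.
    A gap of 0 means the observed joint transitions are consistent with
    x and y evolving independently; larger values suggest dependency.
    """
    joint = transition_kernel(list(zip(x, y)))
    kernel_x = transition_kernel(list(x))
    kernel_y = transition_kernel(list(y))
    return max(
        abs(p - kernel_x[(cur[0], nxt[0])] * kernel_y[(cur[1], nxt[1])])
        for (cur, nxt), p in joint.items()
    )


if __name__ == "__main__":
    x = [0, 0, 1, 1, 0, 0, 1, 1, 0]
    print(factorization_gap(x, x))  # y copies x, so the kernel does not factorize
```

A significance threshold for the gap would have to come from resampling, as in the bootstrap test the abstract describes.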

13.
We introduce a new approach to learning statistical models from multiple sequence alignments (MSA) of proteins. Our method, called GREMLIN (Generative REgularized ModeLs of proteINs), learns an undirected probabilistic graphical model of the amino acid composition within the MSA. The resulting model encodes both the position-specific conservation statistics and the correlated mutation statistics between sequential and long-range pairs of residues. Existing techniques for learning graphical models from MSA either make strong, and often inappropriate, assumptions about the conditional independencies within the MSA (e.g., Hidden Markov Models), or else use suboptimal algorithms to learn the parameters of the model. In contrast, GREMLIN makes no a priori assumptions about the conditional independencies within the MSA. We formulate and solve a convex optimization problem, thus guaranteeing that we find a globally optimal model at convergence. The resulting model is also generative, allowing for the design of new protein sequences that have the same statistical properties as those in the MSA. We perform a detailed analysis of covariation statistics on the extensively studied WW and PDZ domains and show that our method outperforms an existing algorithm for learning undirected probabilistic graphical models from MSA. We then apply our approach to 71 additional families from the PFAM database and demonstrate that the resulting models significantly outperform Hidden Markov Models in terms of predictive accuracy.

14.
The repeated amino-acid sequences in Citrobacter freundii beta-lactamase may be indispensable for its function, because such repetitions cannot simply be attributed to chance. To fully explore the functional units in Citrobacter freundii beta-lactamase, it may be necessary to analyse all the amino acid pairs, triplets, etc. along the sequence from one terminus to the other, counting their frequencies and calculating their probabilities. The amino-acid sequence of Citrobacter freundii beta-lactamase was counted according to two-, three- and four-amino-acid sequences. The counted frequency and probability were compared with the predicted frequency and probability. Amino acid sequences that appear in Citrobacter freundii beta-lactamase and can be predicted from its amino acid composition according to a purely random mechanism should not have been deliberately evolved and conserved. By contrast, amino acid sequences that appear in Citrobacter freundii beta-lactamase but cannot be predicted from its amino acid composition according to a purely random mechanism should have been deliberately evolved and conserved. Accordingly, 99 (26.053%) and 33 (8.684%) of 380 two-amino-acid sequences can be predicted by the frequency and probability according to a purely random mechanism. Some kinds of amino acid sequences that are absent from Citrobacter freundii beta-lactamase and can be predicted from its amino acid composition according to a purely random mechanism should not have been deliberately excluded from Citrobacter freundii beta-lactamase. By contrast, some kinds of amino acid sequences that are absent from Citrobacter freundii beta-lactamase and cannot be predicted from its amino acid composition according to a purely random mechanism should have been deliberately excluded from Citrobacter freundii beta-lactamase. Accordingly, 89 (48.370%) and 41 (22.283%) of 184 kinds of absent two-amino-acid sequences can be predicted by the frequency and probability according to a purely random mechanism, and 7236 (99.848%) of 7247 kinds of absent three-amino-acid sequences can be predicted by the frequency according to a purely random mechanism. The amino acids whose probabilities of following certain preceding amino acids can be predicted from the Citrobacter freundii beta-lactamase amino acid composition according to a purely random mechanism should not have been deliberately evolved and conserved; accordingly, 2 (0.526%) of 380 counted first-order Markov transition probabilities for the second amino acid in two-amino-acid sequences match the predicted conditional probabilities.

15.
Protein structure prediction methods typically use statistical potentials, which rely on statistics derived from a database of known protein structures. In the vast majority of cases, these potentials involve pairwise distances or contacts between amino acids or atoms. Although some potentials beyond pairwise interactions have been described, the formulation of a general multibody potential is seen as intractable due to the perceived limited amount of data. In this article, we show that it is possible to formulate a probabilistic model of higher order interactions in proteins, without arbitrarily limiting the number of contacts. The success of this approach is based on replacing a naive table‐based approach with a simple hierarchical model involving suitable probability distributions and conditional independence assumptions. The model captures the joint probability distribution of an amino acid and its neighbors, local structure and solvent exposure. We show that this model can be used to approximate the conditional probability distribution of an amino acid sequence given a structure using a pseudo‐likelihood approach. We verify the model by decoy recognition and site‐specific amino acid predictions. Our coarse‐grained model is compared to state‐of‐the‐art methods that use full atomic detail. This article illustrates how the use of simple probabilistic models can lead to new opportunities in the treatment of nonlocal interactions in knowledge‐based protein structure prediction and design. Proteins 2013; 81:1340–1350. © 2013 Wiley Periodicals, Inc.

16.
MOTIVATION: The Bayesian network approach is a framework which combines graphical representation and probability theory, which includes, as a special case, hidden Markov models. Hidden Markov models trained on amino acid sequence or secondary structure data alone have been shown to have potential for addressing the problem of protein fold and superfamily classification. RESULTS: This paper describes a novel implementation of a Bayesian network which simultaneously learns amino acid sequence, secondary structure and residue accessibility for proteins of known three-dimensional structure. An awareness of the errors inherent in predicted secondary structure may be incorporated into the model by means of a confusion matrix. Training and validation data have been derived for a number of protein superfamilies from the Structural Classification of Proteins (SCOP) database. Cross validation results using posterior probability classification demonstrate that the Bayesian network performs better in classifying proteins of known structural superfamily than a hidden Markov model trained on amino acid sequences alone.

17.
An increased availability of genotypes at marker loci has prompted the development of models that include the effect of individual genes. Selection based on these models is known as marker-assisted selection (MAS). MAS is known to be efficient especially for traits that have low heritability and non-additive gene action. BLUP methodology under non-additive gene action is not feasible for large inbred or crossbred pedigrees. It is easy to incorporate non-additive gene action in a finite locus model. Under such a model, the unobservable genotypic values can be predicted using the conditional mean of the genotypic values given the data. To compute this conditional mean, conditional genotype probabilities must be computed. In this study these probabilities were computed using iterative peeling and three Markov chain Monte Carlo (MCMC) methods: scalar Gibbs, blocking Gibbs, and a sampler that combines the Elston-Stewart algorithm with iterative peeling (ESIP). The performance of these four methods was assessed using simulated data. For pedigrees with loops, iterative peeling fails to provide accurate genotype probability estimates for some pedigree members. Also, computing time is exponentially related to the number of loci in the model. For MCMC methods, a linear relationship can be maintained by sampling genotypes one locus at a time. Of the three MCMC methods considered, ESIP performed the best while scalar Gibbs performed the worst.

18.
19.
20.
Understanding and predicting protein structures depend on the complexity and the accuracy of the models used to represent them. We have recently set up a Hidden Markov Model to optimally compress protein three-dimensional conformations into a one-dimensional series of letters of a structural alphabet. Such a model learns simultaneously the shape of representative structural letters describing the local conformation and the logic of their connections, i.e. the transition matrix between the letters. Here, we move one step further and report some evidence that such a model of protein local architecture also captures some accurate amino acid features. All the letters have specific and distinct amino acid distributions. Moreover, we show that words of amino acids can have significant propensities for some letters. Perspectives point towards the prediction of the series of letters describing the structure of a protein from its amino acid sequence.
