首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We present an efficient algorithm for statistical multiple alignment based on the TKF91 model of Thorne, Kishino, and Felsenstein (1991) on an arbitrary k-leaved phylogenetic tree. The existing algorithms use a hidden Markov model approach, which requires at least O( radical 5(k)) states and leads to a time complexity of O(5(k)L(k)), where L is the geometric mean sequence length. Using a combinatorial technique reminiscent of inclusion/exclusion, we are able to sum away the states, thus improving the time complexity to O(2(k)L(k)) and considerably reducing memory requirements. This makes statistical multiple alignment under the TKF91 model a definite practical possibility in the case of a phylogenetic tree with a modest number of leaves.  相似文献   

2.
The insertion-deletion model developed by Thorne, Kishino and Felsenstein (1991, J. Mol. Evol., 33, 114–124; the TKF91 model) provides a statistical framework of two sequences. The statistical alignment of a set of sequences related by a star tree is a generalization of this model. The known algorithm computes the probability of a set of such sequences in O(l 2k ) time, where l is the geometric mean of the sequence lengths and k is the number of sequences. An improved algorithm is presented whose running time is only O(22k l k).  相似文献   

3.
Using evolutionary Expectation Maximization to estimate indel rates   总被引:4,自引:0,他引:4  
MOTIVATION: The Expectation Maximization (EM) algorithm, in the form of the Baum-Welch algorithm (for hidden Markov models) or the Inside-Outside algorithm (for stochastic context-free grammars), is a powerful way to estimate the parameters of stochastic grammars for biological sequence analysis. To use this algorithm for multiple-sequence evolutionary modelling, it would be useful to apply the EM algorithm to estimate not only the probability parameters of the stochastic grammar, but also the instantaneous mutation rates of the underlying evolutionary model (to facilitate the development of stochastic grammars based on phylogenetic trees, also known as Statistical Alignment). Recently, we showed how to do this for the point substitution component of the evolutionary process; here, we extend these results to the indel process. RESULTS: We present an algorithm for maximum-likelihood estimation of insertion and deletion rates from multiple sequence alignments, using EM, under the single-residue indel model owing to Thorne, Kishino and Felsenstein (the 'TKF91' model). The algorithm converges extremely rapidly, gives accurate results on simulated data that are an improvement over parsimonious estimates (which are shown to underestimate the true indel rate), and gives plausible results on experimental data (coronavirus envelope domains). Owing to the algorithm's close similarity to the Baum-Welch algorithm for training hidden Markov models, it can be used in an 'unsupervised' fashion to estimate rates for unaligned sequences, or estimate several sets of rates for sequences with heterogenous rates. AVAILABILITY: Software implementing the algorithm and the benchmark is available under GPL from http://www.biowiki.org/  相似文献   

4.
Summary A maximum likelihood method for inferring evolutionary trees from DNA sequence data was developed by Felsenstein (1981). In evaluating the extent to which the maximum likelihood tree is a significantly better representation of the true tree, it is important to estimate the variance of the difference between log likelihood of different tree topologies. Bootstrap resampling can be used for this purpose (Hasegawa et al. 1988; Hasegawa and Kishino 1989), but it imposes a great computation burden. To overcome this difficulty, we developed a new method for estimating the variance by expressing it explicitly.The method was applied to DNA sequence data from primates in order to evaluate the maximum likelihood branching order among Hominoidea. It was shown that, although the orangutan is convincingly placed as an outgroup of a human and African apes clade, the branching order among human, chimpanzee, and gorilla cannot be determined confidently from the DNA sequence data presently available when the evolutionary rate constancy is not assumed.  相似文献   

5.
An evolutionary model for maximum likelihood alignment of DNA sequences   总被引:16,自引:0,他引:16  
Summary Most algorithms for the alignment of biological sequences are not derived from an evolutionary model. Consequently, these alignment algorithms lack a strong statistical basis. A maximum likelihood method for the alignment of two DNA sequences is presented. This method is based upon a statistical model of DNA sequence evolution for which we have obtained explicit transition probabilities. The evolutionary model can also be used as the basis of procedures that estimate the evolutionary parameters relevant to a pair of unaligned DNA sequences. A parameter-estimation approach which takes into account all possible alignments between two sequences is introduced; the danger of estimating evolutionary parameters from a single alignment is discussed.  相似文献   

6.
Bernsel A  Viklund H  Elofsson A 《Proteins》2008,71(3):1387-1399
Compared with globular proteins, transmembrane proteins are surrounded by a more intricate environment and, consequently, amino acid composition varies between the different compartments. Existing algorithms for homology detection are generally developed with globular proteins in mind and may not be optimal to detect distant homology between transmembrane proteins. Here, we introduce a new profile-profile based alignment method for remote homology detection of transmembrane proteins in a hidden Markov model framework that takes advantage of the sequence constraints placed by the hydrophobic interior of the membrane. We expect that, for distant membrane protein homologs, even if the sequences have diverged too far to be recognized, the hydrophobicity pattern and the transmembrane topology are better conserved. By using this information in parallel with sequence information, we show that both sensitivity and specificity can be substantially improved for remote homology detection in two independent test sets. In addition, we show that alignment quality can be improved for the most distant homologs in a public dataset of membrane protein structures. Applying the method to the Pfam domain database, we are able to suggest new putative evolutionary relationships for a few relatively uncharacterized protein domain families, of which several are confirmed by other methods. The method is called Searcher for Homology Relationships of Integral Membrane Proteins (SHRIMP) and is available for download at http://www.sbc.su.se/shrimp/.  相似文献   

7.
A variety of analytical methods is available for branch testing in distance-based phylogenies. However, these methods are rarely used, possibly because the estimation of some of their statistics, especially the covariances, is not always feasible. We show that these difficulties can be overcome if some simplifying assumptions are made, namely distance independence. The weighted least-squares likelihood ratio test (WLS-LRT) we propose is easy to perform, using only the distances and some of their associated variances. If no variances are known, the use of the Felsenstein F-test, also based on weighted least squares, is discussed. Using simulated data and a data set of 43 mammalian mitochondrial sequences we demonstrate that the WLS-LRT performs as well as the generalized least-squares test, and indeed better for a large number of taxa data set. We thus show that the assumption of independence does not negatively affect the reliability or the accuracy of the least-squares approach. The results of the WLS-LRT are no worse than the results of the bootstrap methods, such as the Felsenstein bootstrap selection probability test and the Dopazo test. We also show that WLS-LRT can be applied in instances where other analytical methods are inappropriate. This point is illustrated by analyzing the relationships between human immunodeficiency virus type 1 (HIV-1) sequences isolated from various organs of different individuals.  相似文献   

8.
C Sander  R Schneider 《Proteins》1991,9(1):56-68
The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.  相似文献   

9.
The extent of homology between two sequences is generally expressed quantitatively by using a set of rules to assign to aligned residue pairs, numerical values that depend on some measure of the similarity between the pairs. A homology score for the alignment is obtained by summing the numerical value for each pair, and possibly also adding some appropriate penalty for insertions and deletions in the alignment. Whatever the set of rules used in arriving at the best homology score, the fundamental question that remains is whether the score implies that the sequences are in some way related. In the absence of biological criteria, such a question is generally given a preliminary answer by a Monte Carlo evaluation of the statistical significance of the score. By randomizing the two sequences a large number of times and noting the homology on each throw, the probability of a given score can be obtained for each sequence, and hence the likelihood that the observed homology could have occurred by chance for sequences with the given composition, is easily assessed. In addition to this statistical assessment, however, there is a second statistical question that must also be evaluated—particularly for short homologous stretches (<15 residues). When homology searches are performed against databases containing several hundred thousand residues, an important number to know is the probability that the homologous sequence would have occurred simply as the result of the large number of arrangements of short sequences that must be present in any large collection of disparate sequences. In this note we evaluate different versions of this question using different sets of rules. Our results indicate that homologies of roughly six or more residues out of ten are statistically significant and that the best procedure for finding homologies and determining their significance is with the use of the Dayhoff mutation matrix, assigning a weight of at least —8 as a penalty for gaps.  相似文献   

10.
Statistical and learning techniques are becoming increasingly popular for different tasks in bioinformatics. Many of the most powerful statistical and learning techniques are applicable to points in a Euclidean space but not directly applicable to discrete sequences such as protein sequences. One way to apply these techniques to protein sequences is to embed the sequences into a Euclidean space and then apply these techniques to the embedded points. In this work we introduce a biologically motivated sequence embedding, the homology kernel, which takes into account intuitions from local alignment, sequence homology, and predicted secondary structure. This embedding allows us to directly apply learning techniques to protein sequences. We apply the homology kernel in several ways. We demonstrate how the homology kernel can be used for protein family classification and outperforms state-of-the-art methods for remote homology detection. We show that the homology kernel can be used for secondary structure prediction and is competitive with popular secondary structure prediction methods. Finally, we show how the homology kernel can be used to incorporate information from homologous sequences in local sequence alignment.  相似文献   

11.
A fundamental task in sequence analysis is to calculate the probability of a multiple alignment given a phylogenetic tree relating the sequences and an evolutionary model describing how sequences change over time. However, the most widely used phylogenetic models only account for residue substitution events. We describe a probabilistic model of a multiple sequence alignment that accounts for insertion and deletion events in addition to substitutions, given a phylogenetic tree, using a rate matrix augmented by the gap character. Starting from a continuous Markov process, we construct a non-reversible generative (birth-death) evolutionary model for insertions and deletions. The model assumes that insertion and deletion events occur one residue at a time. We apply this model to phylogenetic tree inference by extending the program dnaml in phylip. Using standard benchmarking methods on simulated data and a new "concordance test" benchmark on real ribosomal RNA alignments, we show that the extended program dnamlepsilon improves accuracy relative to the usual approach of ignoring gaps, while retaining the computational efficiency of the Felsenstein peeling algorithm.  相似文献   

12.
Prediction of membrane segments in sequences of membrane proteins is well known and important problem. Accuracy of the solution of this problem by methods that don't use homology search in additional data bank can be improved. There is a lack of testing data in this area because of small amount of real structures of membrane proteins. In this work, we create a testing set of structural alignments of membrane proteins, in which positioning of the membrane segments reflects agreement of known 3D-structures of proteins in the alignment. We propose a method for predicting position of membrane segments in multiple alignment based on forward-backward algorithm from HMM theory. This method not only allows to predict positions of membrane segments but also forms probability membrane profile, which can be used in multiple alignment methods that take into account secondary structure information about sequences. Method is implemented in computer program available on the World-Wide Web site http://bioinf.fbb.msu.ru/fwdbck/. Proposed method provides results better than MEMSAT method, which is nearly only tool for prediction of membrane segments in multiple alignments without additional homology search.  相似文献   

13.
In this study, we investigate the extent to which techniques for homology modeling that were developed for water-soluble proteins are appropriate for membrane proteins as well. To this end we present an assessment of current strategies for homology modeling of membrane proteins and introduce a benchmark data set of homologous membrane protein structures, called HOMEP. First, we use HOMEP to reveal the relationship between sequence identity and structural similarity in membrane proteins. This analysis indicates that homology modeling is at least as applicable to membrane proteins as it is to water-soluble proteins and that acceptable models (with C alpha-RMSD values to the native of 2 A or less in the transmembrane regions) may be obtained for template sequence identities of 30% or higher if an accurate alignment of the sequences is used. Second, we show that secondary-structure prediction algorithms that were developed for water-soluble proteins perform approximately as well for membrane proteins. Third, we provide a comparison of a set of commonly used sequence alignment algorithms as applied to membrane proteins. We find that high-accuracy alignments of membrane protein sequences can be obtained using state-of-the-art profile-to-profile methods that were developed for water-soluble proteins. Improvements are observed when weights derived from the secondary structure of the query and the template are used in the scoring of the alignment, a result which relies on the accuracy of the secondary-structure prediction of the query sequence. The most accurate alignments were obtained using template profiles constructed with the aid of structural alignments. In contrast, a simple sequence-to-sequence alignment algorithm, using a membrane protein-specific substitution matrix, shows no improvement in alignment accuracy. We suggest that profile-to-profile alignment methods should be adopted to maximize the accuracy of homology models of membrane proteins.  相似文献   

14.
Motivation: The topic of this paper is the estimation of alignments and mutation rates based on stochastic sequence-evolution models that allow insertions and deletions of subsequences ('fragments') and not just single bases. The model we propose is a variant of a model introduced by Thorne et al., (J. Mol. Evol., 34, 3-16, 1992). The computational tractability of the model depends on certain restrictions in the insertion/deletion process; possible effects we discuss. Results: The process of fragment insertion and deletion in the sequence-evolution model induces a hidden Markov structure at the level of alignments and thus makes possible efficient statistical alignment algorithms. As an example we apply a sampling procedure to assess the variability in alignment and mutation parameter estimates for HVR1 sequences of human and orangutan, improving results of previous work. Simulation studies give evidence that estimation methods based on the proposed model also give satisfactory results when applied to data for which the restrictions in the insertion/deletion process do not hold. Availability: The source code of the software for sampling alignments and mutation rates for a pair of DNA sequences according to the fragment insertion and deletion model is freely available from http://www.math.uni-frankfurt.de/~stoch/software/mcmcsalut under the terms of the GNU public license (GPL, 2000).  相似文献   

15.
Globin-like蛋白质折叠类型识别   总被引:2,自引:0,他引:2  
蛋白质折叠类型识别是蛋白质结构研究的重要内容.以SCOP中的Globin-like折叠为研究对象,选择其中序列同一性小于25%的17个代表性蛋白质为训练集,采用机器和人工结合的办法进行结构比对,产生序列排比,经过训练得到了适合Globin-like折叠的概形隐马尔科夫模型(profile HMM)用于该折叠类型的识别.以Astrall.65中的68057个结构域样本进行检验,识别敏感度为99.64%,特异性100%.在折叠类型水平上,与Pfam和SUPERFAMILY单纯使用序列比对构建的HMM相比,所用模型由多于100个归为一个,仍然保持了很高的识别效果.结果表明:对序列相似度很低但具有相同折叠类型的蛋白质,可以通过引入结构比对的方法建立统一的HMM模型,实现高准确率的折叠类型识别.  相似文献   

16.
This work presents a novel pairwise statistical alignment method based on an explicit evolutionary model of insertions and deletions (indels). Indel events of any length are possible according to a geometric distribution. The geometric distribution parameter, the indel rate, and the evolutionary time are all maximum likelihood estimated from the sequences being aligned. Probability calculations are done using a pair hidden Markov model (HMM) with transition probabilities calculated from the indel parameters. Equations for the transition probabilities make the pair HMM closely approximate the specified indel model. The method provides an optimal alignment, its likelihood, the likelihood of all possible alignments, and the reliability of individual alignment regions. Human alpha and beta-hemoglobin sequences are aligned, as an illustration of the potential utility of this pair HMM approach.  相似文献   

17.
A total of 20%-25% of the proteins in a typical genome are helical membrane proteins. The transmembrane regions of these proteins have markedly different properties when compared with globular proteins. This presents a problem when homology search algorithms optimized for globular proteins are applied to membrane proteins. Here we present modifications of the standard Smith-Waterman and profile search algorithms that significantly improve the detection of related membrane proteins. The improvement is based on the inclusion of information about predicted transmembrane segments in the alignment algorithm. This is done by simply increasing the alignment score if two residues predicted to belong to transmembrane segments are aligned with each other. Benchmarking over a test set of G-protein-coupled receptor sequences shows that the number of false positives is significantly reduced in this way, both when closely related and distantly related proteins are searched for.  相似文献   

18.
We compared the predicted amino acid sequences of the vesicular stomatitis virus and rabies virus glycoproteins by using a computer program which provides an optimal alignment and a statistical significance for the match. Highly significant homology between these two proteins was detected, including identical positioning of one glycosylation site. A significant homology between the predicted amino acid sequences of vesicular stomatitis virus and influenza virus matrix proteins was also found.  相似文献   

19.
In order to study protein function and activity structural data is required. Since experimental structures are available for just a small fraction of all known protein sequences, computational methods such as protein modelling can provide useful information. Over the last few decades we have predicted, with homology modelling methods, the structures for numerous proteins. In this study we assess the structural quality and validity of the biological and medical interpretations and predictions made based on the models. All the models had correct scaffolding and were ranked at least as correct or good by numerical evaluators even though the sequence identity with the template was as low as 8%. The biological explanations made based on models were well in line with experimental structures and other experimental studies. Retrospective analysis of homology models indicates the power of protein modelling when made carefully from sequence alignment to model building and refinement. Modelling can be applied to studying and predicting different kinds of biological phenomena and according to our results it can be done so with success.  相似文献   

20.
Prediction of transmembrane (TM) segments of amino acid sequences of membrane proteins is a well-known and very important problem. The accuracy of its solution can be improved for approaches that do not use a homology search in an additional data bank. There is a lack of tested data in this area of research, because information on the structure of membrane proteins is scarce. In this work we created a test sample of structural alignments for membrane proteins. The TM segments of these proteins were mapped according to aligned 3D structures resolved for these proteins. A method for predicting TM segments in an alignment was developed on the basis of the forward-backward algorithm from the HMM theory. This method allows a user not only to predict TM segments, but also to create a probabilistic membrane profile, which can be employed in multiple alignment procedures taking the secondary structure of proteins into account. The method was implemented in a computer program available at http://bioinf.fbb.msu.ru/fwdbck/. It provides better results than the MEMSAT method, which is nearly the only tool predicting TM segments in multiple alignments, without a homology search.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号