Similar literature
 20 similar articles found
1.
Many phylogenetic inference methods are based on Markov models of sequence evolution. These are usually expressed in terms of a matrix (Q) of instantaneous rates of change, but some models of amino acid replacement, most notably the PAM model of Dayhoff and colleagues, were originally published only in terms of time-dependent probability matrices (P(t)). Previously published methods for deriving Q have used eigen-decomposition of an approximation to P(t). We show that the commonly used value of t is too large to ensure convergence of the estimates of elements of Q. We describe two simpler alternative methods for deriving Q from information such as that published by Dayhoff and colleagues. Neither of these methods requires approximation or eigen-decomposition. We identify the methods used to derive various versions of the Dayhoff model in current software, perform a comparison of existing and new implementations, and, to facilitate agreement among scientists using supposedly identical models, recommend that one of the new methods be used as a standard.
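The relationship between Q and P(t) mentioned above is P(t) = exp(Qt). The following is a minimal sketch, not the method of the abstract: it uses a hypothetical two-state rate matrix and a truncated Taylor series to build P(t), then shows that the naive first-order estimate (P(t) - I)/t only approaches Q when t is small. The function names `mat_exp` and `first_order_q` are illustrative assumptions.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(q, t, terms=30):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series.

    Adequate for small toy matrices only; real software would use a
    numerically robust routine.
    """
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        # term becomes (Q t)^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, qt)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


def first_order_q(p, t):
    """Naive estimate of Q from P(t): (P(t) - I) / t."""
    n = len(p)
    return [[(p[i][j] - (1.0 if i == j else 0.0)) / t for j in range(n)]
            for i in range(n)]


# Hypothetical symmetric two-state model with unit exchange rate.
Q = [[-1.0, 1.0], [1.0, -1.0]]
small_t_estimate = first_order_q(mat_exp(Q, 0.001), 0.001)
large_t_estimate = first_order_q(mat_exp(Q, 1.0), 1.0)
```

In this toy case the off-diagonal estimate is close to the true rate of 1 for t = 0.001 but falls well short of it for t = 1, which illustrates why the choice of t matters when Q is recovered from an approximation to P(t).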

2.
Aligning amino acid sequences: comparison of commonly used methods   Total citations: 5, self-citations: 0, citations by others: 5
We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of "weighting" in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), whereby only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical "jumbling test." This was not universally the case, however, and in some instances the simple UM was actually as good or better. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the different effectiveness of the four approaches in the two different sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.
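As a rough illustration of the Needleman and Wunsch approach with a unitary matrix (identities score 1, everything else 0), the sketch below computes a global alignment score by dynamic programming. The linear gap penalty of -1 and the function name `needleman_wunsch_score` are assumptions made for this example; the abstract does not state the gap treatment used in the study.

```python
def unitary_score(x, y):
    """Unitary matrix (UM): only identities are scored."""
    return 1 if x == y else 0


def needleman_wunsch_score(a, b, score=unitary_score, gap=-1):
    """Return the optimal global alignment score of sequences a and b.

    `gap` is a linear per-residue gap penalty (an assumption for this sketch).
    """
    rows, cols = len(a) + 1, len(b) + 1
    f = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        f[i][0] = i * gap
    for j in range(1, cols):
        f[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            f[i][j] = max(
                f[i - 1][j - 1] + score(a[i - 1], b[j - 1]),  # match/mismatch
                f[i - 1][j] + gap,                            # gap in b
                f[i][j - 1] + gap,                            # gap in a
            )
    return f[-1][-1]
```

Swapping `unitary_score` for a lookup into a log-odds table would give the weighted variants compared in the abstract, under the same dynamic programming recurrence.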

3.
Three Markov models (Dayhoff, Proportional and Poisson models; Hasegawa et al., 1992a) for amino acid substitution during evolution were used for maximum likelihood analyses of proteins coded for in mitochondrial DNA in estimating a phylogenetic tree among human, bovine and murids (mouse and rat) with chicken as an outgroup. It turned out that the Dayhoff model is the most appropriate model among the alternatives in approximating the amino acid substitutions of proteins coded for in mitochondrial DNA. In spite of the presence of the complete sequence data of mitochondrial genomes, we could not resolve the trichotomy among human, bovine and murids, probably because the time length separating two branching events among these three lines was short and because chicken is too distant from mammals to be used as an outgroup. It was suggested that the average substitution rate of amino acids coded for in mitochondrial DNA is lower along the bovine line than those along the human or murid lines. Advantages of amino acid sequence analysis over nucleotide sequence analysis in phylogenetic study were discussed.

4.
Several choices of amino acid substitution matrices are currently available for searching and alignment applications. These choices were evaluated using the BLAST searching program, which is extremely sensitive to differences among matrices, and the Prosite catalog, which lists members of hundreds of protein families. Matrices derived directly from either sequence-based or structure-based alignments of distantly related proteins performed much better overall than extrapolated matrices based on the Dayhoff evolutionary model. Similar results were obtained with the FASTA searching program. Improved performance appears to be general rather than family-specific, reflecting improved accuracy in scoring alignments. An implementation of a multiple matrix strategy was also tested. While no combination of three matrices performed as well as the single best matrix, BLOSUM 62, good results were obtained using a combination of sequence-based and structure-based matrices. This hybrid set of matrices is likely to be useful in certain situations. Our results illustrate the importance of matrix selection and value of a comprehensive approach to evaluation of protein comparison tools. © 1993 Wiley-Liss, Inc.

5.
6.
A sensitive procedure to compare amino acid sequences   Total citations: 17, self-citations: 0, citations by others: 17
Methods are discussed that provide sensitive criteria for detection of weak sequence homologies. They are based on the Dayhoff relatedness odds amino acid exchange matrix and certain residue physical characteristics. The search procedure uses several residue probe lengths in comparing all possible segments of two protein sequences, and search plots are shown with peak values displayed over the entire search length. Alignments are automatically effected using the highest search matrix values and without the necessity of gap penalties. Tests for significance are derived from actual protein sequences rather than a random shuffling procedure.

7.
The nucleotide sequence was determined for part of the Klebsiella pneumoniae nif gene cluster containing the 3' end of the nifD gene and the entire length of the nifK gene (encoding the alpha- and beta-subunits of the nitrogenase MoFe protein respectively), as well as the putative start of the nifY gene, a gene of as yet unknown function. A broad-based comparison of a number of MoFe protein alpha-subunits, beta-subunits and alpha-versus beta-subunits was carried out by the use of a computer program that simultaneously aligns three protein sequences according to the mutation data matrix of Dayhoff. A new kind of quantitative statistical measure of the similarity between the aligned sequences was obtained by calculating and plotting standardized similarity scores for overlapping segments along the aligned proteins. This calculation determines if a test sequence is similar to the consensus sequence of two other proteins that are known to be related to each other. The different beta-subunits compared were found to be significantly similar along most of their sequence, with the exception of two relatively short regions centred around residues 225 and 300, which contain insertions/deletions. The overall pattern of similarity between different alpha-subunits exhibits resemblance to the overall pattern of similarity between different beta-subunits, including regions of low similarity centred around residues 225 and 340. Comparison of alpha-subunits with beta-subunits showed that a region of significant similarity between the two types of subunits was located approximately between residues 120 and 180 in both subunits, but other parts of the proteins were only marginally similar. These results provide insights into likely tertiary structural features of the MoFe protein subunits.

8.
Software has been developed to allow the use of a number of parameters in the comparative representation of proteins in color and monochrome dot matrices. They include the parameters of partial specific volume, residue bulkiness, the mean area buried of side chains, seven additional hydropathy scales, mutability, polarity, secondary structure propensities, energy/residue, energy/atom, Rf values, the pKs at the N and C terminals, user-defined parameters and, if desired, randomly generated values. Many of these parameters can be combined in n space using an algorithm based on the Euclidean distance relationship in order to derive consensus values. The problem of scoring matched identities is addressed and the user may stipulate that they score 100 on a 0–100 scale or be determined from the Dayhoff MDM78 values with the rest of the matrix scaled appropriately. The PAMs matrix has been incorporated in such a way as to allow the user to stipulate various PAM values or an estimated percentage difference between two peptide sequences, which are then converted to log odds values. In addition, the similarity ring developed by Swanson and the matrix proposed by Bacon and Anderson have been adapted for use in the program. Color indices have been utilized to give a ‘third dimension’ to the projections, allowing the user to judge the degree of similarity of different regions which are represented. The software also provides for the plotting of nucleotides, in which case color is used to code individual nucleotides, purines versus pyrimidines, or similar colors are used to differentiate between A and T bases on the one hand, and G and C on the other. Received on December 31, 1987; accepted on May 18, 1988

9.
A simple and efficient method is described for analyzing quantitatively multiple protein sequence alignments and finding the most conserved blocks as well as the maxima of divergence within the set of aligned sequences. It consists of calculating the mean distance and the root-mean-square distance in each column of the multiple alignment, averaging the values in a window of defined length and plotting the results as a function of the position of the window. Due attention is paid to the presence of gaps in the columns. Several examples are provided, using the sequences of several cytochromes c, serine proteases, lysozymes and globins. Two distance matrices are compared, namely the matrix derived by Gribskov and Burgess from the Dayhoff matrix, and the Risler Structural Superposition Matrix. In each case, the divergence plots effectively point to the specific residues which are known to be essential for the catalytic activity of the proteins. In addition, the regions of maximum divergence are clearly delineated. Interestingly, they are generally observed in positions immediately flanking the most conserved blocks. The method should therefore be useful for delineating the peptide segments which will be good candidates for site-directed mutagenesis and for visualizing the evolutionary constraints along homologous polypeptide chains.
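The column-wise procedure described above can be sketched as follows. This is an illustrative outline only: the distance function here is a hypothetical 0/1 identity distance rather than either of the matrices named in the abstract, gaps are simply skipped (the abstract does not say how gaps were handled), and the names `column_divergence` and `windowed_profile` are assumptions.

```python
import math
from itertools import combinations


def identity_distance(x, y):
    """Hypothetical stand-in for a real residue distance matrix."""
    return 0.0 if x == y else 1.0


def column_divergence(column, dist=identity_distance):
    """Return (mean, root-mean-square) pairwise distance for one column.

    Gap characters ('-') are ignored, which is an assumption of this sketch.
    """
    residues = [r for r in column if r != "-"]
    values = [dist(x, y) for x, y in combinations(residues, 2)]
    if not values:
        return 0.0, 0.0
    mean = sum(values) / len(values)
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return mean, rms


def windowed_profile(alignment, dist=identity_distance, window=3):
    """Average the per-column mean distance over a sliding window."""
    columns = list(zip(*alignment))
    means = [column_divergence(c, dist)[0] for c in columns]
    return [sum(means[i:i + window]) / window
            for i in range(len(means) - window + 1)]
```

Plotting the returned profile against window position would give a divergence plot of the kind described, with low values marking conserved blocks.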

10.
Summary: A statistical analysis of the data tabulated in the Atlas of Protein Sequence and Structure 1972 indicates that the observed frequency of occurrence of the tripeptides Asn-X-Ser and Asn-X-Thr is approximately one third of that expected in eukaryotic proteins, but in prokaryotic proteins the observation agrees closely with expectation. Thus the lowered frequency of these tripeptides found by Hunt and Dayhoff is restricted to eukaryotic proteins. Of all the Asn-X-Ser/Thr sequences examined, those which contain covalently attached carbohydrates are found only in the extracellular proteins of eukaryotes. These observations are discussed in relation to the evolution of glycoproteins, which seems to have occurred in the ancestor of eukaryotes after the divergence from prokaryotes.
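An observed-versus-expected comparison of this kind can be illustrated with a short sketch. The expectation used here assumes residues occur independently at their overall frequencies in the sequence, which is an assumption of this example and not necessarily the statistical model of the study; the function name `sequon_observed_expected` is hypothetical.

```python
import re


def sequon_observed_expected(seq):
    """Count Asn-X-Ser/Thr tripeptides and estimate the expected count.

    Expected count assumes independent residues: each of the (n - 2)
    tripeptide positions matches with probability f(N) * (f(S) + f(T)).
    """
    # Lookahead so that overlapping tripeptides are all counted.
    observed = len(re.findall(r"(?=N.[ST])", seq))
    n = len(seq)
    if n < 3:
        return observed, 0.0
    f_n = seq.count("N") / n
    f_st = (seq.count("S") + seq.count("T")) / n
    expected = (n - 2) * f_n * f_st
    return observed, expected
```

Summing observed and expected counts over many proteins and taking their ratio would give a figure comparable in spirit to the "one third of that expected" reported above.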

11.
Deviations in the compositions of homologous proteins from a standard composition which minimizes these differences are characterized by two measures, the information gain and a generalization of the Dayhoff PAMs measure. It is shown that protein compositions cannot be understood as generated by a random process alone and that the proposed compositional analysis is sensitive enough to detect, in favourable cases, the existence of specific adaptive processes as well. For α-crystallin A a previously unknown adaptation of the composition is found and an explanation in terms of protein function is proposed.

12.
Summary: We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of weighting in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), whereby only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical jumbling test. This was not universally the case, however, and in some instances the simple UM was actually as good or better. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the different effectiveness of the four approaches in the two different sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.

13.
The amino acid sequences of proteins provide rich information for inferring distant phylogenetic relationships and for predicting protein functions. Estimating the rate matrix of residue substitutions from amino acid sequences is also important because the rate matrix can be used to develop scoring matrices for sequence alignment. Here we use a continuous time Markov process to model the substitution rates of residues and develop a Bayesian Markov chain Monte Carlo method for rate estimation. We validate our method using simulated artificial protein sequences. Because different local regions such as binding surfaces and the protein interior core experience different selection pressures due to functional or stability constraints, we use our method to estimate the substitution rates of local regions. Our results show that the substitution rates are very different for residues in the buried core and residues on the solvent-exposed surfaces. In addition, residues on the binding surfaces also have very different substitution rates from residues in the rest of the protein. Based on these findings, we further develop a method for protein function prediction by surface matching using scoring matrices derived from estimated substitution rates for residues located on the binding surfaces. We show with examples that our method is effective in identifying functionally related proteins that have overall low sequence identity, a task known to be very challenging.

14.
Summary: The nucleotide sequence of the gene encoding group A streptococcal pyrogenic exotoxin type A (SPE A) was determined by the dideoxy chain termination method. The first 30 residues of the translation product represented a hydrophobic signal peptide. The mature protein was 220 amino acids in length and had a molecular weight of 25,805. It has significant protein sequence homology with Staphylococcus aureus enterotoxin B but not with other proteins in the Dayhoff library.

15.
Summary: We have found ragweed allergen Ra3 to be related to the type 1 copper proteins; it is most closely related to stellacyanin and basic blue protein. The type 1 copper proteins form a diverse group of proteins, most of which are involved in electron transport. However, key amino acids believed to be involved in copper binding are absent from the allergen sequence; thus, the allergen is not likely to be functionally related to the type 1 copper proteins. We have grouped these proteins into one superfamily and we depict the relationships among them by an evolutionary tree. As indicated by this tree, an ancient gene duplication resulted in the divergence of plastocyanin from the line leading to basic blue protein, stellacyanin, and allergen Ra3. This paper is dedicated to the memory of Professor Margaret O. Dayhoff, whose contributions to the study of protein evolution made this investigation possible.

16.

Background  

Distance-based methods are popular for reconstructing evolutionary trees thanks to their speed and generality. A number of methods exist for estimating distances from sequence alignments, which often involve some sort of correction for multiple substitutions. The problem is to accurately estimate the number of true substitutions given an observed alignment. So far, the most accurate protein distance estimators have looked for the optimal matrix in a series of transition probability matrices, e.g. the Dayhoff series. The evolutionary distance between two aligned sequences is here estimated as the evolutionary distance of the optimal matrix. The optimal matrix can be found either by an iterative search for the Maximum Likelihood matrix, or by integration to find the Expected Distance. As a consequence, these methods are more complex to implement and computationally heavier than correction-based methods. Another problem is that the result may vary substantially depending on the evolutionary model used for the matrices. An ideal distance estimator should produce consistent and accurate distances independent of the evolutionary model used.
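For context, the simplest kind of correction-based estimator mentioned above is the Poisson correction, d = -ln(1 - p), where p is the observed fraction of differing sites. The sketch below is a generic textbook example and not the estimator proposed in this article; the function names and the choice to skip gapped positions are assumptions.

```python
import math


def p_distance(a, b):
    """Fraction of differing residues over ungapped aligned positions."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped positions to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


def poisson_distance(a, b):
    """Poisson-corrected distance: expected substitutions per site.

    Corrects the observed p-distance for multiple substitutions at a site,
    assuming equal rates for all sites and all replacements.
    """
    p = p_distance(a, b)
    if p >= 1.0:
        return math.inf  # the correction is undefined at saturation
    return -math.log(1.0 - p)
```

The corrected distance always exceeds the raw p-distance for p > 0, reflecting substitutions that are hidden when the same site changes more than once.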

17.
Identifying the residues in a protein that are involved in protein-protein interaction and identifying the contact matrix for a pair of interacting proteins are two computational tasks at different levels of an in-depth analysis of protein-protein interaction. Various methods for solving these two problems have been reported in the literature. However, the interacting residue prediction and contact matrix prediction were handled by and large independently in those existing methods, though intuitively good prediction of interacting residues will help with predicting the contact matrix. In this work, we developed a novel protein interacting residue prediction system, contact matrix-interaction profile hidden Markov model (CM-ipHMM), with the integration of contact matrix prediction and the ipHMM interaction residue prediction. We propose to leverage what is learned from the contact matrix prediction and utilize the predicted contact matrix as “feedback” to enhance the interaction residue prediction. The CM-ipHMM model showed significant improvement over the previous method that uses the ipHMM for predicting interaction residues only. It indicates that the downstream contact matrix prediction could help the interaction site prediction.

18.
Vicatos S  Reddy BV  Kaznessis Y 《Proteins》2005,58(4):935-949
In this work we present a novel correlated mutations analysis (CMA) method that is significantly more accurate than previously reported CMA methods. Calculation of correlation coefficients is based on physicochemical properties of residues (predictors) and not on substitution matrices. This results in reliable prediction of pairs of residues that are distant in protein sequence but proximal in its three dimensional tertiary structure. Multiple sequence alignments (MSA) containing a sequence of known structure for 127 families from PFAM database have been selected so that all major protein architectures described in CATH classification database are represented. Protein sequences in the selected families were filtered so that only those evolutionarily close to the target protein remain in the MSA. The average accuracy obtained for the alpha beta class of proteins was 26.8% of predicted proximal pairs with average improvement over random accuracy (IOR) of 6.41. Average accuracy is 20.6% for the mainly beta class and 14.4% for the mainly alpha class. The optimum correlation coefficient cutoff (cc cutoff) was found to be around 0.65. The first predictor, which correlates to hydrophobicity, provides the most reliable results. The other two predictors give good predictions which can be used in conjunction with those of the first one. When a stricter cc cutoff is chosen, the average accuracy increases significantly (38.76% for alpha beta class), but the trade-off is a smaller number of predictions. The use of solvent accessible area estimations for filtering false positives out of the predictions is promising.
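To make the idea of property-based column correlation concrete, here is a simplified sketch. It is not the authors' method: it uses a small subset of the Kyte-Doolittle hydropathy scale as a stand-in for their predictors, computes a plain Pearson correlation between two alignment columns, and borrows only the 0.65 cutoff from the abstract. All function names are hypothetical.

```python
import math

# Subset of the Kyte-Doolittle hydropathy scale, used only as an
# illustrative physicochemical property; the paper's predictors differ.
HYDROPATHY = {"A": 1.8, "D": -3.5, "E": -3.5, "I": 4.5,
              "K": -3.9, "L": 3.8, "V": 4.2}


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def column_correlation(alignment, i, j, scale=HYDROPATHY):
    """Correlate property values of columns i and j across all sequences."""
    xs = [scale[seq[i]] for seq in alignment]
    ys = [scale[seq[j]] for seq in alignment]
    return pearson(xs, ys)


def correlated_pairs(alignment, cutoff=0.65, scale=HYDROPATHY):
    """List column pairs (i, j), i < j, whose correlation reaches the cutoff."""
    length = len(alignment[0])
    return [(i, j)
            for i in range(length)
            for j in range(i + 1, length)
            if column_correlation(alignment, i, j, scale) >= cutoff]
```

Raising `cutoff` returns fewer pairs, which mirrors the accuracy versus coverage trade-off noted in the abstract.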

19.
ABSTRACT: BACKGROUND: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. RESULTS: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naive Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. CONCLUSIONS: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs as well as including evaluations on independent test sets into the comparisons.
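The performance measures named in this abstract can be computed from a confusion matrix as in the sketch below. The definitions used are the standard ones (Sensitivity = TP/(TP+FN), Specificity = TN/(TN+FP)); this is an assumption, since some interface-prediction papers define Specificity differently, and the abstract does not give its formulas. The function name `interface_metrics` is hypothetical.

```python
import math


def interface_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and MCC from residue-level counts.

    tp/fp/tn/fn are counts of true/false positives/negatives, where a
    "positive" is a residue predicted to lie in the interface.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity,
            "specificity": specificity,
            "mcc": mcc}
```

Computing these per protein and then averaging, versus pooling all residues first, can give different numbers, which is consistent with the residue-level versus protein-level difference the abstract reports.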

20.
An improved general amino acid replacement matrix   Total citations: 2, self-citations: 0, citations by others: 2
Amino acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001) and their WAG matrix, thanks to an efficient maximum likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation and using a much larger and more diverse database than BRKALN, which was used to estimate WAG. To estimate our new matrix (called LG after the authors), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising approximately 50,000 sequences and approximately 6.5 million residues overall. To evaluate the LG performance, we use an independent sample consisting of 59 alignments from TreeBase and randomly divide Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that 1) the average Akaike information criterion gain per site is 0.25 and 0.42, when compared with WAG and JTT, respectively; 2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse with 2 alignments only; and 3) tree topologies inferred with LG, WAG, and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous. LG and a PHYML implementation can be downloaded from http://atgc.lirmm.fr/LG.
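The per-site Akaike information criterion gain quoted above can be illustrated with a small sketch. It uses the standard definition AIC = 2k - 2 ln L; the log-likelihood values in any call are placeholders supplied by the reader, and the function names and default parameter counts are assumptions, not details taken from the article.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: AIC = 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def aic_gain_per_site(lnl_new, lnl_ref, n_sites, k_new=0, k_ref=0):
    """AIC improvement of a new model over a reference, per alignment site.

    Positive values mean the new model fits better. With fixed replacement
    matrices and otherwise identical settings the parameter counts cancel,
    which is why they default to 0 in this sketch.
    """
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites
```

For example, a log-likelihood improvement of 50 units on a 200-site alignment with equal parameter counts corresponds to a gain of 0.5 per site.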
