首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Rasmussen TK  Krink T 《Bio Systems》2003,72(1-2):5-17
Multiple sequence alignment (MSA) is one of the basic problems in computational biology. Realistic problem instances of MSA are computationally intractable for exact algorithms. One way to tackle MSA is to use Hidden Markov Models (HMMs), which are known to be very powerful in the related problem domain of speech recognition. However, the training of HMMs is computationally hard and there is no known exact method that can guarantee optimal training within reasonable computing time. Perhaps the most powerful training method is the Baum-Welch algorithm, which is fast, but bears the problem of stagnation at local optima. In the study reported in this paper, we used a hybrid algorithm combining particle swarm optimization with evolutionary algorithms to train HMMs for the alignment of protein sequences. Our experiments show that our approach yields better alignments for a set of benchmark protein sequences than the most commonly applied HMM training methods, such as Baum-Welch and Simulated Annealing.  相似文献   

2.
MOTIVATION: We propose representing amino acids by bit-patterns so they may be used in a filter algorithm for similarity searches over protein databases, to rapidly eliminate non-homologous regions of database sequences. The filter algorithm would be based on dynamic programming optimization. It would have the advantage over previous filter algorithms that its substitution scoring function distinguishes between conservative and non-conservative amino acid substitutions. RESULTS: Simulated annealing was used to search for the best five-bit or three-bit patterns to represent amino acids, where similar amino acids were given similar bit-patterns. The similarity between amino acids was estimated from the BLOSUM45 matrix. Representing amino acids by these five-bit and three-bit patterns, the Escherichia coli PhoE precursor and the bacteriophage PA2 LC precursor were aligned. The alignments were nearly the same as that obtained when BLOSUM45 was used to score substitutions. AVAILABILITY: The C code of the optimization algorithm for searching for the optimal bit-pattern representation of amino acids is available from the authors upon request.  相似文献   

3.
Crystal structure data of globular proteins were used to prepare (phi, psi) probability maps of 20 proteinous amino acids. These maps were compared grid-wise with each other and a conformational similarity index was calculated for each pair of amino acids. A weight matrix, called Conformational Similarity Weight (CSW) matrix, was prepared using the conformational similarity index. This weight matrix was used to align sequences of 21 pairs of proteins whose crystal structures are known. The aligned regions with more than seven contiguous amino acids were further analysed by plotting average weight (W) values of overlapping hepatapeptides in these regions and carrying out curve fitting by Fourier series having TEN harmonics. The protein fragments corresponding to the half-linewidth of peaks were predicted as fragments having similar conformation in the protein pair under consideration. Such an approach allows us to pick up conformationally similar protein fragments with more than 67% accuracy.  相似文献   

4.
Abstract

Protein sequences are treated as stochastic processes on the basis of a reduced amino acid alphabet of 10 types of amino acids. The realization of a stochastic process is described by associated transition probability matrix that corresponds to the process uniquely. Then new distances between transition probability matrices are defined for sequences similarity analysis. Two separate datasets are prepared and tested to identify the validity of the method. The results demonstrate the new method is powerful and efficient.  相似文献   

5.
Evaluation measures of multiple sequence alignments.   总被引:1,自引:0,他引:1  
Multiple sequence alignments (MSAs) are frequently used in the study of families of protein sequences or DNA/RNA sequences. They are a fundamental tool for the understanding of the structure, functionality and, ultimately, the evolution of proteins. A new algorithm, the Circular Sum (CS) method, is presented for formally evaluating the quality of an MSA. It is based on the use of a solution to the Traveling Salesman Problem, which identifies a circular tour through an evolutionary tree connecting the sequences in a protein family. With this approach, the calculation of an evolutionary tree and the errors that it would introduce can be avoided altogether. The algorithm gives an upper bound, the best score that can possibly be achieved by any MSA for a given set of protein sequences. Alternatively, if presented with a specific MSA, the algorithm provides a formal score for the MSA, which serves as an absolute measure of the quality of the MSA. The CS measure yields a direct connection between an MSA and the associated evolutionary tree. The measure can be used as a tool for evaluating different methods for producing MSAs. A brief example of the last application is provided. Because it weights all evolutionary events on a tree identically, but does not require the reconstruction of a tree, the CS algorithm has advantages over the frequently used sum-of-pairs measures for scoring MSAs, which weight some evolutionary events more strongly than others. Compared to other weighted sum-of-pairs measures, it has the advantage that no evolutionary tree must be constructed, because we can find a circular tour without knowing the tree.  相似文献   

6.
Information about conformational properties of a protein is contained in the hydrophobicity values of the amino acids in its primary sequence. We have investigated the possibility of extracting meaningful evolutionary information from the comparison of the hydrophobicity values of the corresponding amino acids in the sequences of homologous proteins. Distance matrices for six families of homologous proteins were made on the basis of the differences in hydrophobicity values of the amino acids. The phylogenetic trees constructed from such matrices were at least as good (as judged from their faithful reflection of evolutionary relationships), as trees constructed from the usual minimum mutation distance matrix.  相似文献   

7.
Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with random chance and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They make the assumption that contiguity is conserved between homologous segments, which may not be true in genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment. An approach to classification without alignment based on the distribution of contiguous sequences of four amino acids (4-grams) was developed. Interest in 4-grams stemmed from the observation that almost all theoretically possible 4-grams (20(4)) occur in natural sequences and the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family. In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.  相似文献   

8.
Aligning amino acid sequences: comparison of commonly used methods   总被引:5,自引:0,他引:5  
We examined two extensive families of protein sequences using four different alignment schemes that employ various degrees of "weighting" in order to determine which approach is most sensitive in establishing relationships. All alignments used a similarity approach based on a general algorithm devised by Needleman and Wunsch. The approaches included a simple program, UM (unitary matrix), whereby only identities are scored; a scheme in which the genetic code is used as a basis for weighting (GC); another that employs a matrix based on structural similarity of amino acids taken together with the genetic basis of mutation (SG); and a fourth that uses the empirical log-odds matrix (LOM) developed by Dayhoff on the basis of observed amino acid replacements. The two sequence families examined were (a) nine different globins and (b) nine different tyrosine kinase-like proteins. It was assumed a priori that all members of a family share common ancestry. In cases where two sequences were more than 30% identical, alignments by all four methods were almost always the same. In cases where the percentage identity was less than 20%, however, there were often significant differences in the alignments. On the average, the Dayhoff LOM approach was the most effective in verifying distant relationships, as judged by an empirical "jumbling test." This was not universally the case, however, and in some instances the simple UM was actually as good or better. Trees constructed on the basis of the various alignments differed with regard to their limb lengths, but had essentially the same branching orders. We suggest some reasons for the different effectivenesses of the four approaches in the two different sequence settings, and offer some rules of thumb for assessing the significance of sequence relationships.  相似文献   

9.
A new method based on neural networks to cluster proteins into families is described. The network is trained with the Kohonen unsupervised learning algorithm, using matrix pattern representations of the protein sequences as inputs. The components (x, y) of these 20×20 matrix patterns are the normalized frequencies of all pairs xy of amino acids in each sequence. We investigate the influence of different learning parameters in the final topological maps obtained with a learning set of ten proteins belonging to three established families. In all cases, except in those where the synaptic vectors remains nearly unchanged during learning, the ten proteins are correctly classified into the expected families. The classification by the trained network of mutated or incomplete sequences of the learned proteins is also analysed. The neural network gives a correct classification for a sequence mutated in 21.5%±7% of its amino acids and for fragments representing 7.5%±3% of the original sequence. Similar results were obtained with a learning set of 32 proteins belonging to 15 families. These results show that a neural network can be trained following the Kohonen algorithm to obtain topological maps of protein sequences, where related proteins are finally associated to the same winner neuron or to neighboring ones, and that the trained network can be applied to rapidly classify new sequences. This approach opens new possibilities to find rapid and efficient algorithms to organize and search for homologies in the whole protein database.  相似文献   

10.
MOTIVATION: Profile hidden Markov models provide a sensitive method for performing sequence database search and aligning multiple sequences. One of the drawbacks of the hidden Markov model is that the conserved amino acids are not emphasized, but signal and noise are treated equally. For this reason, the number of estimated emission parameters is often enormous. Focusing the analysis on conserved residues only should increase the accuracy of sequence database search. RESULTS: We address this issue with a new method for efficient emission probability (EEP) estimation, in which amino acids are divided into effective and ineffective residues at each conserved alignment position. A practical study with 20 protein families demonstrated that the EEP method is capable of detecting family members from other proteins with sensitivity of 98% and specificity of 99% on the average, even if the number of free emission parameters was decreased to 15% of the original. In the database search for TIM barrel sequences, EEP recognizes the family members nearly as accurately as HMMER or Blast, but the number of false positive sequences was significantly less than that obtained with the other methods. AVAILABILITY: The algorithms written in C language are available on request from the authors.  相似文献   

11.
Sequence alignment is a standard method for the estimation of the evolutionary, structural, and functional relationships among amino acid sequences. The quality of alignments depends on the used similarity matrix. Statistical contact potentials (CPs) contain information on contact propensities among residues in native protein structures. Substitution matrices (SMs) based on CPs are applicable for the comparison of distantly related sequences. Here, contact between amino acids was estimated on the basis of the evaluation of the distances between side-chain terminal groups (SCTGs), which are defined as the group of the side-chain heavy atoms with fixed distances between them. In this paper, two new types of CPs and similarity matrices have been constructed: one based on fixed cutoff distance obtained from geometric characteristics of the SCTGs (TGC1), while the other is distance-dependent potential (TGC2). These matrices are compared with other popular SMs. The performance of the matrices was evaluated by comparing sequence with structural alignments. The obtained results show that TGC2 has the best performance among contact-based matrices, but on the whole, contact-based matrices have slightly lower performance than other SMs except fold-level similarity.  相似文献   

12.
Sequence alignment is a standard method for the estimation of the evolutionary, structural, and functional relationships among amino acid sequences. The quality of alignments depends on the used similarity matrix. Statistical contact potentials (CPs) contain information on contact propensities among residues in native protein structures. Substitution matrices (SMs) based on CPs are applicable for the comparison of distantly related sequences. Here, contact between amino acids was estimated on the basis of the evaluation of the distances between side-chain terminal groups (SCTGs), which are defined as the group of the side-chain heavy atoms with fixed distances between them. In this paper, two new types of CPs and similarity matrices have been constructed: one based on fixed cutoff distance obtained from geometric characteristics of the SCTGs (TGC1), while the other is distance-dependent potential (TGC2). These matrices are compared with other popular SMs. The performance of the matrices was evaluated by comparing sequence with structural alignments. The obtained results show that TGC2 has the best performance among contact-based matrices, but on the whole, contact-based matrices have slightly lower performance than other SMs except fold-level similarity.  相似文献   

13.
A widely used algorithm for computing an optimal local alignment between two sequences requires a parameter set with a substitution matrix and gap penalties. It is recognized that a proper parameter set should be selected to suit the level of conservation between sequences. We describe an algorithm for selecting an appropriate substitution matrix at given gap penalties for computing an optimal local alignment between two sequences. In the algorithm, a substitution matrix that leads to the maximum alignment similarity score is selected among substitution matrices at various evolutionary distances. The evolutionary distance of the selected substitution matrix is defined as the distance of the computed alignment. To show the effects of gap penalties on alignments and their distances and help select appropriate gap penalties, alignments and their distances are computed at various gap penalties. The algorithm has been implemented as a computer program named SimDist. The SimDist program was compared with an existing local alignment program named SIM for finding reciprocally best-matching pairs (RBPs) of sequences in each of 100 protein families, where RBPs are commonly used as an operational definition of orthologous sequences. SimDist produced more accurate results than SIM on 50 of the 100 families, whereas both programs produced the same results on the other 50 families. SimDist was also used to compare three types of substitution matrices in scoring 444,461 pairs of homologous sequences from the 100 families.  相似文献   

14.
M Pieber  J Tohá 《Origins of life》1983,13(2):139-146
The frequency of amino acid replacements in families of typical proteins has been elegantly analyzed by Argyle (1980) showing that the most frequent replacements involve a conservation of the amino acid chemical properties. The cyclic arrangement of the twenty amino acids resulting from the most frequent replacements has been described as an amino acid chemical ring. In this work, a novel amino acid replacement frequency ring is proposed, for which a conservation of over 90% of the most general physico-chemical properties can be deduced. The amino acid chemical similarity ring is also analyzed in terms of the genetic code base probability changes, showing that the discrepancy that exists between the standard deviation value of the amino acid replacement frequency matrix and its respective ideal value is almost equal to that deduced from the corresponding base codon replacement probability matrices. These differences are finally evaluated and discussed in terms of the restrictions imposed by the structure of the genetic code and the physico-chemical dissimilarities between some codons of amino acids which are chemically similar.  相似文献   

15.
We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen''s unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis.  相似文献   

16.
We propose a model that explains the hierarchical organization of proteins in fold families. The model, which is based on the evolutionary selection of proteins by their native state stability, reproduces patterns of amino acids conserved across protein families. Due to its dynamic nature, the model sheds light on the evolutionary time-scales. By studying the relaxation of the correlation function between consecutive mutations at a given position in proteins, we observe separation of the evolutionary time-scales: at short time intervals families of proteins with similar sequences and structures are formed, while at long time intervals the families of structurally similar proteins that have low sequence similarity are formed. We discuss the evolutionary implications of our model. We provide a "profile" solution to our model and find agreement between predicted patterns of conserved amino acids and those actually observed in nature.  相似文献   

17.

Background  

The chemical property and biological function of a protein is a direct consequence of its primary structure. Several algorithms have been developed which determine alignment and similarity of primary protein sequences. However, character based similarity cannot provide insight into the structural aspects of a protein. We present a method based on spectral similarity to compare subsequences of amino acids that behave similarly but are not aligned well by considering amino acids as mere characters. This approach finds a similarity score between sequences based on any given attribute, like hydrophobicity of amino acids, on the basis of spectral information after partial conversion to the frequency domain.  相似文献   

18.
Kawabata T  Nishikawa K 《Proteins》2000,41(1):108-122
A number of automatic protein structure comparison methods have been proposed; however, their similarity score functions are often decided by the researchers' intuition and trial-and-error, and not by theoretical background. We propose a novel theory to evaluate protein structure similarity, which is based on the Markov transition model of evolution. Our similarity score between structures i and j is defined as log P(j --> i)/P(i), where P(j --> i) is the probability that structure j changes to structure i during the evolutionary process, and P(i) is the probability that structure i appears by chance. This is a reasonable definition of structure similarity, especially for finding evolutionarily related (homologous) similarity. The probability P(j --> i) is estimated by the Markov transition model, which is similar to the Dayhoff's substitution model between amino acids. To estimate the parameters of the model, homologous protein structure pairs are collected using sequence similarity, and the numbers of structure transitions within the pairs are counted. Next these numbers are transformed to a transition probability matrix of the Markov transition. Transition probabilities for longer time are obtained by multiplying the probability matrix by itself several times. In this study, we generated three types of structure similarity scores: an environment score, a residue-residue distance score, and a secondary structure elements (SSE) score. Using these scores, we developed the structure comparison program, Matras (MArkovian TRAnsition of protein Structure). It employs a hierarchical alignment algorithm, in which a rough alignment is first obtained by SSEs, and then is improved with more detailed functions. We attempted an all-versus-all comparison of the SCOP database, and evaluated its ability to recognize a superfamily relationship, which was manually assigned to be homologous in the SCOP database. A comparison with the FSSP database shows that our program can recognize more homologous similarity than FSSP. We also discuss the reliability of our method, by studying the disagreement between structural classifications by Matras and SCOP.  相似文献   

19.
Tillier ER  Biro L  Li G  Tillo D 《Proteins》2006,63(4):822-831
Approaches for the determination of interacting partners from different protein families (such as ligands and their receptors) have made use of the property that interacting proteins follow similar patterns and relative rates of evolution. Interacting protein partners can then be predicted from the similarity of their phylogenetic trees or evolutionary distances matrices. We present a novel method called Codep, for the determination of interacting protein partners by maximizing co-evolutionary signals. The order of sequences in the multiple sequence alignments from two protein families is determined in such a manner as to maximize the similarity of substitution patterns at amino acid sites in the two alignments and, thus, phylogenetic congruency. This is achieved by maximizing the total number of interdependencies of amino acids sites between the alignments. Once ordered, the corresponding sequences in the two alignments indicate the predicted interacting partners. We demonstrate the efficacy of this approach with computer simulations and in analyses of several protein families. A program implementing our method, Codep, is freely available to academic users from our website: http://www.uhnresearch.ca/labs/tillier/.  相似文献   

20.
Characterization of protein primary sequences based on partial ordering   总被引:1,自引:0,他引:1  
In this paper, we present a new approach to characterize protein sequences. Based on orderings of the 20 natural amino acids which reflect some of their physico-chemical properties, we construct an augmented Hasse matrix for each protein sequence. Furthermore, the normalized leading eigenvalues of these matrices are computed and considered as invariants for the protein sequences. Finally, we make a comparison for the similarity/diversity of nine different protein sequences.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号