Found 20 similar articles; search took 15 ms
1.
On reduced amino acid alphabets for phylogenetic inference Total citations: 1, self-citations: 0, citations by others: 1
We investigate the use of Markov models of evolution for reduced amino acid alphabets or bins of amino acids. The use of reduced amino acid alphabets can ameliorate effects of model misspecification and saturation. We present algorithms for two different ways of automating the construction of bins: minimizing criteria based on properties of rate matrices and minimizing criteria based on properties of alignments. By simulation, we show that in the absence of model misspecification, the loss of information due to binning is insubstantial, and that the use of Markov models at the binned level is almost as effective as the more appropriate missing data approach. By applying these approaches to real data sets where compositional heterogeneity and/or saturation appear to be causing biased tree estimation, we find that binning can improve topological estimation in practice.
2.
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets, and their performance was assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using the standard alphabet of 20 amino acids as a reference frame. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may help increase the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs.
3.
Background
Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.
4.
Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio, sequence-only prediction method based on reduced amino acid alphabets and convolutional neural networks (CNNs), which tries to overcome the challenge of accurate prediction posed by IDRs. We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
5.
In this study, n-peptide compositions are utilized for protein vectorization over a discriminative remote homology detection framework based on support vector machines (SVMs). The size of the amino acid alphabet is gradually reduced for increasing values of n so that the method fits within the memory resources of conventional workstations. A hash structure is implemented for accelerated search of n-peptides. The method is tested for its ability to classify proteins into families on a subset of the SCOP family database and compared against many of the existing homology detection methods, including the most popular generative methods, SAM-98 and PSI-BLAST, and the recent SVM methods, SVM-Fisher, SVM-BLAST and SVM-Pairwise. The results demonstrate that the new method significantly outperforms SVM-Fisher, SVM-BLAST, SAM-98 and PSI-BLAST, while achieving an accuracy comparable to SVM-Pairwise. In terms of efficiency, it performs much better than SVM-Pairwise. It is shown that the information of n-peptide compositions with reduced amino acid alphabets provides an accurate and efficient means of protein vectorization for SVM-based sequence classification.
6.
Local homology recognition and distance measures in linear time using compressed amino acid alphabets Total citations: 1, self-citations: 0, citations by others: 1
Edgar RC 《Nucleic acids research》2004,32(1):380-385
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. A method that achieves coverage comparable to FFT is described here and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
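The k-mer idea in the abstract above can be illustrated with a minimal Python sketch. It maps each sequence onto a compressed alphabet and then scores two sequences by the fraction of k-mers they share. The residue grouping in `GROUPS`, the function names, and the particular similarity formula are assumptions made for illustration; they are not taken from the paper and may differ from the measures it evaluates.

```python
from collections import Counter

# Hypothetical 7-letter grouping, for illustration only (not from the paper).
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DE", "G", "P"]
# Each residue is represented by the first letter of its group.
COMPRESS = {aa: group[0] for group in GROUPS for aa in group}


def compress(seq):
    """Rewrite a protein sequence in the compressed alphabet."""
    return "".join(COMPRESS.get(aa, "X") for aa in seq)


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k (one pass, hashed)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_similarity(a, b, k=3):
    """Fraction of k-mers shared by a and b (assumed measure, in [0, 1])."""
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 0.0
    # Counter intersection keeps the minimum count of each common k-mer.
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    return shared / denom
```

With this grouping, `kmer_similarity("VLKD", "ILRE", 2)` finds no common 2-mers on the raw sequences, while the compressed forms of both sequences are identical, which is the kind of effect a compressed alphabet is meant to have on distantly related sequences.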
7.
Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids
Sequence alignment is a common method for finding structurally conserved/similar regions in proteins. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these sequences, different residues may play similar structural roles and are incorrectly aligned during sequence alignment using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of structurally conserved/similar regions in proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9.
8.
Sequence alignment is a common method for finding structurally conserved/similar regions in proteins. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these sequences, different residues may play similar structural roles and are incorrectly aligned during sequence alignment using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of structurally conserved/similar regions in proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9.
Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear Project (973) of the NSM
9.
《Science in China: Life Sciences (English Edition)》2007,(3)
Sequence alignment is a common method for finding structurally conserved/similar regions in proteins. However, sequence alignment is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these sequences, different residues may play similar structural roles and are incorrectly aligned during sequence alignment using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced while the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative entropy analysis. The reduced alphabets are applied to recognition of structurally conserved/similar regions in proteins by sequence alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced alphabet with N around 9.
10.
Simplified amino acid alphabets for protein fold recognition and implications for folding Total citations: 6, self-citations: 0, citations by others: 6
Protein design experiments have shown that the use of specific subsets of amino acids can produce foldable proteins. This prompts the question of whether there is a minimal amino acid alphabet that could be used to fold all proteins. In this work we make an analogy between sequence patterns that produce foldable sequences and those that make it possible to detect structural homologs by aligning sequences, and use it to suggest the possible size of such a reduced alphabet. We estimate that reduced alphabets containing 10-12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that there is little loss of the information necessary to pick out structural homologs in a clustered protein sequence database when a suitable reduction of the amino acid alphabet from 20 to 10 letters is made, but that this information is rapidly degraded when further reductions in the alphabet are made.
11.
Pattern recognition receptors (PRRs) play a key role in the innate immune response by recognizing pathogen-associated molecular patterns derived from a diverse collection of microbial pathogens. PRRs form a superfamily of proteins related to host health and disease. Thus, prediction of the PRR family might supply biologically significant information for functional annotation of PRRs and development of novel drugs. In this paper, a computational method is proposed for predicting the families of PRRs. The prediction was performed on the basis of amino acid composition and pseudo-amino acid composition (PseAAC) from primary sequences of proteins using support vector machines. A non-redundant dataset consisting of 332 PRRs in seven families was constructed for training and testing. It was demonstrated that different families of PRRs were quite closely correlated with amino acid composition as well as PseAAC. In the jackknife test, overall accuracies of amino acid composition-based and PseAAC-based classifiers reached 96.1% and 97.9%, respectively. The results indicate that families of PRRs are predictable with high accuracy. It is anticipated that this computational method might be a powerful tool for the automated assignment of families of PRRs.
12.
Knowledge of the types of membrane protein provides useful clues in deducing the functions of uncharacterized membrane proteins. An automatic method for efficiently identifying uncharacterized proteins is thus highly desirable. In this work, we have developed a novel method for predicting membrane protein types by exploiting the discrimination capability of the difference in amino acid composition at the N and C terminus through split amino acid composition (SAAC). We also show that ensemble classification can better exploit this discriminating capability of SAAC. In this study, membrane protein types are classified using three feature extraction and several classification strategies. An ensemble classifier, Mem-EnsSAAC, is then developed using the best feature extraction strategy. Pseudo amino acid (PseAA) composition, discrete wavelet analysis (DWT), SAAC, and a hybrid model are employed for feature extraction. The nearest neighbor, probabilistic neural network, support vector machine, random forest, and Adaboost are used as individual classifiers. The predicted results of the individual learners are combined using a genetic algorithm to form an ensemble classifier, Mem-EnsSAAC, yielding an accuracy of 92.4% and 92.2% for the jackknife and independent dataset tests, respectively. Performance measures such as MCC, sensitivity, specificity, F-measure, and Q-statistics show that SAAC-based prediction yields significantly higher performance compared to PseAA- and DWT-based systems, and is also the best reported so far. The proposed Mem-EnsSAAC is able to predict membrane protein types with high accuracy and, consequently, can be very helpful in drug discovery. It can be accessed at http://111.68.99.218/membrane.
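A rough Python sketch of a split amino acid composition feature vector follows. The abstract above does not give segment lengths or say how many parts the sequence is split into. The three-way split (N-terminal segment, middle, C-terminal segment), the default length of 25 residues per terminal segment, and the function names are all assumptions made for illustration, not the authors' settings.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition(segment):
    """Fraction of each standard amino acid in a segment (20 values)."""
    n = len(segment)
    if n == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [segment.count(aa) / n for aa in AMINO_ACIDS]


def split_composition(seq, n_term=25, c_term=25):
    """Concatenate the compositions of the N-terminal segment, the middle
    region and the C-terminal segment into one 60-value feature vector.

    n_term and c_term are hypothetical segment lengths chosen for this sketch.
    """
    head = seq[:n_term]
    tail = seq[-c_term:] if c_term > 0 else ""
    middle = seq[n_term:len(seq) - c_term]  # empty for very short sequences
    return composition(head) + composition(middle) + composition(tail)
```

For a toy sequence such as `"A" * 10 + "C" * 10 + "D" * 10` with both segment lengths set to 10, the three blocks of 20 values each contain a single 1.0, at the positions of A, C and D respectively. This shows how the split keeps terminal composition separate from that of the rest of the chain.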
13.
14.
Identification of antigenic sites on proteins is of vital importance for developing synthetic peptide vaccines, immunodiagnostic tests and antibody production. Currently, most of the prediction algorithms rely on amino acid propensity scales using a sliding window approach. These methods are oversimplified and yield poor predictions in practice. In this paper, a novel scale, called the amino acid pair (AAP) antigenicity scale, is proposed that is based on the finding that B-cell epitopes favor particular AAPs. It is demonstrated that, using a support vector machine (SVM) classifier, the AAP antigenicity scale approach performs much better than the existing scales based on single amino acid propensity. The AAP antigenicity scale can reflect some special sequence-coupled features in the B-cell epitopes, which is the essential reason why the new approach is superior to the existing ones. It is anticipated that with the continuous increase of known epitope data, the power of the AAP antigenicity scale approach will be further enhanced.
15.
MOTIVATION: Protein and DNA are generally represented by sequences of letters. In a number of circumstances simplified alphabets (where one or more letters would be represented by the same symbol) have proved their potential utility in several fields of bioinformatics including searching for patterns occurring at an unexpected rate, studying protein folding and finding consensus sequences in multiple alignments. The main issue addressed in this paper is the possibility of finding a general approach that would allow an exhaustive analysis of all the possible simplified alphabets, using substitution matrices like PAM and BLOSUM as a measure for scoring. RESULTS: The computational approach presented in this paper has led to a computer program called AlphaSimp (Alphabet Simplifier) that can perform an exhaustive analysis of the possible simplified amino acid alphabets, using a branch and bound algorithm together with standard or user-defined substitution matrices. The program returns a ranked list of the highest-scoring simplified alphabets. When the extent of the simplification is limited and the simplified alphabets are kept above ten symbols, the program is able to complete the analysis in minutes or even seconds on a personal computer. However, the performance becomes worse, taking up to several hours, for highly simplified alphabets. AVAILABILITY: AlphaSimp and other accessory programs are available at http://bioinformatics.cribi.unipd.it/alphasimp
16.
17.
We evaluated the prediction of beta-turns from amino acid sequences using the residue-coupled model with an enlarged representative protein data set selected from the Protein Data Bank. Our results show that the probability values derived from a data set comprising 425 protein chains yielded an overall beta-turn prediction accuracy of 68.74%, compared with 94.7% reported earlier on a data set of 30 proteins using the same method. However, we noted that the overall beta-turn prediction accuracy using probability values derived from the 30-protein data set drops to 40.74% when tested on the data set comprising 425 protein chains. In contrast, using probability values derived from the 425-chain data set used in this analysis, the overall beta-turn prediction accuracy was consistent when tested on the 30-protein data set used earlier (64.62%), on a more recent representative data set comprising 619 protein chains (64.66%), or on a jackknife data set comprising 476 representative protein chains (63.38%). We therefore recommend the use of probability values derived from the data set of 425 representative protein chains reported here, which gives more realistic and consistent predictions of beta-turns from amino acid sequences.
18.
Raha K, Wollacott AM, Italia MJ, Desjarlais JR 《Protein science : a publication of the Protein Society》2000,9(6):1106-1119
We have developed a method for the prediction of an amino acid sequence that is compatible with a three-dimensional backbone structure. Using only a backbone structure of a protein as input, the algorithm is capable of designing sequences that closely resemble natural members of the protein family to which the template structure belongs. In general, the predicted sequences are shown to have multiple sequence profile scores that are dramatically higher than those of random sequences, and sometimes better than some of the natural sequences that make up the superfamily. As anticipated, highly conserved but poorly predicted residues are often those that contribute to the functional rather than structural properties of the protein. Overall, our analysis suggests that statistical profile scores of designed sequences are a novel and valuable figure of merit for assessing and improving protein design algorithms.
19.
As one of the most important reversible protein post-translational modifications, ubiquitination has been reported to be involved in many biological processes and closely implicated in various diseases. To fully decipher the molecular mechanisms of ubiquitination-related biological processes, an initial but crucial step is the recognition of ubiquitylated substrates and the corresponding ubiquitination sites. Here, a new bioinformatics tool named CKSAAP_UbSite was developed to predict ubiquitination sites from protein sequences. With the assistance of a Support Vector Machine (SVM), the highlight of CKSAAP_UbSite is to employ the composition of k-spaced amino acid pairs surrounding a query site (i.e. any lysine in a query sequence) as input. When trained and tested on the dataset of yeast ubiquitination sites (Radivojac et al, Proteins, 2010, 78: 365-380), a 100-fold cross-validation on a 1:1 ratio of positive and negative samples revealed that the accuracy and MCC of CKSAAP_UbSite reached 73.40% and 0.4694, respectively. The proposed CKSAAP_UbSite has also been intensively benchmarked and exhibits better performance than some existing predictors, suggesting that it can serve as a useful tool for the community. Currently, CKSAAP_UbSite is freely accessible at http://protein.cau.edu.cn/cksaap_ubsite/. Moreover, we also found that the sequence patterns around ubiquitination sites are not conserved across different species. To ensure a reasonable prediction performance, the application of the current CKSAAP_UbSite should be limited to the proteome of yeast.
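The encoding named in the abstract above, the composition of k-spaced amino acid pairs around each lysine, can be sketched in Python as follows. The flank size, the maximum gap `k_max`, the padding character `X` for windows that run past the sequence ends, and the function names are assumptions chosen for illustration. The abstract does not state the tool's actual settings, and this is not its implementation.

```python
from collections import Counter


def lysine_windows(seq, flank=3, pad="X"):
    """Return (position, window) for every lysine, where the window covers
    `flank` residues on each side; ends are padded (an assumption)."""
    padded = pad * flank + seq + pad * flank
    return [(i, padded[i:i + 2 * flank + 1])
            for i, aa in enumerate(seq) if aa == "K"]


def k_spaced_pair_composition(window, k_max=2):
    """Map (k, pair) to the frequency of residue pairs separated by k
    intervening positions, for k = 0 .. k_max."""
    features = {}
    for k in range(k_max + 1):
        n_pairs = len(window) - k - 1  # number of pairs at this spacing
        if n_pairs <= 0:
            continue
        counts = Counter(window[i] + window[i + k + 1] for i in range(n_pairs))
        for pair, count in counts.items():
            features[(k, pair)] = count / n_pairs
    return features
```

For the toy window `"AKA"` with `k_max=1`, the adjacent pairs `AK` and `KA` each get frequency 0.5 and the single 1-spaced pair `AA` gets frequency 1.0. In a real pipeline the sparse dictionary would be expanded into a fixed-length vector before being passed to a classifier.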
20.
Knowing protein structure and inferring its function from the structure is one of the main issues of computational structural biology, and often the first step is studying protein secondary structure. There have been many attempts to predict protein secondary structure content. Previous attempts assumed that the content of protein secondary structure can be predicted successfully using information on the amino acid composition of a protein. Recent methods achieved remarkable prediction accuracy by using expanded composition information. The overall average error of the most successful method is 3.4%. Here, we demonstrate that even if we use only the simple amino acid composition information, it is possible to improve the prediction accuracy significantly if evolutionary information is included. The idea is motivated by the observation that evolutionarily related proteins share similar structures. After calculating the homolog-averaged amino acid composition of a protein, which can be easily obtained from the multiple sequence alignment by running PSI-BLAST, those 20 numbers are learned by a multiple linear regression, an artificial neural network and a support vector regression. The overall average error of the method using support vector regression is 3.3%. It is remarkable that we obtain comparable accuracy without utilizing expanded composition information such as pair-coupled amino acid composition. This work again demonstrates that the amino acid composition is a fundamental characteristic of a protein. It is anticipated that our novel idea can be applied to many areas of protein bioinformatics where amino acid composition information is utilized, such as subcellular localization prediction, enzyme subclass prediction, domain boundary prediction, signal sequence prediction, and prediction of unfolded segments in a protein sequence, to name a few.