Found 20 similar articles (search time: 15 ms)
1.
Ketoacyl synthases are enzymes involved in fatty acid synthesis and can be classified into five families based on primary
sequence similarity. Different families have different catalytic mechanisms. Developing cost-effective computational models
to identify the family of ketoacyl synthases will be helpful for enzyme engineering and in knowing individual enzymes’ catalytic
mechanisms. In this work, a support vector machine-based method was developed to predict ketoacyl synthase family using the
n-peptide composition of reduced amino acid alphabets. In jackknife cross-validation, the model based on the 2-peptide composition
of a reduced amino acid alphabet of size 13 yielded the best overall accuracy of 96.44% with average accuracy of 93.36%, which
is superior to other state-of-the-art methods. This result suggests that the information in n-peptide compositions of reduced amino acid alphabets provides an efficient means for enzyme family classification and that the
proposed model can be efficiently used for ketoacyl synthase family annotation.
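
The feature described in this abstract, the n-peptide composition over a reduced amino acid alphabet, can be sketched as follows. This is only an illustration under stated assumptions: the five-group alphabet `GROUPS` is a hypothetical grouping chosen for the example (the size-13 alphabet used in the paper is not given in this listing), and the function names are made up here. The resulting vector is the kind of input one would pass to a support vector machine.

```python
from itertools import product

# Hypothetical 5-group reduced alphabet, for illustration only.
# Each group is represented by its first letter.
GROUPS = ["AVLIMC", "FWYH", "STNQ", "KR", "DEGP"]


def build_mapping(groups):
    """Map every residue to the representative letter of its group."""
    return {aa: group[0] for group in groups for aa in group}


def n_peptide_composition(seq, groups=GROUPS, n=2):
    """Return the normalized n-peptide frequencies of `seq` after
    rewriting it in the reduced alphabet defined by `groups`."""
    mapping = build_mapping(groups)
    # Residues outside the 20 standard amino acids are skipped.
    reduced = "".join(mapping[aa] for aa in seq.upper() if aa in mapping)
    letters = [group[0] for group in groups]
    keys = ["".join(p) for p in product(letters, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = max(len(reduced) - n + 1, 0)
    for i in range(total):
        counts[reduced[i:i + n]] += 1
    return [counts[k] / total if total else 0.0 for k in keys]


vec = n_peptide_composition("AVKR")  # reduced sequence is "AAKK"
print(len(vec))  # 5 groups, n = 2, so 5**2 = 25 features
```

With 20 residue types a 2-peptide vector has 400 dimensions; a reduced alphabet of size m shrinks this to m**2, which is the motivation for the reduction.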
2.
On reduced amino acid alphabets for phylogenetic inference (cited by: 1; self-citations: 0; citations by others: 1)
We investigate the use of Markov models of evolution for reduced amino acid alphabets or bins of amino acids. The use of reduced amino acid alphabets can ameliorate effects of model misspecification and saturation. We present algorithms for 2 different ways of automating the construction of bins: minimizing criteria based on properties of rate matrices and minimizing criteria based on properties of alignments. By simulation, we show that in the absence of model misspecification, the loss of information due to binning is found to be insubstantial, and the use of Markov models at the binned level is found to be almost as effective as the more appropriate missing data approach. By applying these approaches to real data sets where compositional heterogeneity and/or saturation appear to be causing biased tree estimation, we find that binning can improve topological estimation in practice.
3.
Reduced or simplified amino acid alphabets group the 20 naturally occurring amino acids into a smaller number of representative protein residues. To date, several reduced amino acid alphabets have been proposed, which have been derived and optimized by a variety of methods. The resulting reduced amino acid alphabets have been applied to pattern recognition, generation of consensus sequences from multiple alignments, protein folding, and protein structure prediction. In this work, amino acid substitution matrices and statistical potentials were derived based on several reduced amino acid alphabets and their performance assessed in a large benchmark for the tasks of sequence alignment and fold assessment of protein structure models, using as a reference frame the standard alphabet of 20 amino acids. The results showed that a large reduction in the total number of residue types does not necessarily translate into a significant loss of discriminative power for sequence alignment and fold assessment. Therefore, some definitions of a few residue types are able to encode most of the relevant sequence/structure information that is present in the 20 standard amino acids. Based on these results, we suggest that the use of reduced amino acid alphabets may allow increasing the accuracy of current substitution matrices and statistical potentials for the prediction of protein structure of remote homologs.
4.
Heat shock proteins (HSPs) are a class of functionally related proteins present in all living organisms, both prokaryotes and eukaryotes. They play essential roles in protein–protein interactions such as folding and assisting in the establishment of proper protein conformation and prevention of unwanted protein aggregation. Their dysfunction may cause various life-threatening disorders, such as Parkinson’s, Alzheimer’s, and cardiovascular diseases. Based on their functions, HSPs are usually classified into six families: (i) HSP20 or sHSP, (ii) HSP40 or J-class proteins, (iii) HSP60 or GroEL/ES, (iv) HSP70, (v) HSP90, and (vi) HSP100. Although considerable progress has been achieved in discriminating HSPs from other proteins, it is still a big challenge to identify HSPs among their six different functional types according to their sequence information alone. With the avalanche of protein sequences generated in the post-genomic age, it is highly desirable to develop a high-throughput computational tool in this regard. To take up such a challenge, a predictor called iHSP-PseRAAAC has been developed by incorporating the reduced amino acid alphabet information into the general form of pseudo amino acid composition. One of the remarkable advantages of introducing the reduced amino acid alphabet is being able to avoid the notorious dimension disaster or overfitting problem in statistical prediction. It was observed that the overall success rate achieved by iHSP-PseRAAAC in identifying the functional types of HSPs among the aforementioned six types was more than 87%, which was derived by the jackknife test on a stringent benchmark dataset in which none of the HSPs included has ≥40% pairwise sequence identity to any other in the same subset. It has not escaped our notice that the reduced amino acid alphabet approach can also be used to investigate other protein classification problems.
As a user-friendly web server, iHSP-PseRAAAC is accessible to the public at http://lin.uestc.edu.cn/server/iHSP-PseRAAAC.
5.
Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment
is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these
sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment
using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues
can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at
the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might
be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new
clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of
residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative
entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence
alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced
alphabet with N around 9.
6.
Sequence alignment is a common method for finding protein structurally conserved/similar regions. However, sequence alignment
is often not accurate if sequence identities between to-be-aligned sequences are less than 30%. This is because, for these
sequences, different residues may play similar structural roles and they are incorrectly aligned during the sequence alignment
using a substitution matrix consisting of 20 types of residues. Based on the similarity of physicochemical features, residues
can be clustered into a few groups. Using such simplified alphabets, the complexity of protein sequences is reduced and at
the same time the key information encoded in the sequences remains. As a result, the accuracy of sequence alignment might
be improved if the residues are properly clustered. Here, by using a database of aligned protein structures (DAPS), a new
clustering method based on the substitution scores is proposed for the grouping of residues, and substitution matrices of
residues at different levels of simplification are constructed. The validity of the reduced alphabets is confirmed by relative
entropy analysis. The reduced alphabets are applied to recognition of protein structurally conserved/similar regions by sequence
alignment. The results indicate that the accuracy or efficiency of sequence alignment can be improved with the optimal reduced
alphabet with N around 9.
Supported by the National Natural Science Foundation of China (Grant Nos. 90403120, 10474041 and 10021001) and the Nonlinear
Project (973) of the NSM
7.
In this study, n-peptide compositions are utilized for protein vectorization over a discriminative remote homology detection framework based on support vector machines (SVMs). The size of the amino acid alphabet is gradually reduced for increasing values of n so that the method fits within the memory resources of conventional workstations. A hash structure is implemented for accelerated search of n-peptides. The method is tested for its ability to classify proteins into families on a subset of the SCOP family database and compared against many of the existing homology detection methods, including the most popular generative methods (SAM-98 and PSI-BLAST) and the recent SVM methods (SVM-Fisher, SVM-BLAST and SVM-Pairwise). The results have demonstrated that the new method significantly outperforms SVM-Fisher, SVM-BLAST, SAM-98 and PSI-BLAST, while achieving an accuracy comparable to that of SVM-Pairwise. In terms of efficiency, it performs much better than SVM-Pairwise. It is shown that the information of n-peptide compositions with reduced amino acid alphabets provides an accurate and efficient means of protein vectorization for SVM-based sequence classification.
8.
Simplified amino acid alphabets for protein fold recognition and implications for folding (cited by: 6; self-citations: 0; citations by others: 6)
Protein design experiments have shown that the use of specific subsets of amino acids can produce foldable proteins. This prompts the question of whether there is a minimal amino acid alphabet which could be used to fold all proteins. In this work we make an analogy between sequence patterns which produce foldable sequences and those which make it possible to detect structural homologs by aligning sequences, and use it to suggest the possible size of such a reduced alphabet. We estimate that reduced alphabets containing 10-12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that there is little loss of the information necessary to pick out structural homologs in a clustered protein sequence database when a suitable reduction of the amino acid alphabet from 20 to 10 letters is made, but that this information is rapidly degraded when further reductions in the alphabet are made.
9.
Intrinsically disordered regions (IDRs) play an important role in key biological processes and are closely related to human diseases. IDRs have great potential to serve as targets for drug discovery, most notably in disordered binding regions. Accurate prediction of IDRs is challenging because their genome-wide occurrence and a low ratio of disordered residues make them difficult targets for traditional classification techniques. Existing computational methods mostly rely on sequence profiles to improve accuracy, which is time-consuming and computationally expensive. This article describes an ab initio sequence-only prediction method, which tries to overcome the challenge of accurate prediction posed by IDRs, based on reduced amino acid alphabets and convolutional neural networks (CNNs). We experiment with six different 3-letter reduced alphabets. We argue that the dimensional reduction in the input alphabet facilitates the detection of complex patterns within the sequence by the convolutional step. Experimental results show that our proposed IDR predictor performs at the same level as or outperforms other state-of-the-art methods in the same class, achieving an accuracy of 0.76 and an AUC of 0.85 on the publicly available Critical Assessment of protein Structure Prediction dataset (CASP10). Therefore, our method is suitable for proteome-wide disorder prediction, yielding similar or better accuracy than existing approaches at a faster speed.
10.
An incorrect version of Figure 3 was published in the above article; the corrected version is reproduced below.
11.
Coordinated amino acid changes in homologous protein families (cited by: 4; self-citations: 0; citations by others: 4)
In the tobamovirus coat protein family, amino acid residues at some spatially close positions are found to be substituted in a coordinated manner [Altschuh et al. (1987) J. Mol. Biol., 193, 693]. Therefore, these positions show an identical pattern of amino acid substitutions when amino acid sequences of these homologous proteins are aligned. Based on this principle, coordinated substitutions have been searched for in three additional protein families: serine proteases, cysteine proteases and the haemoglobins. Coordinated changes have been found in all three protein families mostly within structurally constrained regions. This method works with a varying degree of success depending on the function of the proteins, the range of sequence similarities and the number of sequences considered. By relaxing the criteria for residue selection, the method was adapted to cover a broader range of protein families and to study regions of the proteins having weaker structural constraints. The information derived by these methods provides a general guide for engineering of a large variety of proteins to analyse structure-function relationships.
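
The principle stated in this abstract, that coordinated positions show an identical pattern of substitutions across an alignment, can be illustrated with a toy sketch. It assumes the strictest possible criterion (two columns are reported together only if they split the sequences into exactly the same groups) and ignores spatial proximity and the relaxed selection criteria the abstract mentions, neither of which is specified in this listing. The function names and the example alignment are invented for the illustration.

```python
from collections import defaultdict


def column_pattern(column):
    """Encode a column by the order in which distinct residues first
    appear, e.g. ('K', 'K', 'R', 'R') -> (0, 0, 1, 1). Two columns get
    the same code exactly when they partition the sequences identically."""
    first_seen = {}
    return tuple(first_seen.setdefault(res, len(first_seen)) for res in column)


def coordinated_positions(alignment):
    """Group variable alignment columns that share a substitution pattern.

    `alignment` is a list of equal-length aligned sequences. Returns a
    list of lists of 0-based column indices."""
    groups = defaultdict(list)
    for pos, column in enumerate(zip(*alignment)):
        pattern = column_pattern(column)
        if len(set(pattern)) > 1:  # skip fully conserved columns
            groups[pattern].append(pos)
    return [cols for cols in groups.values() if len(cols) > 1]


# Made-up alignment: columns 0 and 2 change together, as do 1 and 3.
toy = ["AKLE", "AKLE", "ARLD", "GRMD"]
print(coordinated_positions(toy))
```

On real alignments an exact-match criterion like this is very strict, which is consistent with the abstract's remark that the criteria had to be relaxed to cover more families.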
12.
Background
The functional sites of a protein present important information for determining its cellular function and are fundamental in drug design. Accordingly, accurate methods for the prediction of functional sites are of immense value. Most available methods are based on a set of homologous sequences and structural or evolutionary information, and assume that functional sites are more conserved than the average. In the analysis presented here, we have investigated the conservation of location and type of amino acids at functional sites, and compared the behaviour of functional sites between different protein domains.
13.
Local homology recognition and distance measures in linear time using compressed amino acid alphabets (cited by: 1; self-citations: 0; citations by others: 1)
Edgar RC 《Nucleic acids research》2004,32(1):380-385
Methods for discovery of local similarities and estimation of evolutionary distance by identifying k-mers (contiguous subsequences of length k) common to two sequences are described. Given unaligned sequences of length L, these methods have O(L) time complexity. The ability of compressed amino acid alphabets to extend these techniques to distantly related proteins was investigated. The performance of these algorithms was evaluated for different alphabets and choices of k using a test set of 1848 pairs of structurally alignable sequences selected from the FSSP database. Distance measures derived from k-mer counting were found to correlate well with percentage identity derived from sequence alignments. Compressed alphabets were seen to improve performance in local similarity discovery, but no evidence was found of improvements when applied to distance estimates. The performance of our local similarity discovery method was compared with the fast Fourier transform (FFT) used in MAFFT, which has O(L log L) time complexity. The method described here achieves coverage comparable to FFT and is more than an order of magnitude faster. We suggest using k-mer distance for fast, approximate phylogenetic tree construction, and show that a speed improvement of more than three orders of magnitude can be achieved relative to standard distance methods, which require alignments.
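
A minimal sketch of k-mer counting between two unaligned sequences, optionally over a compressed alphabet, is given below. It uses one common similarity definition (shared k-mer count divided by the number of k-mers in the shorter sequence), which is an assumption and not necessarily the exact measure evaluated in the paper; the `mapping` argument and the example grouping of I and V with L are likewise hypothetical. Hash-based counting keeps the expected running time linear in sequence length.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous subsequence of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def shared_kmer_fraction(x, y, k=3, mapping=None):
    """Fraction of k-mers shared by x and y, in the range [0, 1].

    If `mapping` is given (residue -> group letter), both sequences are
    first rewritten in that compressed alphabet; unmapped residues are
    left unchanged."""
    if mapping is not None:
        x = "".join(mapping.get(c, c) for c in x)
        y = "".join(mapping.get(c, c) for c in y)
    cx, cy = kmer_counts(x, k), kmer_counts(y, k)
    shared = sum((cx & cy).values())  # & keeps the minimum count per k-mer
    denom = min(len(x), len(y)) - k + 1
    return shared / denom if denom > 0 else 0.0


# Hypothetical compression: treat I and V as equivalent to L.
ilv = {"I": "L", "V": "L"}
print(shared_kmer_fraction("LIV", "VLI", k=2))
print(shared_kmer_fraction("LIV", "VLI", k=2, mapping=ilv))
```

A distance can be derived as one minus this fraction; the compressed alphabet raises the number of matching k-mers between diverged sequences, at the cost of more chance matches.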
14.
Correlated mutation analysis has a long history of interesting applications, mostly in the detection of contact pairs in protein structures. Based on previous observations that, if properly assessed, amino acid correlation data can also provide insights about functional sub-classes in a protein family, we provide a complete framework devoted to this purpose. An amino acid specific correlation measure is proposed, which can be used to build networks summarizing all correlation and anti-correlation patterns in a protein family. These networks can be submitted to community structure detection algorithms, resulting in subsets of correlated amino acids which can be further assessed by specific parameters and procedures that provide insight into the relationship between different communities, the individual importance of community members and the adherence of a given amino acid sequence to a given community. By applying this framework to three protein families with contrasting characteristics (the Fe/Mn-superoxide dismutases, the peroxidase-catalase family and the C-type lysozyme/α-lactalbumin family), we show how our method and the proposed parameters and procedures are related to biological characteristics observed in these protein families, highlighting their potential use in protein characterization and gene annotation.
15.
Ye S, Köhrer C, Huber T, Kazmi M, Sachdev P, Yan EC, Bhagat A, RajBhandary UL, Sakmar TP 《The Journal of biological chemistry》2008,283(3):1525-1533
G protein-coupled receptors (GPCRs) are ubiquitous heptahelical transmembrane proteins involved in a wide variety of signaling pathways. The work described here on application of unnatural amino acid mutagenesis to two GPCRs, the chemokine receptor CCR5 (a major co-receptor for the human immunodeficiency virus) and rhodopsin (the visual photoreceptor), adds a new dimension to studies of GPCRs. We incorporated the unnatural amino acids p-acetyl-L-phenylalanine (Acp) and p-benzoyl-L-phenylalanine (Bzp) into CCR5 at high efficiency in mammalian cells to produce functional receptors harboring reactive keto groups at three specific positions. We obtained functional mutant CCR5, at levels up to approximately 50% of wild type as judged by immunoblotting, cell surface expression, and ligand-dependent calcium flux. Rhodopsin containing Acp at three different sites was also purified in high yield (0.5-2 µg/10^7 cells) and reacted with fluorescein hydrazide in vitro to produce fluorescently labeled rhodopsin. The incorporation of reactive keto groups such as Acp or Bzp into GPCRs allows their reaction with different reagents to introduce a variety of spectroscopic and other probes. Bzp also provides the possibility of photo-cross-linking to identify precise sites of protein-protein interactions, including GPCR binding to G proteins and arrestins, and for understanding the molecular basis of ligand recognition by chemokine receptors.
16.
Nucleic acid polymers selected from random sequence space constitute an enormous array of catalytic, diagnostic and therapeutic molecules. Despite the fact that proteins are robust polymers with far greater chemical and physical diversity, success in unlocking protein sequence space remains elusive. We have devised a combinatorial strategy for accessing nucleic acid sequence space corresponding to proteins comprising selected amino acid alphabets. Using the SynthOMIC approach (synthesis of ORFs by multimerizing in-frame codons), representative libraries comprising four amino acid alphabets were fused in-frame to the lambda repressor DNA-binding domain to provide an in vivo selection for self-interacting proteins that re-constitute lambda repressor function. The frequency of self-interactors as a function of amino acid composition ranged over five orders of magnitude, from ∼6% of clones in a library comprising the amino acid residues LARE to ∼0.6 in 10^6 in the MASH library. Sequence motifs were evident by inspection in many cases, and individual clones from each library presented substantial sequence identity with translated proteins by BLAST analysis. We posit that the SynthOMIC approach represents a powerful strategy for creating combinatorial libraries of open reading frames that distils protein sequence space on the basis of three inherent properties: it supports the use of selected amino acid alphabets, eliminates redundant sequences and locally constrains amino acids.
17.
18.
Turutina VP, Laskin AA, Kudryashov NA, Skryabin KG, Korotkov EV 《Biochemistry. Biokhimiia》2006,71(1):18-31
For detection of the latent periodicity of the protein families responsible for various biological functions, methods of information decomposition, cyclic profile alignment, and the method of noise decomposition have been used. The latent periodicity, being specific to a particular family, is recognized in 94 of 110 analyzed protein families. Family-specific periodicity was found for more than 70% of amino acid sequences in each of these families. Based on such sequences, the characteristic profile of the latent periodicity has been deduced for each family. The possible relationship between the recognized latent periodicity, evolution of proteins, and their structural organization is discussed.
19.
Vera P Turutina, Andrew A Laskin, Nikolay A Kudryashov, Konstantin G Skryabin, Eugene V Korotkov 《Journal of computational biology》2006,13(4):946-964
Here, we have applied information decomposition, cyclic profile alignment, and noise decomposition techniques to search for latent repeats within protein families of various functions. We have identified 94 protein families with a family-specific periodicity. In each case, the periodic element was found in greater than 70% of family members. Latent periodicity profiles with specific length and signature were obtained in each case. The possible relationship between the periodic elements thus identified and the evolutionary development of the protein families is discussed, with specific reference to the possibility that there is a correlation between the periodic elements and protein function.
20.
Various sequence-motif and sequence-cluster databases have been integrated into a new resource known as InterPro. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments complement each other for grouping protein families and delineating domains. InterPro and new developments in the analysis of both the phylogenetic profiles of protein families and domain fusion events improve the prediction of specific functions for numerous proteins.