Similar articles
 20 similar articles found (search time: 31 ms)
1.
The knowledge collated from known protein structures has revealed that proteins usually fold into four structural classes: all-α, all-β, α/β and α + β. A number of methods have been proposed to predict a protein's structural class from its primary structure; however, these methods have been observed to fail or perform poorly on distantly related sequences. In this paper, we propose a new method for protein structural class prediction using a dataset of low-homology (twilight-zone) protein sequences. Since protein structural class prediction is a typical classification problem, we have developed a Support Vector Machine (SVM)-based method that uses features derived from the predicted secondary structure and predicted burial information of amino acid residues. Examination of individual features as well as feature combinations revealed that the combination of secondary structural content with the secondary structural and solvent accessibility state frequencies of amino acids gave the best leave-one-out cross-validation accuracy of ~81%, which is comparable to the best accuracy reported in the literature so far.
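As a rough illustration of the simplest feature mentioned above, secondary structural content can be computed as the fraction of residues in each predicted state. The sketch below assumes a three-state string (H = helix, E = strand, C = coil) produced by some external predictor; the function name and the example string are hypothetical and not taken from the paper.

```python
from collections import Counter


def secondary_structure_content(ss: str) -> dict:
    """Return the fraction of residues in each predicted state (H, E, C)."""
    if not ss:
        raise ValueError("empty secondary structure string")
    counts = Counter(ss)
    # States absent from the prediction get a fraction of 0.0.
    return {state: counts.get(state, 0) / len(ss) for state in "HEC"}


# Hypothetical predicted secondary structure for a 16-residue fragment.
content = secondary_structure_content("CCHHHHHHCCEEEECC")
print(content)
```

A vector of such fractions could then be passed, together with other features, to any SVM implementation; the feature set actually used in the paper is richer than this.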

2.
Guo Y  Yu L  Wen Z  Li M 《Nucleic acids research》2008,36(9):3025-3030
Compared to the available protein sequences of different organisms, the number of revealed protein-protein interactions (PPIs) is still very limited. Many computational methods have therefore been developed to facilitate the identification of novel PPIs. However, methods that use only protein sequence information are more universal than those that depend on additional information or predictions about the proteins. In this article, a sequence-based method is proposed by combining a new feature representation using auto covariance (AC) and support vector machine (SVM). AC accounts for the interactions between residues a certain distance apart in the sequence, so this method adequately takes the neighbouring effect into account. When performed on the PPI data of yeast Saccharomyces cerevisiae, the method achieved a very promising prediction result. An independent data set of 11,474 yeast PPIs was used to evaluate this prediction model and the prediction accuracy was 88.09%. The performance of this method is superior to those of the existing sequence-based methods, so it can be a useful supplementary tool for future proteomics studies. The prediction software and all data sets used in this article are freely available at http://www.scucic.cn/Predict_PPI/index.htm.
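The auto covariance idea can be sketched as follows: each residue is replaced by a numeric descriptor, and for each lag the covariance between positions that far apart is averaged along the sequence. This is a minimal sketch under stated assumptions: it uses the Kyte-Doolittle hydropathy scale as a single illustrative descriptor, whereas the abstract does not say which descriptors, normalisation or maximum lag the authors used.

```python
# Kyte-Doolittle hydropathy values, used here only as an illustrative descriptor.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def auto_covariance(seq: str, scale: dict, max_lag: int) -> list:
    """Return AC values for lags 1..max_lag of one descriptor along seq."""
    values = [scale[aa] for aa in seq]
    n = len(values)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    mean = sum(values) / n
    centered = [v - mean for v in values]
    # For each lag, average the product of centered values that are lag apart.
    return [
        sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n - lag)
        for lag in range(1, max_lag + 1)
    ]


# An alternating hydrophobic/charged toy sequence gives alternating signs.
print(auto_covariance("IRIR", KD, 2))
```

The result is a fixed-length vector regardless of sequence length, which is what makes this kind of encoding convenient as SVM input for a pair of proteins.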

3.
Mishra P  Pandey PN 《Bioinformation》2011,6(10):372-374
The number of amino acid sequences is increasing very rapidly in protein databases such as Swiss-Prot, Uniprot, PIR and others, but the structures of only some of these sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks grouping of amino acid sequences into subsets based on distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
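One simple reading of "clusters correspond to connected subgraphs" is to link two sequences whenever their similarity score passes a threshold and then take connected components. The sketch below does exactly that with a union-find structure; the threshold, the score dictionary and the protein names are hypothetical, and the published method may apply stricter criteria than plain connectivity.

```python
def cluster_by_similarity(names, similarity, threshold):
    """Group names into connected components of the thresholded similarity graph."""
    parent = {name: name for name in names}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in similarity.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical pairwise similarity scores between five sequences.
scores = {("p1", "p2"): 0.9, ("p2", "p3"): 0.8, ("p3", "p4"): 0.2, ("p4", "p5"): 0.7}
clusters = cluster_by_similarity(["p1", "p2", "p3", "p4", "p5"], scores, 0.5)
print(clusters)
```

Plain connectivity behaves like single-linkage clustering, so it favours separation over homogeneity: p1 and p3 end up together even though they were never compared directly.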

4.
5.
SUMMARY: The classification of protein sequences obtained from patients with various immunoglobulin-related conformational diseases may provide insight into structural correlates of pathogenicity. However, clinical data are very sparse and, in the case of antibody-related proteins, the collected sequences have large variability with only a small subset of variations relevant to the protein pathogenicity (function). On this basis, these sequences represent a model system for development of strategies to recognize the small subset of function-determining variations among the much larger number of primary structure diversifications introduced during evolution. Under such conditions, most protein classification algorithms have limited accuracy. To address this problem, we propose a support vector machine (SVM)-based classifier that combines sequence and 3D structural averaging information. Each amino acid in the sequence is represented by a set of six physicochemical properties: hydrophobicity, hydrophilicity, volume, surface area, bulkiness and refractivity. Each position in the sequence is described by the properties of the amino acid at that position and the properties of its neighbors in 3D space or in the sequence. A structure template is selected to determine neighbors in 3D space and a window size is used to determine the neighbors in the sequence. The test data consist of 209 proteins of human antibody immunoglobulin light chains, each represented by aligned sequences of 120 amino acids. The methodology is applied to the classification of protein sequences collected from patients with and without amyloidosis, and indicates that the proposed modified classifiers are more robust to sequence variability than standard SVM classifiers, improving classification error by between 5 and 25% and sensitivity by between 9 and 17%. The classification results might also suggest possible mechanisms for the propensity of immunoglobulin light chains to amyloid formation.
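The sequence-window half of the neighbor description can be illustrated by averaging one property over each residue and its sequence neighbors. This is only a sketch: the abstract does not specify how the properties of neighbors are combined, so the unweighted mean, the `half_window` parameter and the input values below are assumptions, and neighbors in 3D space (taken from a structure template) are not covered.

```python
def window_average(values, half_window):
    """Average each position with up to half_window neighbors on either side."""
    n = len(values)
    averaged = []
    for i in range(n):
        # Clip the window at the sequence ends instead of padding.
        start, stop = max(0, i - half_window), min(n, i + half_window + 1)
        averaged.append(sum(values[start:stop]) / (stop - start))
    return averaged


# Hypothetical per-residue property values (e.g. one of the six properties).
smoothed = window_average([1.0, 2.0, 3.0, 4.0], half_window=1)
print(smoothed)
```

Applying the same function to each of the six properties would give one averaged profile per property for every aligned position.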

6.
TESE is a web server for the generation of test sets of protein sequences and structures fulfilling a number of different criteria. At least three different use cases can be envisaged: (i) benchmarking of novel methods; (ii) test sets tailored for special needs and (iii) extending available datasets. The CATH structure classification is used to control structural/sequence redundancy and a variety of structural quality parameters can be used to interactively select protein subsets with specific characteristics, e.g. all X-ray structures of alpha-helical repeat proteins with more than 120 residues and resolution <2.0 Å. The output includes FASTA-formatted sequences, PDB files and a clickable HTML index file containing images of the selected proteins. Multiple subsets for cross-validation are also supported. AVAILABILITY: The TESE server is available for non-commercial use at URL: http://protein.bio.unipd.it/tese/.

7.
The importance of unstructured biology has quickly grown during the last decades accompanying the explosion of the number of experimentally resolved protein structures. The idea that structural disorder might be a novel mechanism of protein interaction is widespread in the literature, although the number of statistically significant structural studies supporting this idea is surprisingly low. At variance with previous works, our conclusions rely exclusively on a large-scale analysis of all the 134337 X-ray crystallographic structures of the Protein Data Bank averaged over clusters of almost identical protein sequences. In this work, we explore the complexity of the organisation of all the interaction interfaces observed when a protein lies in alternative complexes, showing that interfaces progressively add up in a hierarchical way, which is reflected in a logarithmic law for the size of the union of the interface regions on the number of distinct interfaces. We further investigate the connection of this complexity with different measures of structural disorder: the standard missing residues and a new definition, called “soft disorder”, that covers all the flexible and structurally amorphous residues of a protein. We show evidence that both the interaction interfaces and the soft disordered regions tend to involve roughly the same amino acids of the protein, and preliminary results suggesting that soft disorder spots those surface regions where new interfaces are progressively accommodated by complex formation. In fact, our results suggest that structurally disordered regions not only carry crucial information about the location of alternative interfaces within complexes, but also about the order of the assembly. We verify these hypotheses in several examples, such as the DNA binding domains of P53 and P73, the C3 exoenzyme, and two known biological orders of assembly.
We finally compare our measures of structural disorder with several disorder bioinformatics predictors, showing that the latter are optimised to predict the residues that are missing in all the alternative structures of a protein and are not able to catch the progressive evolution of the disordered regions upon complex formation. Yet, the predicted residues, when not missing, tend to be characterised as soft disordered regions.

8.
A Monte Carlo simulation-based sequence design method is proposed to investigate the role of site-directed point mutations in protein misfolding. Site-directed point mutations are incorporated in the designed sequences of selected proteins. While most mutated sequences correctly fold to their native conformation, some of them stabilize in other nonnative conformations and thus misfold/unfold. The results suggest that a critical number of hydrophobic amino acid residues must be present in the core of the correctly folded proteins, whereas proteins misfold/unfold if this number of hydrophobic residues falls below the critical limit. A protein can accommodate only a particular number of hydrophobic residues at the surface, provided a large number of hydrophilic residues are present at the surface and critical hydrophobicity of the core is preserved. Some surface sites are observed to be equally sensitive toward site-directed point mutations as the core sites. Point mutations with highly polar and charged amino acids increase the misfold/unfold propensity of proteins. Substitution of natural amino acids at sites with different numbers of nonbonded contacts suggests that both amino acid identity and its respective site-specificity determine the stability of a protein. A clash-match method is developed to calculate the number of matching and clashing interactions in the mutated protein sequences. While misfolded/unfolded sequences have a higher number of clashing and a lower number of matching interactions, the correctly folded sequences have a lower number of clashing and a higher number of matching interactions. These results are valid for different SCOP classes of proteins.

9.
10.
MOTIVATION: The rapid increase in the amount of protein sequence data has created a need for an automated identification of evolutionarily related subgroups from large datasets. The existing methods typically require a priori specification of the number of putative groups, which defines the resolution of the classification solution. RESULTS: We introduce a Bayesian model-based approach to simultaneous identification of evolutionary groups and conserved parts of the protein sequences. The model-based approach provides an intuitive and efficient way of determining the number of groups from the sequence data, in contrast to the ad hoc methods often exploited for similar purposes. Our model recognizes the areas in the sequences that are relevant for the clustering and regards other areas as noise. We have implemented the method using a fast stochastic optimization algorithm which yields a clustering associated with the estimated maximum posterior probability. The method has been shown to have high specificity and sensitivity in simulated and real clustering tasks. With real datasets the method also highlights the residues close to the active site. AVAILABILITY: Software 'kPax' is available at http://www.rni.helsinki.fi/jic/softa.html

11.
To classify proteins into functional families based on their primary sequences, popular algorithms such as the k-NN-, HMM-, and SVM-based algorithms are often used. For many of these algorithms to perform their tasks, protein sequences need to be properly aligned first. Since the alignment process can be error-prone, protein classification may not be performed very accurately. To improve classification accuracy, we propose an algorithm, called the Unaligned Protein SEquence Classifier (UPSEC), which can perform its tasks without sequence alignment. UPSEC makes use of a probabilistic measure to identify residues that are useful for classification in both positive and negative training samples, and can handle multi-class classification with a single classifier and a single pass through the training data. UPSEC has been tested with real protein data sets. Experimental results show that UPSEC can effectively classify unaligned protein sequences into their corresponding functional families, and the patterns it discovers during the training process can be biologically meaningful.

12.
Locating sequences compatible with a protein structural fold is the well‐known inverse protein‐folding problem. While significant progress has been made, the success rate of protein design remains low. As a result, a library of designed sequences or profile of sequences is currently employed for guiding experimental screening or directed evolution. Sequence profiles can be computationally predicted by iterative mutations of a random sequence to produce energy‐optimized sequences, or by combining sequences of structurally similar fragments in a template library. The latter approach is computationally more efficient but yields less accurate profiles than the former because it lacks tertiary structural information. Here we present a method called SPIN that predicts Sequence Profiles by Integrated Neural network based on fragment‐derived sequence profiles and structure‐derived energy profiles. SPIN improves over the fragment‐derived profile by 6.7% (from 23.6 to 30.3%) in sequence identity between predicted and wild‐type sequences. The method also reduces the number of residues in low‐complexity regions by 15.7% and has a significantly better balance of hydrophilic and hydrophobic residues at the protein surface. The accuracy of sequence profiles obtained is comparable to those generated from the protein design program RosettaDesign 3.5. This highly efficient method for predicting sequence profiles from structures will be useful as a single‐body scoring term for improving scoring functions used in protein design and fold recognition. It also complements protein design programs in guiding experimental design of the sequence library for screening and directed evolution of designed sequences. The SPIN server is available at http://sparks-lab.org. Proteins 2014; 82:2565–2573. © 2014 Wiley Periodicals, Inc.

13.
14.
Coordinated amino acid changes in homologous protein families   Cited by: 4 (self-citations: 0, citations by others: 4)
In the tobamovirus coat protein family, amino acid residues at some spatially close positions are found to be substituted in a coordinated manner [Altschuh et al. (1987) J. Mol. Biol., 193, 693]. Therefore, these positions show an identical pattern of amino acid substitutions when amino acid sequences of these homologous proteins are aligned. Based on this principle, coordinated substitutions have been searched for in three additional protein families: serine proteases, cysteine proteases and the haemoglobins. Coordinated changes have been found in all three protein families mostly within structurally constrained regions. This method works with a varying degree of success depending on the function of the proteins, the range of sequence similarities and the number of sequences considered. By relaxing the criteria for residue selection, the method was adapted to cover a broader range of protein families and to study regions of the proteins having weaker structural constraints. The information derived by these methods provides a general guide for engineering of a large variety of proteins to analyse structure-function relationships.
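The principle of "identical patterns of substitution" can be sketched by reducing each alignment column to the way it partitions the sequences and then grouping variable columns that share the same partition. This is a simplified illustration, not the published procedure: the toy alignment and function names are hypothetical, gaps are treated like any other character, and no check of spatial proximity is made.

```python
from collections import defaultdict


def substitution_pattern(column):
    """Relabel residues by order of first appearance, e.g. AAGG -> (0, 0, 1, 1)."""
    labels = {}
    return tuple(labels.setdefault(residue, len(labels)) for residue in column)


def coordinated_columns(alignment):
    """Return groups of variable column indices that share a substitution pattern."""
    groups = defaultdict(list)
    for index, column in enumerate(zip(*alignment)):
        pattern = substitution_pattern(column)
        if len(set(pattern)) > 1:  # skip fully conserved columns
            groups[pattern].append(index)
    return [columns for columns in groups.values() if len(columns) > 1]


# Hypothetical alignment of four homologous sequences (rows) over five positions.
alignment = ["ACDKF", "ACDKF", "GCERF", "GCERY"]
print(coordinated_columns(alignment))
```

Requiring an exact match corresponds to the strict criterion; the relaxed criteria mentioned in the abstract would tolerate some mismatching sequences between columns.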

15.
Nishi H  Koike R  Ota M 《Proteins》2011,79(8):2372-2379
We investigated fragmental sequences that were inserted into proteins during long molecular evolution and relevant to the association of homo-oligomers. Seventeen insertions in 12 SCOP (Structural Classification of Proteins) families were examined and were classified into large and small insertions. The large insertions are composed of interface-like residues and effectively increase the interface area. In contrast, small insertions are composed of the residues that are not commonly found at the interfaces and have a small interface area: their roles in the oligomerization process are unclear. We found that the small insertions were located in the middle of protein sequences and therefore must involve residues with strong turn and less interface-like propensities. From a structural viewpoint, small insertions were found to mask hydrophobic patches or act as spacers to fill cavities present at interfaces. The presence or absence of small insertions coincides with the annotated oligomeric states for homologs in the SwissProt database, and the calculation of the association scores predicts that small insertions contribute to the stability of oligomers. These results support the significant role of small, nonhydrophobic insertions in protein oligomerization.

16.
The importance of intrinsic disorder for protein phosphorylation   Cited by: 2 (self-citations: 0, citations by others: 2)
Reversible protein phosphorylation provides a major regulatory mechanism in eukaryotic cells. Due to the high variability of amino acid residues flanking a relatively limited number of experimentally identified phosphorylation sites, reliable prediction of such sites still remains an important issue. Here we report the development of a new web-based tool for the prediction of protein phosphorylation sites, DISPHOS (DISorder-enhanced PHOSphorylation predictor, http://www.ist.temple.edu/DISPHOS). We observed that amino acid compositions, sequence complexity, hydrophobicity, charge and other sequence attributes of regions adjacent to phosphorylation sites are very similar to those of intrinsically disordered protein regions. Thus, DISPHOS uses position-specific amino acid frequencies and disorder information to improve the discrimination between phosphorylation and non-phosphorylation sites. Based on the estimates of phosphorylation rates in various protein categories, the outputs of DISPHOS are adjusted in order to reduce the total number of misclassified residues. When tested on an equal number of phosphorylated and non-phosphorylated residues, the accuracy of DISPHOS reaches 76% for serine, 81% for threonine and 83% for tyrosine. The significant enrichment in disorder-promoting residues surrounding phosphorylation sites, together with the results obtained by applying DISPHOS to various protein functional classes and proteomes, provides strong support for the hypothesis that protein phosphorylation predominantly occurs within intrinsically disordered protein regions.

17.
Several recent studies indicate that a single polypeptide may act as the beta-subunit of prolyl 4-hydroxylase, the enzyme protein disulphide-isomerase and a cellular thyroid-hormone-binding protein. We report here the isolation and characterization of cDNA clones encoding this multifunctional protein in the chicken. All the coding sequences were determined on the basis of nucleotide sequencing of five cDNA clones and amino acid sequencing of the N-terminal end of the chicken beta-subunit. The processed polypeptide contains 493 amino acid residues, the size of the respective mRNA being about 2.7 kb. The chicken beta-subunit cDNA sequences were 78% homologous to the previously reported human beta-subunit cDNA sequences at the nucleotide level and 85% homologous at the amino acid level. The homology of the chicken beta-subunit sequences to those reported for bovine thyroid-hormone-binding protein and rat protein disulphide-isomerase was also 85% at the amino acid level. Primary-structure comparisons between the four species indicated that the two proposed active sites of protein disulphide-isomerase, the two Trp-Cys-Gly-His-Cys-Lys sequences, are located within highly conserved regions, which are also homologous to the active sites of a number of thioredoxins. The middle of the polypeptide has an additional conserved region 100 amino acid residues in length in which the degree of homology between the four species is 94% at the amino acid level. This long conserved region may also be important for some of the multiple functions of the protein. The four extreme C-terminal amino acids of the polypeptide in all four species are Lys-Asp-Glu-Leu, a sequence that has been suggested to function as a signal for the retention of a protein in the endoplasmic reticulum.

18.
19.
Hybrid system for protein secondary structure prediction.   Cited by: 13 (self-citations: 0, citations by others: 13)
We have developed a hybrid system to predict the secondary structures (alpha-helix, beta-sheet and coil) of proteins and achieved 66.4% accuracy, with correlation coefficients of C(coil) = 0.429, C alpha = 0.470 and C beta = 0.387. This system contains three subsystems ("experts"): a neural network module, a statistical module and a memory-based reasoning module. First, the three experts independently learn the mapping between amino acid sequences and secondary structures from the known protein structures, then a Combiner learns to combine automatically the outputs of the experts to make final predictions. The hybrid system was tested with 107 protein structures through k-way cross-validation. Its performance was better than each expert and all previously reported methods with greater than 0.99 statistical significance. It was observed that for 20% of the residues, all three experts produced the same but wrong predictions. This may suggest an upper bound on the accuracy of secondary structure predictions based on local information from the currently available protein structures, and indicate places where non-local interactions may play a dominant role in conformation. For 64% of the residues, at least two experts agreed and were correct, which shows that the Combiner performed better than majority vote. For 77% of the residues, at least one expert was correct, thus there may still be room for improvement in this hybrid approach. Rigorous evaluation procedures were used in testing the hybrid system, and statistical significance measures were developed in analyzing the differences among different methods. When measured in terms of the number of secondary structures (rather than the number of residues) that were predicted correctly, the prediction produced by the hybrid system was also better than those of individual experts.
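The majority-vote baseline that the Combiner is compared against can be sketched as a per-residue vote over the experts' predictions. This is an illustrative baseline only, not the learned Combiner: the three prediction strings are hypothetical, and falling back to the first expert when no state wins a strict majority is an assumption made here.

```python
from collections import Counter


def majority_vote(predictions, fallback_index=0):
    """Combine per-residue state strings (one per expert) by strict majority."""
    consensus = []
    for states in zip(*predictions):
        state, count = Counter(states).most_common(1)[0]
        # Without a strict majority, defer to the designated fallback expert.
        consensus.append(state if count > len(states) / 2 else states[fallback_index])
    return "".join(consensus)


# Hypothetical outputs of three experts for a six-residue fragment (H, E, C states).
experts = ["HHHECC", "HHEECC", "HCEEHC"]
print(majority_vote(experts))
```

A learned combiner can outperform this rule because it can weight experts differently depending on which of them tends to be right in a given context.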

20.
Approaching a complete classification of protein secondary structure   Cited by: 2 (self-citations: 0, citations by others: 2)
A complete classification of types of protein secondary structure is developed on the basis of computer analysis of the crystallographic structural data deposited in the Protein Data Bank. The majority of amino acid residues fall into five conformation types. A conclusion is drawn that the number of sequence variants of torsion angles phi, psi in globular proteins is limited and is essentially less than the number of possible amino acid sequences for this chain length. Along with alpha-helix and beta-structure, the distribution analysis, which assigns every maximum of the distribution of amino acid conformations on the Ramachandran map to a certain type of secondary structure, exposed a third type of secondary structure that was previously neglected. This type of structure is an extended left-handed helical conformation, designated as mobile (M-) conformation. A full set of M-conformation fragments that seems to play a major role in protein globule dynamics has been obtained; a small radius of correlation for the polypeptide chain in M-conformation is demonstrated. This explains the prevalence of short segments of mobile conformation revealed in globular proteins. For secondary structure types, the frequency of occurrence of amino acid residues has been computed.
