首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 703 毫秒
1.
We have recently described a method based on artificial neural networks to cluster protein sequences into families. The network was trained with Kohonen''s unsupervised learning algorithm using, as inputs, the matrix patterns derived from the dipeptide composition of the proteins. We present here a large-scale application of that method to classify the 1,758 human protein sequences stored in the SwissProt database (release 19.0), whose lengths are greater than 50 amino acids. In the final 2-dimensional topologically ordered map of 15 x 15 neurons, proteins belonging to known families were associated with the same neuron or with neighboring ones. Also, as an attempt to reduce the time-consuming learning procedure, we compared 2 learning protocols: one of 500 epochs (100 SUN CPU-hours [CPU-h]), and another one of 30 epochs (6.7 CPU-h). A further reduction of learning-computing time, by a factor of about 3.3, with similar protein clustering results, was achieved using a matrix of 11 x 11 components to represent the sequences. Although network training is time consuming, the classification of a new protein in the final ordered map is very fast (14.6 CPU-seconds). We also show a comparison between the artificial neural network approach and conventional methods of biosequence analysis.  相似文献   

2.
多序列比对是生物信息学中基础而又重要的序列分析方法.本文提出一种新的多序列比对算法,该算法综合了渐进比对方法和迭代策略,采用加权函数以调整序列的有偏分布,用neighbor-joining方法构建指导树以确定渐进比对的顺序.通过对BAlibASE中142组蛋白质序列比对的测试,验证了本算法的有效性.与Multalin算法比较的结果表明,本算法能有效地提高分歧较大序列的比对准确率.  相似文献   

3.
Prediction of amino acid sequence from structure   总被引:2,自引:0,他引:2       下载免费PDF全文
We have developed a method for the prediction of an amino acid sequence that is compatible with a three-dimensional backbone structure. Using only a backbone structure of a protein as input, the algorithm is capable of designing sequences that closely resemble natural members of the protein family to which the template structure belongs. In general, the predicted sequences are shown to have multiple sequence profile scores that are dramatically higher than those of random sequences, and sometimes better than some of the natural sequences that make up the superfamily. As anticipated, highly conserved but poorly predicted residues are often those that contribute to the functional rather than structural properties of the protein. Overall, our analysis suggests that statistical profile scores of designed sequences are a novel and valuable figure of merit for assessing and improving protein design algorithms.  相似文献   

4.
The rational design of peptide and protein helices is not only of practical importance for protein engineering but also is a useful approach in attempts to improve our understanding of protein folding. Recent modifications of theoretical models of helix‐coil transitions allow accurate predictions of the helix stability of monomeric peptides in water and provide new possibilities for protein design. We report here a new method for the design of α‐helices in peptides and proteins using AGADIR, the statistical mechanical theory for helix‐coil transitions in monomeric peptides and the tunneling algorithm of global optimization of multidimensional functions for optimization of amino acid sequences. CD measurements of helical content of peptides with optimized sequences indicate that the helical potential of protein amino acids is high enough to allow formation of stable α‐helices in peptides as short as of 10 residues in length. The results show the maximum achievable helix content (HC) of short peptides with fully optimized sequences at 5 °C is expected to be ~70–75%. Under certain conditions the method can be a powerful practical tool for protein engineering. Unlike traditional approaches that are often used to increase protein stability by adding a few favorable interactions to the protein structure, this method deals with all possible sequences of protein helices and selects the best one from them. Copyright © 2009 European Peptide Society and John Wiley & Sons, Ltd.  相似文献   

5.
Paul Mach  Patrice Koehl 《Proteins》2013,81(9):1556-1570
It is well known that protein fold recognition can be greatly improved if models for the underlying evolution history of the folds are taken into account. The improvement, however, exists only if such evolutionary information is available. To circumvent this limitation for protein families that only have a small number of representatives in current sequence databases, we follow an alternate approach in which the benefits of including evolutionary information can be recreated by using sequences generated by computational protein design algorithms. We explore this strategy on a large database of protein templates with 1747 members from different protein families. An automated method is used to design sequences for these templates. We use the backbones from the experimental structures as fixed templates, thread sequences on these backbones using a self‐consistent mean field approach, and score the fitness of the corresponding models using a semi‐empirical physical potential. Sequences designed for one template are translated into a hidden Markov model‐based profile. We describe the implementation of this method, the optimization of its parameters, and its performance. When the native sequences of the protein templates were tested against the library of these profiles, the class, fold, and family memberships of a large majority (>90%) of these sequences were correctly recognized for an E‐value threshold of 1. In contrast, when homologous sequences were tested against the same library, a much smaller fraction (35%) of sequences were recognized; The structural classification of protein families corresponding to these sequences, however, are correctly recognized (with an accuracy of >88%). Proteins 2013; © 2013 Wiley Periodicals, Inc.  相似文献   

6.
We describe a new computational technique to predict conformationally switching elements in proteins from their amino acid sequences. The method, called ASP (Ambivalent Structure Predictor), analyzes results from a secondary structure prediction algorithm to identify regions of conformational ambivalence. ASP identifies ambivalent regions in 16 test protein sequences for which function involves substantial backbone rearrangements. In the test set, all sites previously described as conformational switches are correctly predicted to be structurally ambivalent regions. No such regions are predicted in three negative control protein sequences. ASP may be useful as a guide for experimental studies on protein function and motion in the absence of detailed three-dimensional structural data.  相似文献   

7.
High divergence in protein sequences makes the detection of distant protein relationships through homology-based approaches challenging. Grouping protein sequences into families, through similarities in either sequence or 3-D structure, facilitates in the improved recognition of protein relationships. In addition, strategically designed protein-like sequences have been shown to bridge distant structural domain families by serving as artificial linkers. In this study, we have augmented a search database of known protein domain families with such designed sequences, with the intention of providing functional clues to domain families of unknown structure. When assessed using representative query sequences from each family, we obtain a success rate of 94% in protein domain families of known structure. Further, we demonstrate that the augmented search space enabled fold recognition for 582 families with no structural information available a priori. Additionally, we were able to provide reliable functional relationships for 610 orphan families. We discuss the application of our method in predicting functional roles through select examples for DUF4922, DUF5131, and DUF5085. Our approach also detects new associations between families that were previously not known to be related, as demonstrated through new sub-groups of the RNA polymerase domain among three distinct RNA viruses. Taken together, designed sequences-augmented search databases direct the detection of meaningful relationships between distant protein families. In turn, they enable fold recognition and offer reliable pointers to potential functional sites that may be probed further through direct mutagenesis studies.  相似文献   

8.
The currently available body of decoded amino acid sequences of various proteins exceeds manifold the experimental capabilities of their functional annotation. Therefore, in silico annotation using bioinformatics methods becomes increasingly important. Such annotation is actually a prediction; however, this can be an important starting point for further laboratory research. This work describes a new method for predicting functionally important protein sites, SDPsite, on the basis of identification of specificity determinants. The algorithm proposed utilizes a protein family aglinment and a phylogenetic tree to predict the conserved positions and specificity determinants, map them onto the protein structure, and search for clusters of the predicted positions. Comparison of the resulting predictions with experimental data and published predictions of functional sites by other methods demonstrates that the results of SDPsite agree well with experimental data and exceed the results obtained with the majority of previous methods. SDPsite is publicly available at http://bioinf.fbb.msu.ru/SDPsite.  相似文献   

9.
Distant homologies between proteins are often discovered only after three-dimensional structures of both proteins are solved. The sequence divergence for such proteins can be so large that simple comparison of their sequences fails to identify any similarity. New generation of sensitive alignment tools use averaged sequences of entire homologous families (profiles) to detect such homologies. Several algorithms, including the newest generation of BLAST algorithms and BASIC, an algorithm used in our group to assign fold predictions for proteins from several genomes, are compared to each other on the large set of structurally similar proteins with little sequence similarity. Proteins in the benchmark are classified according to the level of their similarity, which allows us to demonstrate that most of the improvement of the new algorithms is achieved for proteins with strong functional similarities, with almost no progress in recognizing distant fold similarities. It is also shown that details of profile calculation strongly influence its sensitivity in recognizing distant homologies. The most important choice is how to include information from diverging members of the family, avoiding generating false predictions, while accounting for entire sequence divergence within a family. PSI-BLAST takes a conservative approach, deriving a profile from core members of the family, providing a solid improvement without almost any false predictions. BASIC strives for better sensitivity by increasing the weight of divergent family members and paying the price in lower reliability. A new FFAS algorithm introduced here uses a new procedure for profile generation that takes into account all the relations within the family and matches BASIC sensitivity with PSI-BLAST like reliability.  相似文献   

10.
Protein structure prediction by comparative modeling benefits greatly from the use of multiple sequence alignment information to improve the accuracy of structural template identification and the alignment of target sequences to structural templates. Unfortunately, this benefit is limited to those protein sequences for which at least several natural sequence homologues exist. We show here that the use of large diverse alignments of computationally designed protein sequences confers many of the same benefits as natural sequences in identifying structural templates for comparative modeling targets. A large-scale massively parallelized application of an all-atom protein design algorithm, including a simple model of peptide backbone flexibility, has allowed us to generate 500 diverse, non-native, high-quality sequences for each of 264 protein structures in our test set. PSI-BLAST searches using the sequence profiles generated from the designed sequences ("reverse" BLAST searches) give near-perfect accuracy in identifying true structural homologues of the parent structure, with 54% coverage. In 41 of 49 genomes scanned using reverse BLAST searches, at least one novel structural template (not found by the standard method of PSI-BLAST against PDB) is identified. Further improvements in coverage, through optimizing the scoring function used to design sequences and continued application to new protein structures beyond the test set, will allow this method to mature into a useful strategy for identifying distantly related structural templates.  相似文献   

11.
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.  相似文献   

12.
A secondary structure has been predicted for the heat shock protein HSP90 family from an aligned set of homologous protein sequences by using a transparent method in both manual and automated implementation that extracts conformational information from patterns of variation and conservation within the family. No statistically significant sequence similarity relates this family to any protein with known crystal structure. However, the secondary structure prediction, together with the assignment of active site positions and possible biochemical properties, suggest that the fold is similar to that seen in N-terminal domain of DNA gyrase B (the ATPase fragment). Proteins 27:450–458, 1997. © 1997 Wiley-Liss, Inc.  相似文献   

13.
The analysis and comparison of large numbers of immunoglobulin (Ig) sequences that arise during an antibody selection campaign can be time‐consuming and tedious. Typically, the identification and annotation of framework as well as complementarity‐determining regions (CDRs) is based on multiple sequence alignments using standardized numbering schemes, which allow identification of equivalent residues among different family members but often necessitate expert knowledge and manual intervention. Moreover, due to the enormous length variability of some CDRs the benefit of conventional Ig numbering schemes is limited and the calculation of correct sequence alignments can become challenging. Whereas, in principle, a well established set of rules permits the assignment of CDRs from the amino acid sequence alone, no currently available sequence alignment editor provides an algorithm to annotate new Ig sequences accordingly. Here we present a unique pattern matching method implemented into our recently developed ANTIC ALIgN editor that automatically identifies all hypervariable and framework regions in experimentally elucidated antibody sequences using so‐called “regular expressions.” By combination of this widely supported software syntax with the unique capabilities of real‐time aligning, editing and analyzing extended sets of amino acid and/or nucleotide sequences simultaneously on a local workstation, ANTIC ALIgN provides a powerful utility for antibody engineering. Proteins 2016; 85:65–71. © 2016 Wiley Periodicals, Inc.  相似文献   

14.
Alignment of protein sequences is a key step in most computational methods for prediction of protein function and homology-based modeling of three-dimensional (3D)-structure. We investigated correspondence between "gold standard" alignments of 3D protein structures and the sequence alignments produced by the Smith-Waterman algorithm, currently the most sensitive method for pair-wise alignment of sequences. The results of this analysis enabled development of a novel method to align a pair of protein sequences. The comparison of the Smith-Waterman and structure alignments focused on their inner structure and especially on the continuous ungapped alignment segments, "islands" between gaps. Approximately one third of the islands in the gold standard alignments have negative or low positive score, and their recognition is below the sensitivity limit of the Smith-Waterman algorithm. From the alignment accuracy perspective, the time spent by the algorithm while working in these unalignable regions is unnecessary. We considered features of the standard similarity scoring function responsible for this phenomenon and suggested an alternative hierarchical algorithm, which explicitly addresses high scoring regions. This algorithm is considerably faster than the Smith-Waterman algorithm, whereas resulting alignments are in average of the same quality with respect to the gold standard. This finding shows that the decrease of alignment accuracy is not necessarily a price for the computational efficiency.  相似文献   

15.
The sequences of several members of the myosin family of molecular motors are evaluated using ASP (Ambivalent Structure Predictor), a new computational method. ASP predicts structurally ambivalent sequence elements by analyzing the output from a secondary structure prediction algorithm. These ambivalent sequence elements form secondary structures that are hypothesized to function as switches by undergoing conformational rearrangement. For chicken skeletal muscle myosin, 13 discrete structurally ambivalent sequence elements are identified. All 13 are located in the heavy chain motor domain. When these sequence elements are mapped into the myosin tertiary structure, they form two compact regions that connect the actin binding site to the adenosine 5'-triphosphate (ATP) site, and the ATP site to the fulcrum site for the force-producing bending of the motor domain. These regions, predicted by the new algorithm to undergo conformational rearrangements, include the published known and putative switches of the myosin motor domain, and they form plausible allosteric connections between the three main functional sites of myosin. The sequences of several other members of the myosin I and II families are also analyzed.  相似文献   

16.
The calpain family of Ca2+‐dependent cysteine proteases plays a vital role in many important biological processes which is closely related with a variety of pathological states. Activated calpains selectively cleave relevant substrates at specific cleavage sites, yielding multiple fragments that can have different functions from the intact substrate protein. Until now, our knowledge about the calpain functions and their substrate cleavage mechanisms are limited because the experimental determination and validation on calpain binding are usually laborious and expensive. In this work, we aim to develop a new computational approach (LabCaS) for accurate prediction of the calpain substrate cleavage sites from amino acid sequences. To overcome the imbalance of negative and positive samples in the machine‐learning training which have been suffered by most of the former approaches when splitting sequences into short peptides, we designed a conditional random field algorithm that can label the potential cleavage sites directly from the entire sequences. By integrating the multiple amino acid features and those derived from sequences, LabCaS achieves an accurate recognition of the cleave sites for most calpain proteins. In a jackknife test on a set of 129 benchmark proteins, LabCaS generates an AUC score 0.862. The LabCaS program is freely available at: http://www.csbio.sjtu.edu.cn/bioinf/LabCaS . Proteins 2013. © 2012 Wiley Periodicals, Inc.  相似文献   

17.
An automated algorithm is presented that delineates protein sequence fragments which display similarity. The method incorporates a selection of a number of local nonoverlapping sequence alignments with the highest similarity scores and a graphtheoretical approach to elucidate the consistent start and end points of the fragments comprising one or more ensembles of related subsequences. The procedure allows the simultaneous identification of different types of repeats within one sequence. A multiple alignment of the resulting fragments is performed and a consensus sequence derived from the ensemble(s). Finally, a profile is constructed form the multiple alignment to detect possible and more distant members within the sequence. The method tolerates mutations in the repeats as well as insertions and deletions. The sequence spans between the various repeats or repeat clusters may be of different lengths. The technique has been applied to a number of proteins where the repeating fragments have been derived from information additional to the protein sequences. © 1993 Wiley-Liss, Inc.  相似文献   

18.
Kann MG  Goldstein RA 《Proteins》2002,48(2):367-376
A detailed analysis of the performance of hybrid, a new sequence alignment algorithm developed by Yu and coworkers that combines Smith Waterman local dynamic programming with a local version of the maximum-likelihood approach, was made to access the applicability of this algorithm to the detection of distant homologs by sequence comparison. We analyzed the statistics of hybrid with a set of nonhomologous protein sequences from the SCOP database and found that the statistics of the scores from hybrid algorithm follows an Extreme Value Distribution with lambda approximately 1, as previously shown by Yu et al. for the case of artificially generated sequences. Local dynamic programming was compared to the hybrid algorithm by using two different test data sets of distant homologs from the PFAM and COGs protein sequence databases. The studies were made with several score functions in current use including OPTIMA, a new score function originally developed to detect remote homologs with the Smith Waterman algorithm. We found OPTIMA to be the best score function for both both dynamic programming and the hybrid algorithms. The ability of dynamic programming to discriminate between homologs and nonhomologs in the two sets of distantly related sequences is slightly better than that of hybrid algorithm. The advantage of producing accurate score statistics with only a few simulations may overcome the small differences in performance and make this new algorithm suitable for detection of homologs in conjunction with a wide range of score functions and gap penalties.  相似文献   

19.
20.
A technique for prediction of protein membrane toplogy (intra- and extraceullular sidedness) has been developed. Membrane-spanning segments are first predicted using an algorithm based upon multiply aligned amino acid sequences. The compositional differences in the protein segments exposed at each side of the membrane are then investigated. The ratios are calculated for Asn, Asp, Gly, Phe, Pro, Trp, Tyr, and Val, mostly found on the extracellular side, and for Ala, Arg, Cys, and Lys, mostly occurring on the intracellular side. The consensus over these 12 residue distributions is used for sidedness prediction. The method was developed with a set of 42 protein families for which all but one were correctly predicted with the new algorithm. This represents an improvement over previous techniques. The new method, applied to a set of 12 membrane protein families different from the test set and with recently determined topologies, performed well, with 11 of 12 sidedness assignments agreeing with experimental results. The method has also been applied to several membrane protein families for which the topology has yet to be determined. An electronic prediction service is available at the E-mail address tmap@embl-heidelberg.de and on WWW via http://www.emblheidelberg.de.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号