Similar articles
Found 20 similar articles; search time: 546 ms
1.
Structural genomics projects aim to provide a sharp increase in the number of structures of functionally unannotated, and largely unstudied, proteins. Algorithms and tools capable of deriving information about the nature and location of functional sites within a structure are therefore increasingly useful. Here, a neural network is trained to identify the catalytic residues found in enzymes, based on an analysis of the structure and sequence. The neural network output and spatial clustering of the highly scoring residues are then used to predict the location of the active site. A comparison of the performance of differently trained neural networks is presented that shows how information from sequence and structure comes together to improve the prediction accuracy of the network. Spatial clustering of the network results provides a reliable way of finding likely active sites. In over 69% of the test cases the active site is correctly predicted, and a further 25% are partially correctly predicted. The failures are generally due to the poor quality of the automatically generated sequence alignments. We also present predictions identifying the active site and potential functional residues in five recently solved enzyme structures not used in developing the method. The method correctly identifies the putative active site in each case. In most cases the likely functional residues are identified correctly, as well as some potentially novel functional groups.

2.
A systematic optimization model for binding sequence selection in computational enzyme design was developed based on the transition state theory of enzyme catalysis and graph-theoretical modeling. The saddle point on the free energy surface of the reaction system was represented by catalytic geometrical constraints, and the binding energy between the active site and transition state was minimized to reduce the activation energy barrier. The resulting hyperscale combinatorial optimization problem was tackled using a novel heuristic global optimization algorithm, which was inspired by, and tested on, the protein core sequence selection problem. Sequence recapitulation tests on native active sites for two enzyme-catalyzed hydrolytic reactions were used to evaluate the predictive power of the design methodology. The results show that most of the native binding sites can be successfully identified if the catalytic geometrical constraints and the structural motifs of the substrate are taken into account. Reliably predicting active site sequences may have significant implications for the creation of novel enzymes that are capable of catalyzing targeted chemical reactions.

3.
MOTIVATION: Clustering of protein sequences is widely used for the functional characterization of proteins. However, it is still not easy to cluster distantly related proteins, which have only regional similarity among their sequences. It is therefore necessary to develop an algorithm for clustering such distantly related proteins. RESULTS: We have developed a time- and space-efficient clustering algorithm. It uses a graph representation whose vertices and edges denote proteins and their sequence similarities above a certain cutoff score, respectively. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find the appropriate partitions, we introduce a score combining the normalized cut and locally minimal cut capacities. Our method is applied to all 40,703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters show a 76% recall (20,529 proteins) of the 26,917 classified by InterPro. The method also finds relationships not found by other clustering methods. AVAILABILITY: The complete result of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, and other supplementary information, are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/

4.
BACKGROUND: Identification of protein structural cores requires isolating sets of proteins that all share the same subset of structural motifs. Given the ever-growing number of available 3D protein structures, standard automatic clustering algorithms require adaptation to allow efficient identification of such sets of proteins. RESULTS: A pair of 3D structures is declared similar or not according to the local similarities of their matching substructures in a structural alignment. This binary relation can be represented as a graph of similarities in which a node represents a 3D protein structure and an edge states that two 3D protein structures are similar. The classification of proteins into structural families can therefore be viewed as a graph clustering task. Unfortunately, because such a graph encodes only pairwise similarity information, clustering algorithms may place in the same cluster a subset of 3D structures that do not share a common substructure. To overcome this drawback, we first define a ternary similarity on a triple of 3D structures as a constraint to be satisfied by the graph of similarities. Such a ternary constraint takes into account similarities between pairwise alignments, so as to ensure that the three protein structures involved do have some common substructure. We propose a modification algorithm that eliminates edges from the original graph of similarities and outputs a reduced graph in which no ternary constraint is violated. Our approach is first to build a graph of similarities, then to reduce the graph with the modification algorithm, and finally to apply a standard graph clustering algorithm to the reduced graph. We applied this method to ASTRAL-40 non-redundant protein domains, identifying significant pairwise similarities with Yakusa, a program devised for rapid 3D structure alignment.
CONCLUSIONS: We show that filtering similarities by applying ternary similarity constraints before a standard graph-based clustering process i) improves the separation of proteins of different classes and consequently ii) improves the classification quality of standard graph-based clustering algorithms with respect to the reference classification SCOP.

5.
The alpha/beta barrel fold is adopted by most enzymes performing a variety of catalytic reactions, but with very low sequence similarity. In order to understand the stabilizing interactions important in maintaining the alpha/beta barrel fold, we have identified residue clusters in a dataset of 36 alpha/beta barrel proteins that have less than 10% sequence identity among themselves. A graph theoretical algorithm is used to identify backbone clusters. This approach uses the global information of the nonbonded interactions in the alpha/beta barrel fold for the clustering procedure. The nonbonded interactions are represented mathematically in the form of an adjacency matrix. On diagonalizing the adjacency matrix, clusters and cluster centers are obtained from the highest eigenvalue and its corresponding vector components. Residue clusters are identified in the strand regions forming the beta barrel and are topologically conserved in all 36 proteins studied. The residues forming the cluster in each of the alpha/beta proteins are also conserved among the sequences belonging to the same family. The cluster centers are found to occur in the middle of the strands or at the C-terminal end of the strands. In most cases, the residues forming the clusters are part of the active site or are located close to the active site. The folding nucleus of the alpha/beta fold is predicted based on hydrophobicity index evaluation of residues and identification of cluster centers. The predicted nucleation sites are found to occur mostly in the middle of the strands. Proteins 2001;43:103-112.
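
The eigenvector step described in this abstract can be illustrated with a minimal sketch. It assumes a symmetric 0/1 contact matrix and uses plain power iteration in place of a full diagonalization; the toy contact list and the function name `principal_eigenpair` are hypothetical and are not taken from the paper.

```python
def principal_eigenpair(adjacency, iterations=500):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    n = len(adjacency)
    vector = [1.0 / n] * n
    for _ in range(iterations):
        # Multiply by (A + I): the shift leaves the eigenvectors unchanged
        # but prevents oscillation on bipartite contact graphs.
        product = [
            vector[i] + sum(adjacency[i][j] * vector[j] for j in range(n))
            for i in range(n)
        ]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    # The Rayleigh quotient of the unit-length vector gives the eigenvalue of A.
    image = [sum(adjacency[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * a for v, a in zip(vector, image))
    return eigenvalue, vector


# Hypothetical nonbonded contacts between five residues (indices 0..4).
contacts = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4)]
size = 5
adjacency = [[0] * size for _ in range(size)]
for a, b in contacts:
    adjacency[a][b] = adjacency[b][a] = 1

eigenvalue, vector = principal_eigenpair(adjacency)
# The residue with the largest eigenvector component is taken as the cluster center.
center = max(range(size), key=lambda i: vector[i])
```

In this toy graph residue 0 has the most contacts and the largest component, so it is reported as the center. How the authors turn vector components into cluster membership is not specified in the abstract and is not modelled here.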

6.
We describe a new method for identifying the sequences that signal the start of translation, and the boundaries between exons and introns (donor and acceptor sites) in human mRNA. According to the mandatory keyword, ORGANISM, and feature key, CDS, a large set of standard data for each signal site was extracted from the ASCII flat file, gbpri.seq, in the GenBank release 108.0. This was used to generate the scoring matrices, which summarize the sequence information for each signal site. The scoring matrices take into account the independent nucleotide frequencies between adjacent bases in each position within the signal site regions, and the relative weight on each nucleotide in proportion to their probabilities in the known signal sites. Using a scoring scheme that is based on the nucleotide scoring matrices, the method has great sensitivity and specificity when used to locate signals in uncharacterized human genomic DNA. These matrices are especially effective at distinguishing true and false sites.
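
As a rough illustration of how a nucleotide scoring matrix can be built from known sites and used to scan a sequence, the sketch below uses a generic log-odds position weight matrix. This is an assumption: the uniform background of 0.25, the pseudocount, the toy sites and all function names are hypothetical, and the adjacent-base frequencies mentioned in the abstract are not modelled.

```python
import math


def build_scoring_matrix(sites, background=0.25, pseudocount=1.0):
    """Return one {base: log-odds score} dict per position of the aligned sites."""
    matrix = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {base: math.log2((counts[base] / total) / background) for base in "ACGT"}
        )
    return matrix


def score(matrix, window):
    """Sum the per-position scores of a window as wide as the matrix."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_window(matrix, sequence):
    """Return the start index of the highest-scoring window in the sequence."""
    width = len(matrix)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: score(matrix, sequence[i:i + width]))


# Hypothetical aligned donor-like sites; real sites would be extracted from GenBank.
known_sites = ["GTAAGT", "GTGAGT", "GTAAGA", "GTAAGT"]
matrix = build_scoring_matrix(known_sites)
hit = best_window(matrix, "CCAGGTAAGTCC")
```

A practical predictor would also need a score threshold to separate true from false sites; the abstract does not say how that threshold is chosen.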

7.
Even in the absence of a template, glycosyltransferases can catalyze the synthesis of carbohydrate polymers of specific sequence. The paradigm has been that one enzyme catalyzes the formation of one type of glycosidic linkage, yet certain glycosyltransferases generate polysaccharide sequences composed of two distinct linkage types. In principle, bifunctional glycosyltransferases can possess separate active sites for each catalytic activity or one active site with dual activities. We encountered the fundamental question of one or two distinct active sites in our investigation of the galactosyltransferase GlfT2. GlfT2 catalyzes the formation of mycobacterial galactan, a critical cell-wall polymer composed of galactofuranose residues connected with alternating, regioisomeric linkages. We found that GlfT2 mediates galactan polymerization using only one active site that manifests dual regioselectivity. Structural modeling of the bifunctional glycosyltransferases hyaluronan synthase and cellulose synthase suggests that these enzymes also generate multiple glycosidic linkages using a single active site. These results highlight the versatility of glycosyltransferases for generating polysaccharides of specific sequence. We postulate that a hallmark of processive elongation of a carbohydrate polymer by a bifunctional enzyme is that one active site can give rise to two separate types of glycosidic bonds.

8.
Dou Y, Geng X, Gao H, Yang J, Zheng X, Wang J. The Protein Journal, 2011, 30(4): 229-239
Predicting the catalytic sites of a given enzyme is an important open problem in bioinformatics. Recently, many machine learning-based methods have been developed, which have the advantage that they can account for many sequential or structural features. We found that although many kinds of features are incorporated, protein sequence conservation is the main source of information these methods use and should continue to play an important role in the future. We therefore tested several conservation features for their ability to predict catalytic sites using a Support Vector Machine classifier. Our results suggest that the position-specific scoring matrix performs better than other features and that incorporating conservation information from sequentially adjacent sites is more effective than incorporating it from structurally adjacent ones. Moreover, although conservation information is effective in predicting catalytic sites, optimizing the combination of conservation features with other features remains a difficult problem.

9.
Mishra P, Pandey PN. Bioinformation, 2011, 6(10): 372-374
The number of amino acid sequences in protein databases such as Swiss-Prot, UniProt, PIR and others is increasing very rapidly, but the structures of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined, and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks to group amino acid sequences into subsets based on the distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity, meaning that sequences in the same cluster are highly similar to each other; and separation, meaning that sequences in different clusters have low similarity to each other. We tested our method on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the number of superfamilies in that set; there are fewer singletons; and the method correctly groups most remote homologs.
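
The idea that clusters correspond to connected subgraphs of a similarity graph can be sketched as follows. This is a minimal illustration assuming a single hard similarity cutoff; the `threshold` parameter, the toy scores and the function name are hypothetical and do not reproduce the authors' exact procedure.

```python
from collections import defaultdict


def cluster_by_similarity(pairs, threshold):
    """Group sequence IDs into connected components of the thresholded graph.

    pairs: iterable of (id_a, id_b, similarity_score) tuples.
    """
    graph = defaultdict(set)
    for a, b, similarity in pairs:
        graph[a]  # touching the key registers the ID as a vertex
        graph[b]
        if similarity >= threshold:
            graph[a].add(b)
            graph[b].add(a)

    seen, clusters = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first traversal collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Hypothetical pairwise similarity scores between five sequences.
toy_pairs = [("P1", "P2", 0.9), ("P2", "P3", 0.8), ("P3", "P4", 0.2), ("P4", "P5", 0.7)]
clusters = cluster_by_similarity(toy_pairs, threshold=0.5)
```

With the cutoff at 0.5 the weak P3-P4 link is dropped, giving the clusters P1-P3 and P4-P5. Raising the cutoff trades homogeneity against the number of singletons.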

10.
11.
The hydrolases and transferases that constitute the alpha-amylase family are multidomain proteins, but each has a catalytic domain in the form of a (beta/alpha)(8)-barrel, with the active site being at the C-terminal end of the barrel beta-strands. Although the enzymes are believed to share the same catalytic acids and a common mechanism of action, they have been assigned to three separate families - 13, 70 and 77 - in the classification scheme for glycoside hydrolases and transferases that is based on amino acid sequence similarities. Each enzyme has one glutamic acid and two aspartic acid residues necessary for activity, while most enzymes of the family also contain two histidine residues critical for transition state stabilisation. These five residues occur in four short sequences conserved throughout the family, and within such sequences some key amino acid residues are related to enzyme specificity. A table is given showing motifs distinctive for each specificity as extracted from 316 sequences, which should aid in identifying the enzyme from primary structure information. Where appropriate, existing problems with identification of some enzymes of the family are pointed out. For enzymes of known three-dimensional structure, action is discussed in terms of molecular architecture. The sequence-specificity and structure-specificity relationships described may provide useful pointers for rational protein engineering.

12.
We present a neural network-based method (ChloroP) for identifying chloroplast transit peptides and their cleavage sites. Using cross-validation, 88% of the sequences in our homology-reduced training set were correctly classified as transit peptides or nontransit peptides. This performance level is well above that of the publicly available chloroplast localization predictor PSORT. Cleavage sites are predicted using a scoring matrix derived by an automatic motif-finding algorithm. Approximately 60% of the known cleavage sites in our sequence collection were predicted to within ±2 residues from the cleavage sites given in SWISS-PROT. An analysis of 715 Arabidopsis thaliana sequences from SWISS-PROT suggests that the ChloroP method should be useful for the identification of putative transit peptides in genome-wide sequence data. The ChloroP predictor is available as a web-server at http://www.cbs.dtu.dk/services/ChloroP/.

13.
LAS enzymes are a group of metallopeptidases that share an active site architecture and a core folding motif and have been named according to the group members lysostaphin, D-Ala-D-Ala carboxypeptidase and sonic hedgehog. Escherichia coli MepA is a periplasmic, penicillin-insensitive murein endopeptidase that cleaves the D-alanyl-meso-2,6-diamino-pimelyl amide bond in E. coli peptidoglycan. The enzyme lacks sequence similarity with other peptidases, and is currently classified as a peptidase of unknown fold and catalytic class in all major databases. Here, we build on our observation that two motifs, characteristic of the newly described LAS group of metallopeptidases, are conserved in MepA-type sequences. We demonstrate that recombinant E. coli MepA is sensitive to metal chelators and that mutations in the predicted Zn2+ ligands His-113, Asp-120, and His-211 inactivate the enzyme. Moreover, we present the crystal structure of MepA. The active site of the enzyme is most similar to the active sites of lysostaphin and D-Ala-D-Ala carboxypeptidase, and the fold is most closely related to the N-domain of sonic hedgehog. We conclude that MepA-type peptidases are LAS enzymes.

14.
We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the highest score, so ortholog clustering is formulated as a combinatorial optimization problem. The algorithm for finding an ortholog cluster runs in time O(|E| + |V| log |V|), where V and E are the sets of vertices and edges, respectively, in the graph. However, if we discretize the similarity scores into a constant number of bins, the runtime improves to O(|E| + |V|). The proposed method was applied to seven complete eukaryote genomes on which the manually curated database of eukaryotic ortholog clusters, KOG, is constructed. A comparison of our results with the manually curated ortholog clusters shows that our clusters are well correlated with the existing clusters.

15.
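A molecular algorithm for the matching problem based on plasmid DNA   Total citations: 7; self-citations: 0; citations by others: 7
Given an undirected graph, the minimum maximal matching problem asks for the smallest of the maximal sets of pairwise non-adjacent edges; it is a well-known NP-complete problem. In 1994, Adleman first proposed using DNA computing to solve NP-complete problems: encoded DNA sequences serve as the operands, and hard mathematical problems are solved through molecular-biology operations, which may make it possible to solve NP-complete problems. We propose a plasmid-DNA-based molecular algorithm for the maximum matching problem on undirected graphs, in which restriction endonuclease digestion and gel electrophoresis are used to generate the candidate solutions and to separate the final solution. Given the molecular-biology experimental techniques it relies on, the algorithm is effective and feasible.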
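
For readers unfamiliar with the minimum maximal matching problem, a conventional brute-force sketch is given here. It is an in-silico analogue of the generate-and-filter idea only, not the plasmid-DNA procedure itself; all function names and the toy graph are hypothetical.

```python
from itertools import combinations


def is_matching(edges):
    """True if no two edges in the set share a vertex."""
    used = set()
    for u, v in edges:
        if u in used or v in used:
            return False
        used.update((u, v))
    return True


def is_maximal(matching, all_edges):
    """True if every edge of the graph touches a vertex already matched."""
    covered = {vertex for edge in matching for vertex in edge}
    return all(u in covered or v in covered for u, v in all_edges)


def minimum_maximal_matching(all_edges):
    """Try edge subsets from smallest to largest; return the first maximal matching."""
    for size in range(len(all_edges) + 1):
        for candidate in combinations(all_edges, size):
            if is_matching(candidate) and is_maximal(candidate, all_edges):
                return list(candidate)
    return []


# Hypothetical path graph a-b-c-d.
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
smallest = minimum_maximal_matching(path_edges)
```

On the path a-b-c-d the single middle edge is already a maximal matching, whereas a maximum matching would contain two edges. The exhaustive search is exponential in the number of edges, which is the cost the DNA approach aims to offset through massive parallelism.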

16.
17.
An important task in functional genomics is to cluster homologous proteins, which may share common functions. Annotating proteins of unknown function by transferring annotations from their homologues of known annotation is one of the most efficient ways to predict protein function. In this paper, we use a modularity-based method called CD for grouping together homologous proteins. The method employs a global heuristic search strategy to find the partitioning of the weighted adjacency graph with the largest modularity. The weighted adjacency graph is constructed by the sigmoidal transformation of all pairwise sequence similarities between all protein sequences in a given dataset. The method has been extensively tested on several subsets from the superfamily level of the SCOP (Structural Classification of Proteins) database, where some homologous proteins have very low sequence similarity. Compared with the widely used method MCL, we observe that the number of clusters obtained by CD is closer to the number of superfamilies in the dataset, the value of the F-measure given by CD is 10% better than that of MCL on average, and CD is more tolerant of noise in the sequence similarities. The experimental results indicate that CD is ideally suited for clustering homologous proteins when sequence similarity is low.
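
The quantity such a method maximizes can be made concrete with a small sketch of the standard (Newman) modularity of a weighted, undirected graph for a given partition. The heuristic search used by CD and its sigmoidal weighting are not reproduced; the toy graph and function name are hypothetical.

```python
from collections import defaultdict


def modularity(weights, communities):
    """Weighted modularity Q of a partition.

    weights: {(u, v): w} with each undirected edge listed once.
    communities: list of collections of node IDs covering every node.
    """
    label = {node: idx for idx, group in enumerate(communities) for node in group}
    total = sum(weights.values())          # total edge weight m
    inside = defaultdict(float)            # edge weight inside each community
    strength = defaultdict(float)          # summed node strengths per community
    for (u, v), w in weights.items():
        strength[label[u]] += w
        strength[label[v]] += w
        if label[u] == label[v]:
            inside[label[u]] += w
    # Q = sum over communities of (inside / m) - (strength / 2m)^2
    return sum(inside[c] / total - (strength[c] / (2 * total)) ** 2 for c in strength)


# Hypothetical graph: two triangles joined by a single edge, all weights 1.
toy_weights = {
    ("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 1.0,
    ("d", "e"): 1.0, ("e", "f"): 1.0, ("d", "f"): 1.0,
    ("c", "d"): 1.0,
}
q_split = modularity(toy_weights, [{"a", "b", "c"}, {"d", "e", "f"}])
q_merged = modularity(toy_weights, [{"a", "b", "c", "d", "e", "f"}])
```

Splitting the toy graph into its two triangles gives a higher modularity than leaving all nodes in one community, which is the kind of comparison a modularity-maximizing search performs over many candidate partitions.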

18.
19.
MOTIVATION: Predicting protein function is a central problem in bioinformatics, and many approaches use partially or fully automated methods based on various combinations of sequence, structure and other information on proteins or genes. Such information establishes relationships between proteins that can be modelled most naturally as edges in graphs. A priori, however, it is often unclear which edges from which graph may contribute most to accurate predictions. For that reason, one established strategy is to integrate all available sources, or graphs as in graph integration, in the hope that the positive signals will add to each other. However, in the problem of functional prediction, noise, i.e. the presence of inaccurate or false edges, can still be large enough that integration alone has little effect on prediction accuracy. In order to reduce noise levels and to improve integration efficiency, we present here a recent method in graph-based learning, graph sharpening, which provides a theoretically firm yet intuitive and practical approach for disconnecting undesirable edges from protein similarity graphs. This approach has several attractive features: it is quick, scalable in the number of proteins, robust with respect to errors and tolerant of very diverse types of protein similarity measures. RESULTS: We tested the classification accuracy in a test set of 599 proteins with remote sequence homology spread over 20 Gene Ontology (GO) functional classes. When compared to integration alone, graph sharpening plus integration of four vastly different molecular similarity measures improved the overall classification by nearly 30% [0.17 average increase in the area under the ROC curve (AUC)]. Moreover, and partially through the increased sparsity of the graphs induced by sharpening, this gain in accuracy came at negligible computational cost: sharpening and integration took on average 4.66 (±4.44) CPU seconds.
AVAILABILITY: Software and Supplementary data will be available at http://mammoth.bcm.tmc.edu/

20.
Predicting RNA 3D structure from sequence is a major challenge in biophysics. An important sub-goal is accurately identifying recurrent 3D motifs from RNA internal and hairpin loop sequences extracted from secondary structure (2D) diagrams. We have developed and validated new probabilistic models for 3D motif sequences based on hybrid Stochastic Context-Free Grammars and Markov Random Fields (SCFG/MRF). The SCFG/MRF models are constructed using atomic-resolution RNA 3D structures. To parameterize each model, we use all instances of each motif found in the RNA 3D Motif Atlas and annotations of pairwise nucleotide interactions generated by the FR3D software. Isostericity relations between non-Watson–Crick basepairs are used in scoring sequence variants. SCFG techniques model nested pairs and insertions, while MRF ideas handle crossing interactions and base triples. We use test sets of randomly-generated sequences to set acceptance and rejection thresholds for each motif group and thus control the false positive rate. Validation was carried out by comparing results for four motif groups to RMDetect. The software developed for sequence scoring (JAR3D) is structured to automatically incorporate new motifs as they accumulate in the RNA 3D Motif Atlas when new structures are solved and is available free for download.
