Similar articles
1.
Prediction of protein domain with mRMR feature selection and analysis
Li BQ, Hu LL, Chen L, Feng KY, Cai YD, Chou KC. PLoS One 2012, 7(6):e39308
The domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desired to develop effective methods for predicting the protein domains according to the sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from the sequence information still remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28-40% higher than that of the existing method on the same benchmark dataset. Furthermore, it was revealed by an in-depth analysis that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool in annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
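As a rough illustration of the mRMR-plus-IFS idea named in the abstract above, the following Python sketch ranks discrete features greedily by relevance minus mean redundancy (both measured as mutual information) and then keeps the best-scoring prefix of that ranking. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the difference form of the mRMR criterion, and the `evaluate` callback (standing in for something like a random-forest accuracy estimate) are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR ranking (assumed difference form: relevance - mean redundancy).

    features: dict mapping a feature name to a list of discrete values.
    target:   list of class labels, same length as each feature column.
    """
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    ranked = []
    remaining = sorted(features)  # sorted so that ties are broken by name

    def score(name):
        if not ranked:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in ranked
        ) / len(ranked)
        return relevance[name] - redundancy

    while remaining:
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(ranked, evaluate):
    """Return the top-k prefix of `ranked` that maximises evaluate(prefix).

    `evaluate` is a hypothetical callback, e.g. a cross-validated accuracy.
    Ties are resolved in favour of the smaller feature set.
    """
    best_k = max(range(1, len(ranked) + 1), key=lambda k: evaluate(ranked[:k]))
    return ranked[:best_k]
```

In this sketch a feature that merely duplicates an already selected one is pushed down the ranking, because its redundancy term cancels its relevance; that is the behaviour mRMR is designed to produce.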

2.
It has been known even since relatively few structures had been solved that longer protein chains often contain multiple domains, which may fold separately and play the role of reusable functional modules found in many contexts. In many structural biology tasks, in particular structure prediction, it is of great use to be able to identify domains within the structure and analyze these regions separately. However, when using sequence data alone this task has proven exceptionally difficult, with relatively little improvement over the naive method of choosing boundaries based on size distributions of observed domains. The recent significant improvement in contact prediction provides a new source of information for domain prediction. We test several methods for using this information including a kernel smoothing‐based approach and methods based on building alpha‐carbon models and compare performance with a length‐based predictor, a homology search method and four published sequence‐based predictors: DOMCUT, DomPRO, DLP‐SVM, and SCOOBY‐DOmain. We show that the kernel‐smoothing method is significantly better than the other ab initio predictors when both single‐domain and multidomain targets are considered and is not significantly different to the homology‐based method. Considering only multidomain targets the kernel‐smoothing method outperforms all of the published methods except DLP‐SVM. The kernel smoothing method therefore represents a potentially useful improvement to ab initio domain prediction. Proteins 2013. © 2012 Wiley Periodicals, Inc.

3.
Protein domains are functional and structural units of proteins. Therefore, identification of domain–domain interactions (DDIs) can provide insight into the biological functions of proteins. In this article, we propose a novel discriminative approach for predicting DDIs based on both protein–protein interactions (PPIs) and the derived information of non‐PPIs. We make a threefold contribution to the work in this area. First, we take into account non‐PPIs explicitly and treat the domain combinations that can discriminate PPIs from non‐PPIs as putative DDIs. Second, DDI identification is formalized as a feature selection problem that seeks a minimum set of informative features (i.e., putative DDIs) that discriminate PPIs from non‐PPIs, which is biologically plausible and makes it possible to predict DDIs in a systematic and accurate manner. Third, multidomain combinations including two‐domain combinations are taken into account in the proposed method, where multidomain cooperations may help proteins to interact with each other. Numerical results on several DDI prediction benchmark data sets show that the proposed discriminative method performs comparably well with other top algorithms with respect to overall performance, and outperforms other methods in terms of precision. The PPI data sets used for prediction of DDIs and prediction results can be found at http://csb.shu.edu.cn/dipd. Proteins 2010. © 2009 Wiley‐Liss, Inc.

4.
The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

5.
The overall function of a multi‐domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment‐based methods commonly utilize domain‐level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of the other domains in a protein or of the domain‐linker regions, and therefore cannot classify multi‐domain proteins. An alignment‐free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi‐domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi‐domain protein sequences. In this article, CLAP‐based classification has been explored on 5 datasets of multi‐domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP‐based classification scheme resulted in a clustering similar to that obtained from an alignment‐based method. CLAP‐based clusters obtained for full‐length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi‐domain proteins could be classified effectively by considering full‐length sequences without a requirement of identification of domains in the sequence.

6.
Background: Similarity-based computational methods are a useful tool for predicting protein functions from protein–protein interaction (PPI) datasets. Although various similarity-based prediction algorithms have been proposed, unsatisfactory prediction results have occurred on many occasions. The purpose of this type of algorithm is to predict functions of an unannotated protein from the functions of those proteins that are similar to the unannotated protein. Therefore, the prediction quality largely depends on how to select a set of proper proteins (i.e., a prediction domain) from which the functions of an unannotated protein are predicted, and how to measure the similarity between proteins. Another issue with existing algorithms is that they treat function prediction as a one-off procedure, ignoring the fact that interactions amongst proteins are mutual and dynamic in terms of similarity when predicting functions. How to resolve these major issues to increase prediction quality remains a challenge in computational biology. Results: In this paper, we propose an innovative approach to predict protein functions of unannotated proteins iteratively from a PPI dataset. The iterative approach takes into account the mutual and dynamic features of protein interactions when predicting functions, and addresses the issues of protein similarity measurement and prediction domain selection by introducing into the prediction algorithm a new semantic protein similarity and a method of selecting the multi-layer prediction domain. The new protein similarity is based on the multi-layered information carried by protein functions. The evaluations conducted on real protein interaction datasets demonstrated that the proposed iterative function prediction method outperformed other similar or non-iterative methods, and provided better prediction results. Conclusions: The new protein similarity derived from multi-layered information of protein functions more reasonably reflects the intrinsic relationships among proteins, and significant improvement to the prediction quality can occur through incorporation of mutual and dynamic features of protein interactions into the prediction algorithm.

7.
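In recent years, with the emergence of new technologies and methods in the life sciences, research on the structure and function of enzyme proteins has steadily deepened. In multi-domain enzymes, the individual domains often have independent catalytic or substrate-binding functions, which gives them great research and application value in recombinant enzymes and combinatorial biosynthesis. The diversity in the functions and organization of these domains is the basis for studying molecular evolution. Evolutionary analysis of domains is important for studying the evolutionary process of multi-domain enzymes, the relationships among functionally similar enzymes, and the classification and identification of enzymes. This article reviews research on the evolutionary relationships among the domains of multi-domain enzymes from the perspectives of domain repetition, horizontal gene transfer of domains, and domain recombination.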

8.
Comparative studies of the proteomes from different organisms have provided valuable information about protein domain distribution in the kingdoms of life. Earlier studies have been limited by the fact that only about 50% of the proteomes could be matched to a domain. Here, we have extended these studies by including less well-defined domain definitions, Pfam-B and clustered domains, MAS, in addition to Pfam-A and SCOP domains. It was found that a significant fraction of these domain families are homologous to Pfam-A or SCOP domains. Further, we show that all regions that do not match a Pfam-A or SCOP domain contain a significantly higher fraction of disordered structure. These unstructured regions may be contained within orphan domains or function as linkers between structured domains. Using several different definitions we have re-estimated the number of multi-domain proteins in different organisms and found that several methods all predict that eukaryotes have approximately 65% multi-domain proteins, while the prokaryotes consist of approximately 40% multi-domain proteins. However, these numbers are strongly dependent on the exact choice of cut-off for domains in unassigned regions. In conclusion, all eukaryotes have similar fractions of multi-domain proteins and disorder, whereas a high fraction of repeating domains is found only in multicellular eukaryotes. This implies a role for repeats in cell-cell contacts while the other two features are important for intracellular functions.

9.
Conserved domains represent essential building blocks of most known proteins. Owing to their role as modular components carrying out specific functions they form a network based both on functional relations and direct physical interactions. We have previously shown that domain interaction networks provide substantially novel information with respect to networks built on full-length protein chains. In this work we present a comprehensive web resource for exploring the Domain Interaction MAp (DIMA), interactively. The tool aims at integration of multiple data sources and prediction techniques, two of which have been implemented so far: domain phylogenetic profiling and experimentally demonstrated domain contacts from known three-dimensional structures. A powerful yet simple user interface enables the user to compute, visualize, navigate and download domain networks based on specific search criteria. Availability: http://mips.gsf.de/genre/proj/dima

10.
Sim J, Kim SY, Lee J. Proteins 2005, 59(3):627-632
Successful prediction of protein domain boundaries provides valuable information not only for the computational structure prediction of multidomain proteins but also for the experimental structure determination. Since protein sequences of multiple domains may contain much information regarding evolutionary processes such as gene-exon shuffling, this information can be detected by analyzing the position-specific scoring matrix (PSSM) generated by PSI-BLAST. We have presented a method, PPRODO (Prediction of PROtein DOmain boundaries), that predicts domain boundaries of proteins from sequence information by a neural network. The network is trained and tested using the values obtained from the PSSM generated by PSI-BLAST. A 10-fold cross-validation technique is performed to obtain the parameters of neural networks using a nonredundant set of 522 proteins containing 2 contiguous domains. PPRODO provides good and consistent results for the prediction of domain boundaries, with accuracy of about 66% using the ±20 residue criterion. The PPRODO source code, as well as all data sets used in this work, are available from http://gene.kias.re.kr/~jlee/pprodo/.
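The residue-window criterion quoted in the abstract above can be made concrete with a short Python sketch. It assumes the usual reading of such a criterion, namely that a predicted boundary counts as correct when it lies within 20 residues of the annotated boundary; the function names and the one-boundary-per-protein input format are hypothetical and not taken from PPRODO.

```python
def boundary_is_correct(predicted, annotated, window=20):
    """True if the predicted boundary lies within `window` residues of the annotated one."""
    return abs(predicted - annotated) <= window


def boundary_accuracy(pairs, window=20):
    """Fraction of (predicted, annotated) boundary pairs that satisfy the window criterion.

    pairs: list of (predicted_position, annotated_position) tuples, one per
           two-domain protein (an assumed input format for this sketch).
    """
    if not pairs:
        raise ValueError("need at least one (predicted, annotated) pair")
    hits = sum(boundary_is_correct(p, a, window) for p, a in pairs)
    return hits / len(pairs)
```

A wider window makes the criterion more lenient, so accuracies reported under different windows are not directly comparable.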

11.
Identification of structural domains in uncharacterized protein sequences is important in the prediction of protein tertiary folds and functional sites, and hence in designing biologically active molecules. We present a new predictive computational method of classifying a protein into single, two continuous or two discontinuous domains using Bayesian Data Mining. The algorithm requires only the primary sequence and computer-predicted secondary structure. It incorporates correlation patterns between certain 3-dimensional motifs and some local helical folds found conserved in the vicinity of protein domains with high statistical confidence. The prediction of domain class by this computationally simple and fast method shows good accuracy: average accuracies are 83.3% for single domain, 60% for two continuous and 65.7% for two discontinuous domain proteins. Experiments on the large validation sample show its performance to be significantly better than that of DGS and DomSSEA. Computations of Bayesian probabilities show important features in terms of correlation of certain conserved patterns of secondary folds and tertiary motifs and give new insight. Applications for improved accuracy of predicting domain boundary points relevant to protein structural and functional modeling are also highlighted.

12.
Zhou Shixin, Sun Xiao, Lu Zuhong. Yi Chuan (Hereditas) 2004, 26(6):984-990
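Proteins containing a homeobox form a large class of DNA-binding proteins that play important roles in embryonic development, regulation of gene expression, cell differentiation, neurogenesis, and other processes. In recent years the homeobox has been found to coexist with other domains, such as PAX, POU, LIM, OAR, CUT, ELK, bZIP, SIX, PHD-finger, and Engrailed, and it has also recently been found to take part in tumorigenesis through gene fusion or gene deregulation. This article reviews research progress on the types, structures, and functions of these proteins and genes containing composite homeoboxes.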

13.
RNase II is a single-stranded-specific 3'-exoribonuclease that degrades RNA generating 5'-mononucleotides. This enzyme is the prototype of a ubiquitous family of enzymes that are crucial in RNA metabolism and share a similar domain organization. By sequence prediction, three different domains have been assigned to the Escherichia coli RNase II: two RNA-binding domains at each end of the protein (CSD and S1), and a central RNB catalytic domain. In this work we have performed a functional characterization of these domains in order to address their role in the activity of RNase II. We have constructed a large set of RNase II truncated proteins and compared them to the wild-type regarding their exoribonucleolytic activity and RNA-binding ability. The dissociation constants were determined using different single- or double-stranded substrates. The results obtained revealed that S1 is the most important domain in the establishment of stable RNA-protein complexes, and its elimination results in a drastic reduction in RNA-binding ability. In addition, we also demonstrate that the N-terminal CSD plays a very specific role in RNase II, preventing a tight binding of the enzyme to single-stranded poly(A) chains. Moreover, the biochemical results obtained with an RNB mutant that lacks both putative RNA-binding domains revealed the presence of an additional region involved in RNA binding. This region was identified by sequence analysis and secondary structure prediction as a third putative RNA-binding domain located at the N-terminal part of the RNB catalytic domain.

14.
Ortholog identification is used in gene functional annotation, species phylogeny estimation, phylogenetic profile construction and many other analyses. Bioinformatics methods for ortholog identification are commonly based on pairwise protein sequence comparisons between whole genomes. Phylogenetic methods of ortholog identification have also been developed; these methods can be applied to protein data sets sharing a common domain architecture or which share a single functional domain but differ outside this region of homology. While promiscuous domains represent a challenge to all orthology prediction methods, overall structural similarity is highly correlated with proximity in a phylogenetic tree, conferring a degree of robustness to phylogenetic methods. In this article, we review the issues involved in orthology prediction when data sets include sequences with structurally heterogeneous domain architectures, with particular attention to automated methods designed for high-throughput application, and present a case study to illustrate the challenges in this area.

15.
A significant proportion of proteins comprise multiple domains. Domain–domain docking is a tool that predicts multi-domain protein structures when individual domain structures can be accurately predicted but when domain orientations cannot be predicted accurately. GalaxyDomDock predicts an ensemble of domain orientations from given domain structures by docking. Such information would also be beneficial in elucidating the functions of proteins that have multiple states with different domain orientations. GalaxyDomDock is an ab initio domain–domain docking method based on GalaxyTongDock, a previously developed protein–protein docking method. Infeasible domain orientations for the given linker are effectively screened out from the docked conformations by a geometric filter, using the Dijkstra algorithm. In addition, domain linker conformations are predicted by adopting a loop sampling method FALC. The proposed GalaxyDomDock outperformed existing ab initio domain–domain docking methods, such as AIDA and Rosetta, in performance tests on the Rosetta benchmark set of two-domain proteins. GalaxyDomDock also performed better than or comparable to AIDA on the AIDA benchmark set of two-domain proteins and two-domain proteins containing discontinuous domains, including the benchmark set in which each domain of the set was modeled by the recent version of AlphaFold. The GalaxyDomDock web server is freely available as a part of GalaxyWEB at http://galaxy.seoklab.org/domdock.
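The abstract above does not spell out how the geometric filter works, so the following Python sketch only illustrates one plausible reading: use Dijkstra's algorithm to find the shortest obstacle-avoiding path between the two linker attachment points, and reject a docked orientation when that path is longer than the fully extended linker. Everything specific here is an assumption made for illustration, not GalaxyDomDock's actual procedure: the toy 2D grid, the function names, the grid spacing, and the use of 3.8 Å (the typical distance between consecutive alpha-carbons) as the maximum span per residue.

```python
import heapq


def shortest_path_length(blocked, start, goal, shape):
    """Dijkstra over a 2D grid with unit-cost moves to the four neighbours.

    blocked: set of (x, y) cells occupied by domain atoms (toy obstacle model).
    Returns the path length in grid steps, or None if the goal is unreachable.
    """
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            return d
        if d > dist[cell]:
            continue  # stale queue entry
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < shape[0] and 0 <= nxt[1] < shape[1]
            if not inside or nxt in blocked:
                continue
            nd = d + 1.0
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return None


def linker_can_span(blocked, start, goal, shape, n_linker_residues,
                    spacing=1.0, max_span_per_residue=3.8):
    """Keep an orientation only if the extended linker could cover the shortest path.

    spacing is the assumed grid spacing in angstroms; 3.8 angstroms per residue
    is the typical distance between consecutive alpha-carbons.
    """
    steps = shortest_path_length(blocked, start, goal, shape)
    if steps is None:
        return False
    return steps * spacing <= n_linker_residues * max_span_per_residue
```

The point of using a shortest path rather than the straight-line distance is that a linker cannot pass through the domains it connects, so an orientation can be infeasible even when the attachment points are close in space.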

16.
17.
The immunoglobulin (Ig) fold is one of the most important structures in biology, playing essential roles in the vertebrate immune response, cell adhesion, and many other processes. Through bioinformatic analysis, we have discovered that Ig-like domains are often found in the constituent proteins of tailed double-stranded (ds) DNA bacteriophage particles, and are likely displayed on the surface of these viruses. These phage Ig-like domains fall into three distinct sequence families, which are similar to the classic immunoglobulin domain (I-Set), the fibronectin type 3 repeat (FN3), and the bacterial Ig-like domain (Big2). The phage Ig-like domains are very promiscuous. They are attached to more than ten different functional classes of proteins, and found in all three morphogenetic classes of tailed dsDNA phages. In addition, they reside in phages that infect a diverse set of gram negative and gram positive bacteria. These domains are deceptive because many are added to larger proteins through programmed ribosomal frameshifting, so that they are not always detected by standard protein sequence searching procedures. In addition, the presence of unrecognized Ig-like domains in a variety of phage proteins with different functions has led to gene misannotation. Our results demonstrate that horizontal gene transfer involving Ig-like domain encoding DNA has occurred commonly between diverse classes of both lytic and temperate phages, which otherwise display very limited sequence similarities to one another. We suggest that phage may have been an important vector in the spread of Ig-like domains through diverse species of bacteria. While the function of the phage Ig-like domains is unknown, several lines of evidence suggest that they may play an accessory role in phage infection by weakly interacting with carbohydrates on the bacterial cell surface.

18.
Understanding the dynamics behind domain architecture evolution is of great importance to unravel the functions of proteins. Complex architectures have been created throughout evolution by rearrangement and duplication events. An interesting question is how many times a particular architecture has been created, a form of convergent evolution or domain architecture reinvention. Previous studies have approached this issue by comparing architectures found in different species. We wanted to achieve a finer-grained analysis by reconstructing protein architectures on complete domain trees. The prevalence of domain architecture reinvention in 96 genomes was investigated with a novel domain tree-based method that uses maximum parsimony for inferring ancestral protein architectures. Domain architectures were taken from Pfam. To ensure robustness, we applied the method to bootstrap trees and only considered results with strong statistical support. We detected multiple origins for 12.4% of the scored architectures. In a much smaller data set, the subset of completely domain-assigned proteins, the figure was 5.6%. These results indicate that domain architecture reinvention is a much more common phenomenon than previously thought. We also determined which domains are most frequent in multiply created architectures and assessed whether specific functions could be attributed to them. However, no strong functional bias was found in architectures with multiple origins.
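To give a feel for what maximum-parsimony inference of ancestral states on a tree involves, here is a minimal Python sketch of the bottom-up pass of Fitch's small-parsimony algorithm, a standard textbook technique. It is swapped in for illustration only: the abstract above does not say which parsimony variant the authors used, and the nested-tuple tree encoding, the architecture labels, and the function name are hypothetical.

```python
def fitch(tree, leaf_states):
    """Bottom-up pass of Fitch's small-parsimony algorithm on a rooted binary tree.

    tree:        a leaf name (str) or a 2-tuple (left_subtree, right_subtree).
    leaf_states: dict mapping each leaf name to its observed state, e.g. a
                 domain-architecture label.
    Returns (candidate_states_at_this_node, minimum_number_of_state_changes).
    """
    if isinstance(tree, str):
        return {leaf_states[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, leaf_states)
    right_states, right_cost = fitch(right, leaf_states)
    shared = left_states & right_states
    if shared:
        # The children agree on at least one state: no extra change is needed.
        return shared, left_cost + right_cost
    # The children disagree: one change is unavoidable somewhere below this node.
    return left_states | right_states, left_cost + right_cost + 1
```

In an analysis of architecture reinvention, the interesting cases would be those where the most parsimonious reconstruction still requires the same architecture to arise more than once; counting those events needs the top-down pass as well, which this sketch omits.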

19.
The C2 domain is a Ca2+-binding motif of approximately 130 residues in length originally identified in the Ca2+-dependent isoforms of protein kinase C. Single and multiple copies of C2 domains have been identified in a growing number of eukaryotic signalling proteins that interact with cellular membranes and mediate a broad array of critical intracellular processes, including membrane trafficking, the generation of lipid-second messengers, activation of GTPases, and the control of protein phosphorylation. As a group, C2 domains display the remarkable property of binding a variety of different ligands and substrates, including Ca2+, phospholipids, inositol polyphosphates, and intracellular proteins. Expanding this functional diversity is the fact that not all proteins containing C2 domains are regulated by Ca2+, suggesting that some C2 domains may play a purely structural role or may have lost the ability to bind Ca2+. The present review summarizes the information currently available regarding the structure and function of the C2 domain and provides a novel sequence alignment of 65 C2 domain primary structures. This alignment predicts that C2 domains form two distinct topological folds, illustrated by the recent crystal structures of C2 domains from synaptotagmin 1 and phosphoinositide-specific phospholipase C-delta 1, respectively. The alignment highlights residues that may be critical to the C2 domain fold or required for Ca2+ binding and regulation.

20.
Protein N-glycosylation plays an important role in protein function. Yet, at present, few computational methods are available for the prediction of this protein modification. This prompted our development of a support vector machine (SVM)-based method for this task, as well as a partial least squares (PLS) regression based prediction method for comparison. A functional domain feature space was used to create SVM and PLS models, which achieved accuracies of 83.91% and 79.89%, respectively, as evaluated by a leave-one-out cross-validation. Subsequently, SVM and PLS models were developed based on functional domain and protein secretion information, which yielded accuracies of 89.13% and 86%, respectively. This analysis demonstrates that the protein functional domain and secretion information are both efficient predictors of N-glycosylation.
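Leave-one-out cross-validation, the evaluation scheme named in the abstract above, holds out each sample in turn, trains on the rest, and reports the fraction of held-out samples predicted correctly. The Python sketch below shows the procedure under stated assumptions: a simple 1-nearest-neighbour rule stands in for the SVM and PLS models (which are not reproduced here), and the function names and data layout are hypothetical.

```python
def nearest_neighbour(train_x, train_y, query):
    """Stand-in classifier: return the label of the closest training sample."""
    closest = min(
        range(len(train_x)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], query)),
    )
    return train_y[closest]


def leave_one_out_accuracy(samples, labels, fit_predict=nearest_neighbour):
    """Hold out each sample once, predict it from all the others, and average.

    fit_predict(train_x, train_y, query) is any function that trains on the
    given data and returns a predicted label for `query`.
    """
    hits = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        if fit_predict(train_x, train_y, samples[i]) == labels[i]:
            hits += 1
    return hits / len(samples)
```

Because every sample is used for testing exactly once, the estimate uses the data efficiently, at the cost of training the model as many times as there are samples.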
