Similar literature
 Found 20 similar articles (search took 890 ms)
1.
MOTIVATION: Ideally, only proteins that exhibit highly similar domain architectures should be compared with one another as homologues or be classified into a single family. By combining three different indices, the Jaccard index, the Goodman-Kruskal gamma function and the domain duplicate index, into a single similarity measure, we propose a method for comparing proteins based on their domain architectures. RESULTS: Evaluation of the method using the eukaryotic orthologous groups of proteins (KOGs) database indicated that it allows the automatic and efficient comparison of multiple-domain proteins, which are usually refractory to classic approaches based on sequence similarity measures. As a case study, the PDZ and LRR_1 domains are used to demonstrate how proteins containing promiscuous domains can be clearly compared using our method. For the convenience of users, a web server was set up where three different query interfaces were implemented to compare different domain architectures or proteins with domain(s), and to identify the relationships among domain architectures within a given KOG from the Clusters of Orthologous Groups of Proteins database. CONCLUSION: The approach we propose is suitable for estimating the similarity of domain architectures of proteins, especially those of multidomain proteins. AVAILABILITY: http://cmb.bnu.edu.cn/pdart/.
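
As an illustration of the first of the three indices named above, the following is a minimal sketch of a Jaccard index computed over the sets of domains in two architectures. The function name, the list representation of an architecture, and the example domain names are assumptions made here for illustration; the abstract does not say how the three indices are weighted or combined, so that step is not shown.

```python
def jaccard_index(arch_a, arch_b):
    """Jaccard index of the sets of domains found in two architectures."""
    set_a, set_b = set(arch_a), set(arch_b)
    if not set_a and not set_b:
        # Convention chosen for this sketch: two empty architectures are identical.
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical architectures, written as N- to C-terminal lists of domain names.
protein_1 = ["PDZ", "SH3", "Guanylate_kin"]
protein_2 = ["PDZ", "PDZ", "SH3"]

# Two shared domains out of three distinct domains overall.
print(jaccard_index(protein_1, protein_2))
```

Because the index works on sets, it ignores both domain order and domain repeats, which is presumably why the abstract pairs it with an order-sensitive measure (Goodman-Kruskal gamma) and a duplicate index.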

2.
MOTIVATION: Although many methods are available for the identification of structural domains from protein three-dimensional structures, accurate definition of protein domains and the curation of such data for a large number of proteins are often possible only after manual intervention. The availability of domain definitions for protein structural entries is useful for the sequence analysis of aligned domains, structure comparison, fold recognition procedures and understanding protein folding, domain stability and flexibility. RESULTS: We have improved our method of domain identification starting from the concept of clustering secondary structural elements, but with the intention of reducing the number of discontinuous segments in identified domains. The results of our modified and automatic approach have been compared with the domain definitions from other databases. On a test data set of 55 proteins, this method achieves high agreement (88%) in the number of domains with the crystallographers' definition and resources such as SCOP, CATH, DALI, 3Dee and PDP databases. This method also obtains a 98% overlap score with the other resources in the definition of domain boundaries of the 55 proteins. We have examined the domain arrangements of 4592 non-redundant protein chains using the improved method to include 5409 domains leading to an update of the structural domain database. AVAILABILITY: The latest version of the domain database and online domain identification methods are available from http://www.ncbs.res.in/~faculty/mini/ddbase/ddbase.html Supplementary information: http://www.ncbs.res.in/~faculty/mini/ddbase/supplementary/supplementary.html

3.
Dividing protein structures into domains has proven useful for more accurate structural and functional characterization of proteins. Here, we develop a method, called DDOMAIN, that divides structure into DOMAINs using a normalized contact-based domain-domain interaction profile. Results of DDOMAIN are compared to AUTHORS annotations (domain definitions are given by the authors who solved protein structures), as well as to popular SCOP and CATH annotations by human experts and automatic programs. DDOMAIN's automatic annotations are most consistent with the AUTHORS annotations (90% agreement in number of domains and 88% agreement in both number of domains and at least 85% overlap in domain assignment of residues) if its three adjustable parameters are trained by the AUTHORS annotations. By comparison, the agreement is 83% (81% with at least 85% overlap criterion) between SCOP-trained DDOMAIN and SCOP annotations and 77% (73%) between CATH-trained DDOMAIN and CATH annotations. The agreement between DDOMAIN and AUTHORS annotations goes beyond single-domain proteins (97%, 82%, and 56% for single-, two-, and three-domain proteins, respectively). For an "easy" data set of proteins whose CATH and SCOP annotations agree with each other in number of domains, the agreement is 90% (89%) between "easy-set"-trained DDOMAIN and CATH/SCOP annotations. The consistency between SCOP-trained DDOMAIN and SCOP annotations is superior to that of two other recently developed, SCOP-trained, automatic methods, PDP (protein domain parser) and DomainParser 2. We also tested a simple consensus method made of PDP, DomainParser 2, and DDOMAIN and a different version of DDOMAIN based on a more sophisticated statistical energy function. The DDOMAIN server and its executable are available in the services section on http://sparks.informatics.iupui.edu.

4.
Protein domains exist by themselves or in combination with other domains to form complex multidomain proteins. Defining domain boundaries in proteins is essential for understanding their evolution and function but is not trivial. More specifically, partitioning domains that interact by forming a single β-sheet is known to be particularly troublesome for automatic structure-based domain decomposition pipelines. Here, we study edge-to-edge β-strand interactions between domains in a protein chain, to help define the boundaries for some more difficult cases where a single β-sheet spanning over two domains gives an appearance of one. We give a number of examples where β-strands belonging to a single β-sheet do not belong to a single domain and highlight the difficulties of automatic domain parsers on these examples. This work can be used as a baseline for defining domain boundaries in homologous proteins or proteins with similar domain interactions in the future.

5.
Decomposition of structural domains is an essential task in classifying protein structures, predicting protein function, and many other proteomics problems. As the number of known protein structures in PDB grows exponentially, the need for accurate automatic domain decomposition methods becomes more pressing. In this article, we introduce a bottom‐up algorithm for assigning protein domains using a graph theoretical approach. This algorithm is based on a center‐based clustering approach. For constructing initial clusters, members of an independent dominating set for the graph representation of a protein are considered as the centers. A distance matrix is then defined for these clusters. To obtain final domains, these clusters are merged using the compactness principle of domains and a method similar to the neighbor‐joining algorithm considering some thresholds. The thresholds are computed using a training set consisting of 50 protein chains. The algorithm is implemented in C++ and is named ProDomAs. To assess the performance of ProDomAs, its results are compared with seven automatic methods, against five publicly available benchmarks. The results show that ProDomAs outperforms other methods applied on the mentioned benchmarks. The performance of ProDomAs is also evaluated against 6342 chains obtained from ASTRAL SCOP 1.71. ProDomAs is freely available at http://www.bioinf.cs.ipm.ir/software/prodomas . Proteins 2014; 82:1937–1946. © 2014 Wiley Periodicals, Inc.

6.
7.
Li L  Wu C  Huang H  Zhang K  Gan J  Li SS 《Nucleic acids research》2008,36(10):3263-3273
Systematic identification of binding partners for modular domains such as Src homology 2 (SH2) is important for understanding the biological function of the corresponding SH2 proteins. We have developed a worldwide web-accessible computer program dubbed SMALI for scoring matrix-assisted ligand identification for SH2 domains and other signaling modules. The current version of SMALI harbors 76 unique scoring matrices for SH2 domains derived from screening oriented peptide array libraries. These scoring matrices are used to search a protein database for short peptides preferred by an SH2 domain. An experimentally determined cut-off value is used to normalize an SMALI score, therefore allowing for direct comparison in peptide-binding potential for different SH2 domains. SMALI employs distinct scoring matrices from Scansite, a popular motif-scanning program. Moreover, SMALI contains built-in filters for phosphoproteins, Gene Ontology (GO) correlation and colocalization of subject and query proteins. Compared to Scansite, SMALI exhibited improved accuracy in identifying binding peptides for SH2 domains. Applying SMALI to a group of SH2 domains identified hundreds of interactions that overlap significantly with known networks mediated by the corresponding SH2 proteins, suggesting SMALI is a useful tool for facile identification of signaling networks mediated by modular domains that recognize short linear peptide motifs.
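
The general idea of searching a sequence with a scoring matrix and normalizing by a cut-off can be sketched as follows. This is not SMALI's implementation: the matrix values, the peptide width, the rule that unlisted residues score zero, and the choice to normalize by dividing the raw score by the cut-off are all assumptions made here for illustration.

```python
def scan_peptides(sequence, matrix, cutoff):
    """Slide a position-specific scoring matrix along a protein sequence.

    matrix is a list of dicts, one per peptide position, mapping a residue
    letter to its score; residues missing from a dict score 0 (an assumption).
    cutoff must be positive. Returns (start, peptide, normalized_score) for
    every window whose normalized score is at least 1.0.
    """
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        peptide = sequence[start:start + width]
        raw = sum(column.get(residue, 0.0) for column, residue in zip(matrix, peptide))
        # Assumed normalization: a score of 1.0 means "exactly at the cut-off".
        normalized = raw / cutoff
        if normalized >= 1.0:
            hits.append((start, peptide, normalized))
    return hits


# Toy four-position matrix with made-up scores (not taken from SMALI).
toy_matrix = [{"Y": 2.0}, {"E": 1.0, "V": 0.5}, {"N": 1.5}, {"I": 1.0, "V": 0.8}]
for start, peptide, score in scan_peptides("AAYENIGGYVNV", toy_matrix, cutoff=4.0):
    print(start, peptide, round(score, 3))
```

Dividing by a per-domain cut-off puts scores from different matrices on a common scale, which is the property the abstract attributes to the normalized SMALI score.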

8.
Annotations of genes and their products are largely guided by inferring homology. Sequence similarity is the primary measure used for annotation; however, domain content and order have been given less importance, despite the fact that domain insertions, deletions, and positional changes can introduce functional variety. Of late, several methods have been developed that quantify domain architecture similarity based on alignments of the sequences, but they focus only on homologous proteins. We present an alignment-free domain architecture-similarity search (ADASS) algorithm that identifies proteins that share very poor sequence similarity yet have similar domain architectures. We introduce a “singlet matching-triplet comparison” method in ADASS, wherein each triplet of domains is compared with other triplets in a pairwise comparison of two domain architectures. Different events in the triplet comparison are scored according to a scoring scheme, and an average pairwise distance score (Domain Architecture Distance score - DAD Score) is calculated between protein domain architectures. We use the domain architectures of a selected domain, termed the centric domain, and cluster them based on the DAD score. The algorithm has a high Positive Predictive Value (PPV) with respect to the clustering of the sequences of selected domain architectures. A comparison of domain architecture based dendrograms using the ADASS method and an existing method revealed that ADASS can classify proteins depending on the extent of domain architecture level similarity. ADASS is more relevant in cases of proteins with tiny domains that contribute little to the overall sequence similarity but contribute significantly to the overall function.

9.
Yang Y  Zhan J  Zhao H  Zhou Y 《Proteins》2012,80(8):2080-2088
A structure alignment program aligns two structures by optimizing a scoring function that measures structural similarity. It is highly desirable that such a scoring function is independent of the sizes of the proteins being compared so that the significance of alignment across different sizes of the protein regions aligned is comparable. Here, we developed a new score called SP‐score that fixes the cutoff distance at 4 Å and removes the size dependence using a normalization prefactor. We further built a program called SPalign that optimizes SP‐score for structure alignment. SPalign was applied to recognize proteins within the same structure fold and having the same function of DNA or RNA binding. For fold discrimination, SPalign improves sensitivity over TMalign for the chain‐level comparison by 12% and over DALI for the domain‐level comparison by 13% at the same specificity of 99.6%. The difference between TMalign and SPalign at the chain level is due to the inability of TMalign to detect single domain similarity between multidomain proteins. For recognizing nucleic acid binding proteins, SPalign consistently improves over TMalign by 12% and DALI by 31% in average value of Matthews correlation coefficients for four datasets. SPalign with default setting is 14% faster than TMalign. SPalign is expected to be useful for function prediction and comparing structures with or without domains defined. The source code for SPalign and the server are available at http://sparks.informatics.iupui.edu . Proteins 2012;. © 2012 Wiley Periodicals, Inc.

10.
Recognition of short linear motifs (SLiMs) or peptides by proteins is an important component of many cellular processes. However, due to limited and degenerate binding motifs, prediction of cellular targets is challenging. In addition, many of these interactions are transient and of relatively low affinity. Here, we focus on one of the largest families of SLiM‐binding domains in the human proteome, the PDZ domain. These domains bind the extreme C‐terminus of target proteins, and are involved in many signaling and trafficking pathways. To predict endogenous targets of PDZ domains, we developed MotifAnalyzer‐PDZ, a program that filters and compares all motif‐satisfying sequences in any publicly available proteome. This approach enables us to determine possible PDZ binding targets in humans and other organisms. Using this program, we predicted and biochemically tested novel human PDZ targets by looking for strong sequence conservation in evolution. We also identified three C‐terminal sequences in choanoflagellates that bind a choanoflagellate PDZ domain, the Monosiga brevicollis SHANK1 PDZ domain (mbSHANK1), with endogenously‐relevant affinities, despite a lack of conservation with the targets of a homologous human PDZ domain, SHANK1. All three are predicted to be signaling proteins, with strong sequence homology to cytosolic and receptor tyrosine kinases. Finally, we analyzed and compared the positional amino acid enrichments in PDZ motif‐satisfying sequences from over a dozen organisms. Overall, MotifAnalyzer‐PDZ is a versatile program to investigate potential PDZ interactions. This proof‐of‐concept work is poised to enable similar types of analyses for other SLiM‐binding domains (e.g., MotifAnalyzer‐Kinase). MotifAnalyzer‐PDZ is available at http://motifAnalyzerPDZ.cs.wwu.edu .

11.
The structure of many proteins consists of a combination of discrete modules that have been shuffled during evolution. Such modules can frequently be recognized from the analysis of homology. Here we present a systematic analysis of the modular organization of all sequenced proteins. To achieve this we have developed an automatic method to identify protein domains from sequence comparisons. Homologous domains can then be clustered into consistent families. The method was applied to all 21,098 nonfragment protein sequences in SWISS-PROT 21.0, which was automatically reorganized into a comprehensive protein domain database, ProDom. We have constructed multiple sequence alignments for each domain family in ProDom, from which consensus sequences were generated. These nonredundant domain consensuses are useful for fast homology searches. Domain organization in ProDom is exemplified for proteins of the phosphoenolpyruvate:sugar phosphotransferase system (PEP:PTS) and for bacterial 2-component regulators. We provide 2 examples of previously unrecognized domain arrangements discovered with the help of ProDom.

12.
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

13.
The overall function of a multi‐domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment‐based methods commonly utilize domain‐level information and provide classification only at the level of domains. Such methods are not capable of taking into account the contributions of the other domains in the protein and of domain‐linker regions, or of classifying multi‐domain proteins as a whole. An alignment‐free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi‐domain protein sequences without requiring domain boundaries or the sequential order of domains to be defined. Through this method we aim to achieve a biologically meaningful classification scheme for multi‐domain protein sequences. In this article, CLAP‐based classification has been explored on 5 datasets of multi‐domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain level, the CLAP‐based classification scheme resulted in a clustering similar to that obtained from an alignment‐based method. CLAP‐based clusters obtained for full‐length datasets were shown to comprise proteins with similar functions and domain architectures. Our study demonstrates that multi‐domain proteins could be classified effectively by considering full‐length sequences without a requirement of identification of domains in the sequence.

14.
Protein domain decomposition using a graph-theoretic approach   Cited by: 2 (self-citations: 0; citations by others: 2)
MOTIVATION: Automatic decomposition of a multi-domain protein into individual domains represents a highly interesting and unsolved problem. As the number of protein structures in PDB is growing at an exponential rate, there is clearly a need for more reliable and efficient methods for protein domain decomposition simply to keep the domain databases up-to-date. RESULTS: We present a new algorithm for solving the domain decomposition problem, using a graph-theoretic approach. We have formulated the problem as a network flow problem, in which each residue of a protein is represented as a node of the network and each residue-residue contact is represented as an edge with a particular capacity, depending on the type of the contact. A two-domain decomposition problem is solved by finding a bottleneck (or a minimum cut) of the network, which minimizes the total cross-edge capacity, using the classical Ford-Fulkerson algorithm. A multi-domain decomposition problem is solved through repeatedly solving a series of two-domain problems. The algorithm has been implemented as a computer program, called DomainParser. We have tested the program on a commonly used test set consisting of 55 proteins. The decomposition results are 78.2% in agreement with the literature on both the number of decomposed domains and the assignments of residues to each domain, which compares favorably to existing programs. On the subset of two-domain proteins (20 in number), the program assigned 96.7% of the residues correctly when we require that the number of decomposed domains is two.
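
The two-domain step described above can be sketched as a minimum cut on a small residue contact graph. The code below uses the Edmonds-Karp variant (Ford-Fulkerson with breadth-first augmenting paths). The function name, the toy contact list, the integer capacities, and the hand-picked source and sink residues are assumptions made for illustration; how DomainParser assigns capacities or chooses terminals is not stated in the abstract.

```python
from collections import deque


def min_cut_two_domains(n_residues, contacts, source, sink):
    """Split residues into two groups by a minimum cut between source and sink.

    contacts is a list of (i, j, capacity) undirected residue-residue contacts.
    Returns (cut_capacity, source_side, sink_side).
    """
    cap = [[0] * n_residues for _ in range(n_residues)]
    for i, j, c in contacts:
        cap[i][j] += c
        cap[j][i] += c

    total_flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual network.
        parent = [-1] * n_residues
        parent[source] = source
        queue = deque([source])
        while queue and parent[sink] == -1:
            u = queue.popleft()
            for v in range(n_residues):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[sink] == -1:
            break  # no augmenting path left: the flow is maximal

        # Find the bottleneck along the path, then push that much flow.
        bottleneck = float("inf")
        v = sink
        while v != source:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = sink
        while v != source:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        total_flow += bottleneck

    # Residues still reachable from the source form one side of the minimum cut.
    source_side = {v for v in range(n_residues) if parent[v] != -1}
    return total_flow, source_side, set(range(n_residues)) - source_side


# Toy protein: two tightly packed residue triangles joined by one weak contact.
toy_contacts = [(0, 1, 3), (1, 2, 3), (0, 2, 3), (3, 4, 3), (4, 5, 3), (3, 5, 3), (2, 3, 1)]
print(min_cut_two_domains(6, toy_contacts, source=0, sink=5))
```

By the max-flow min-cut theorem, the flow value equals the total capacity of the contacts crossing between the two groups, so the weak 2-3 contact is the bottleneck in this toy case.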

15.
Structural genomic projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide an efficient support to structural genomics projects.
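
The plot described above (number of similar sequences at each residue of the query) can be sketched as a per-residue coverage profile followed by a simple threshold pass. This is a simplification, not the authors' rectangle-recognition procedure: the function names, the hit intervals, and the two thresholds are hypothetical, and real input would first have to be parsed from BLAST output.

```python
def coverage_profile(query_length, hits):
    """Count, for each query residue, how many hits cover it.

    hits is a list of (start, end) query coordinates, 1-based and inclusive.
    """
    profile = [0] * query_length
    for start, end in hits:
        for pos in range(start - 1, end):
            profile[pos] += 1
    return profile


def elevated_segments(profile, min_hits, min_length):
    """Return 1-based (start, end) runs where coverage stays at or above min_hits.

    Assumes min_hits >= 1; runs shorter than min_length residues are dropped.
    """
    segments = []
    run_start = None
    # A trailing zero acts as a sentinel so a run reaching the last residue is closed.
    for pos, count in enumerate(profile + [0]):
        if count >= min_hits and run_start is None:
            run_start = pos
        elif count < min_hits and run_start is not None:
            if pos - run_start >= min_length:
                segments.append((run_start + 1, pos))
            run_start = None
    return segments


# Hypothetical hit intervals on a 20-residue query.
toy_hits = [(1, 8), (2, 8), (1, 7), (12, 20), (13, 19), (12, 20)]
profile = coverage_profile(20, toy_hits)
print(elevated_segments(profile, min_hits=2, min_length=5))
```

A fixed threshold is the crudest way to find plateau-like elevations; it makes no attempt to fit rectangle edges, so boundaries shift with the choice of `min_hits`.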

16.
L Wernisch  M Hunting  S J Wodak 《Proteins》1999,35(3):338-352
A novel automatic procedure for identifying domains from protein atomic coordinates is presented. The procedure, termed STRUDL (STRUctural Domain Limits), does not take into account information on secondary structures and handles any number of domains made up of contiguous or non-contiguous chain segments. The core algorithm uses the Kernighan-Lin graph heuristic to partition the protein into residue sets which display minimum interactions between them. These interactions are deduced from the weighted Voronoi diagram. The generated partitions are accepted or rejected on the basis of optimized criteria, representing basic expected physical properties of structural domains. The graph heuristic approach is shown to be very effective: it closely approximates the exact solution provided by a branch and bound algorithm for a number of test proteins. In addition, the overall performance of STRUDL is assessed on a set of 787 representative proteins from the Protein Data Bank by comparison to domain definitions in the CATH protein classification. The domains assigned by STRUDL agree with the CATH assignments in at least 81% of the tested proteins. This result is comparable to that obtained previously using PUU (Holm and Sander, Proteins 1994;9:256-268), the only other available algorithm designed to identify domains with any number of non-contiguous chain segments. A detailed discussion of the structures for which our assignments differ from those in CATH brings to light some clear inconsistencies between the concept of structural domains based on minimizing inter-domain interactions and that of delimiting structural motifs that represent acceptable folding topologies or architectures. Considering both concepts as complementary and combining them in a layered approach might be the way forward.

17.
U Linne  S Doekel  M A Marahiel 《Biochemistry》2001,40(51):15824-15834
Incorporation of nonproteinogenic amino acids in small polypeptides synthesized by nonribosomal peptide synthetases (NRPS) significantly contributes to their biological activity. In these peptides, conversion of L-amino acids to the corresponding D-isomer is catalyzed by specialized NRPS modules that utilize an epimerization (E) domain. To understand the basis for the specific interaction of E domains with PCP domains (peptidyl carrier proteins, also described as T domains) and to investigate their substrate tolerance, we constructed a set of eight fusion proteins. The gene fragments encoding E and PCP-E domains of TycA (A-PCP-E), the one module tyrocidine synthetase A, were fused to different gene fragments encoding A and A-PCP domains, resulting in A/PCP-E and A-PCP/E types of fusion proteins (slash indicates site of fusion). We were able to show that the E domain of TycA, usually epimerizing Phe, does also accept the alternate substrates Trp, Ile, and Val, although with reduced efficiency. Interestingly, however, an epimerization activity was only observed in the case of fusion proteins where the PCP domain originates from modules containing an E domain. Sequence comparison revealed that such PCPs possess significant differences in the signature Ppant binding motif (CoreT: [GGDSI]), when compared to those carrier proteins originating from ordinary C-A-PCP elongation modules (CoreT: [GGHSL]). By means of mutational analysis, we could show that epimerization activity is influenced by the nature of amino acid residues in proximity to the cofactor Ppant binding site. The aspartate residue in front of the invariant serine (Ppant binding site) especially seems to play an important role for the proper interaction between PCP and the E domain, as well as the presentation of the aminoacyl-S-Ppant substrate in the course of substrate epimerization. In conclusion, specialized PCP domains are needed for a productive interaction with E domains when constructing hybrid enzymes.

18.
19.
Calretinin (CR) is a calcium-binding, neuronal protein of undefined function. Related proteins either buffer intracellular calcium concentrations or are involved in calcium-signaling pathways. We transformed three CR gene fragment sequences, corresponding to its three complementary domains (I-II, III-IV, and V-VI), into Pichia pastoris. High yields of extracellular expression, of more than 200 mg/liter, were achieved. Simple purification protocols provide high yields of homogeneous proteins: dialysis and DEAE-cellulose chromatography for domains I-II and III-IV or ammonium sulfate precipitation and octyl-Sepharose chromatography for domain V-VI. To our knowledge, this is the first report of the expression of an EF-hand protein using P. pastoris. Direct comparison of the purified yields of domain I-II indicates an approximately 20-fold improvement over Escherichia coli. N-terminal amino acid sequencing confirmed our gene products and two anti-calretinin antibodies recognized the appropriate domains. All three CR domains bind (45)Ca and the domain containing EF-hands V and VI seems to have a lower calcium capacity than the other domains. Circular dichroism indicates a high helix content for each of the domains. Calcium-induced structural changes in the first two domains, followed by tryptophan fluorescence, correspond with previous studies, while tyrosine emission fluorescence indicates calcium-induced structural changes also occur in domain V-VI. The methods and expression levels achieved are suitable for future NMR labeling of the proteins, with (15)N and (13)C, and structure-function studies that will help to further understand CR function.

20.
This analysis takes an in-depth look into the difficulties encountered by automatic methods for domain decomposition from three-dimensional structure. The analysis involves a multi-faceted set of criteria including the integrity of secondary structure elements, the tendency toward fragmentation of domains, domain boundary consistency and topology. The strength of the analysis comes from the use of a new comprehensive benchmark dataset, which is based on consensus among experts (CATH, SCOP and AUTHORS of the 3D structures) and covers 30 distinct architectures and 211 distinct topologies as defined by CATH. Furthermore, over 66% of the structures are multi-domain proteins, with each domain combination occurring once per dataset. The performance of four automatic domain assignment methods, DomainParser, NCBI, PDP and PUU, is carefully analyzed using this broad spectrum of topology combinations and knowledge of rules and assumptions built into each algorithm. We conclude that it is practically impossible for an automatic method to achieve the level of performance of human experts. However, we propose specific improvements to automatic methods as well as broadening the concept of a structural domain. Such work is a prerequisite for establishing improved approaches to domain recognition. (The benchmark dataset is available from http://pdomains.sdsc.edu).
