Similar articles
1.
We describe a method to identify protein domain boundaries from sequence information alone based on the assumption that hydrophobic residues cluster together in space. SnapDRAGON is a suite of programs developed to predict domain boundaries based on the consistency observed in a set of alternative ab initio three-dimensional (3D) models generated for a given protein multiple sequence alignment. This is achieved by running a distance geometry-based folding technique in conjunction with a 3D-domain assignment algorithm. The overall accuracy of our method in predicting the number of domains for a non-redundant data set of 414 multiple alignments, representing 185 single and 231 multiple-domain proteins, is 72.4 %. Using domain linker regions observed in the tertiary structures associated with each query alignment as the standard of truth, inter-domain boundary positions are delineated with an accuracy of 63.9 % for proteins comprising continuous domains only, and 35.4 % for proteins with discontinuous domains. Overall, domain boundaries are delineated with an accuracy of 51.8 %. The prediction accuracy values are independent of the pair-wise sequence similarities within each of the alignments. These results demonstrate the capability of our method to delineate domains in protein sequences associated with a wide variety of structural domain organisation.

2.
Current methods for identification of domains within protein sequences require either structural information or the identification of homologous domain sequences in different sequence contexts. Knowledge of structural domain boundaries is important for fold recognition experiments and structural determination by X-ray crystallography or nuclear magnetic resonance spectroscopy using the divide-and-conquer approach. Here, a new and conceptually simple method for the identification of structural domain boundaries in multiple protein sequence alignments is presented. Analysis of covariance at positions within the alignment is first used to predict 3D contacts. By the nature of the domain as an independent folding unit, inter-domain predicted contacts are fewer than intra-domain predicted contacts. By analysing all possible domain boundaries and constructing a smoothed profile of predicted contact density (PCD), true structural domain boundaries are predicted as local profile minima associated with low PCD. A training data set is constructed from 52 non-homologous two-domain protein sequences of known 3D structure and used to determine optimal parameters for the profile analysis. The alignments in the training data set contained 48 +/- 17 (mean +/- SD) sequences and lengths of 257 +/- 121 residues. Of the 47 alignments yielding predictions, 35% of true domain boundaries are predicted to within 15 amino acids by the local profile minimum with the lowest profile value. Including predictions from the second- and third-lowest local minima increases the correct domain boundary coverage to 60%, whereas the lowest five local minima cover 79% of correct domain boundaries. Through further profile analysis, criteria are presented which reliably identify subsets of more accurate predictions. Retrospective analysis of CASP3 targets shows predictions of sufficient accuracy to enable dramatically improved fold recognition results. Finally, a prediction is made for geminivirus AL1 protein which is in full agreement with biochemical data, yielding a plausible, novel threading result.
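
The profile-minimum idea in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the normalisation of the contact density (crossing contacts divided by the number of possible cross-boundary residue pairs), the moving-average window, and the function names `crossing_density`, `smooth` and `local_minima` are all assumptions made for illustration, and the contact list is toy data.

```python
def crossing_density(contacts, length):
    """Density of predicted contacts that cross each candidate boundary.

    Boundary b lies between residues b-1 and b (0-based). The density is
    the number of crossing contacts divided by the number of possible
    cross-boundary pairs, b * (length - b). This normalisation is an
    assumption of this sketch, not the published PCD definition.
    """
    profile = []
    for b in range(1, length):
        crossing = sum(1 for i, j in contacts if min(i, j) < b <= max(i, j))
        profile.append(crossing / (b * (length - b)))
    return profile  # profile[k] belongs to boundary b = k + 1


def smooth(values, window=3):
    """Simple moving average; the window shrinks at the ends."""
    half = window // 2
    out = []
    for k in range(len(values)):
        chunk = values[max(0, k - half):k + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def local_minima(profile):
    """Indices strictly lower than both neighbours, lowest value first."""
    idx = [k for k in range(1, len(profile) - 1)
           if profile[k] < profile[k - 1] and profile[k] < profile[k + 1]]
    return sorted(idx, key=lambda k: profile[k])


if __name__ == "__main__":
    # Toy protein of 10 residues: two fully connected 5-residue "domains"
    # with no predicted contacts between them.
    contacts = [(i, j) for i in range(5) for j in range(i + 1, 5)]
    contacts += [(i, j) for i in range(5, 10) for j in range(i + 1, 10)]
    profile = smooth(crossing_density(contacts, 10))
    for k in local_minima(profile):
        print("candidate boundary before residue", k + 1)
```

Ranking the minima by profile value mirrors the abstract's use of the lowest, second-lowest and third-lowest minima as successive predictions.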

3.
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

4.
The delineation of domain boundaries of a given sequence in the absence of known 3D structures or detectable sequence homology to known domains benefits many areas in protein science, such as protein engineering, protein 3D structure determination and protein structure prediction. With the exponential growth of newly determined sequences, our ability to predict domain boundaries rapidly and accurately from sequence information alone is both essential and critical from the viewpoint of gene function annotation. Anyone attempting to predict domain boundaries for a single protein sequence is invariably confronted with a plethora of databases that contain boundary information available from the internet and a variety of methods for domain boundary prediction. How are these derived and how well do they work? What definition of 'domain' do they use? We will first clarify the different definitions of protein domains, and then describe the available public databases with domain boundary information. Finally, we will review existing domain boundary prediction methods and discuss their strengths and weaknesses.

5.
The elucidation of the domain content of a given protein sequence in the absence of determined structure or significant sequence homology to known domains is an important problem in structural biology. Here we address how successfully the delineation of continuous domains can be accomplished in the absence of sequence homology using simple baseline methods, an existing prediction algorithm (Domain Guess by Size), and a newly developed method (DomSSEA). The study was undertaken with a view to measuring the usefulness of these prediction methods in terms of their application to fully automatic domain assignment. Thus, the sensitivity of each domain assignment method was measured by calculating the number of correctly assigned top scoring predictions. We have implemented a new continuous domain identification method using the alignment of predicted secondary structures of target sequences against observed secondary structures of chains with known domain boundaries as assigned by Class Architecture Topology Homology (CATH). Taking top predictions only, the success rate of the method in correctly assigning domain number to the representative chain set is 73.3%. The top prediction for domain number and location of domain boundaries was correct for 24% of the multidomain set (+/-20 residues). These results have been put into context in relation to the results obtained from the other prediction methods assessed.

6.
Choanoflagellates are considered to be the closest living unicellular relatives of metazoans. The genome of the choanoflagellate Monosiga brevicollis contains a surprisingly high number and diversity of tyrosine kinases, tyrosine phosphatases, and phosphotyrosine-binding domains. Many of the tyrosine kinases possess combinations of domains that have not been observed in any multicellular organism. The role of these protein interaction domains in M. brevicollis kinase signaling is not clear. Here, we have carried out a biochemical characterization of Monosiga HMTK1, a protein containing a putative PTB domain linked to a tyrosine kinase catalytic domain. We cloned, expressed, and purified HMTK1, and we demonstrated that it possesses tyrosine kinase activity. We used immobilized peptide arrays to define a preferred ligand for the PTB domain of HMTK1. Peptide sequences containing this ligand sequence are phosphorylated efficiently by recombinant HMTK1, suggesting that the PTB domain of HMTK1 has a role in substrate recognition analogous to the SH2 and SH3 domains of mammalian Src family kinases. We suggest that the substrate recruitment function of the noncatalytic domains of tyrosine kinases arose before their roles in autoinhibition.

7.
MOTIVATION: Although many methods are available for the identification of structural domains from protein three-dimensional structures, accurate definition of protein domains and the curation of such data for a large number of proteins are often possible only after manual intervention. The availability of domain definitions for protein structural entries is useful for the sequence analysis of aligned domains, structure comparison, fold recognition procedures and understanding protein folding, domain stability and flexibility. RESULTS: We have improved our method of domain identification starting from the concept of clustering secondary structural elements, but with an intention of reducing the number of discontinuous segments in identified domains. The results of our modified and automatic approach have been compared with the domain definitions from other databases. On a test data set of 55 proteins, this method acquires high agreement (88%) in the number of domains with the crystallographers' definition and resources such as SCOP, CATH, DALI, 3Dee and PDP databases. This method also obtains 98% overlap score with the other resources in the definition of domain boundaries of the 55 proteins. We have examined the domain arrangements of 4592 non-redundant protein chains using the improved method to include 5409 domains leading to an update of the structural domain database. AVAILABILITY: The latest version of the domain database and online domain identification methods are available from http://www.ncbs.res.in/~faculty/mini/ddbase/ddbase.html Supplementary information: http://www.ncbs.res.in/~faculty/mini/ddbase/supplementary/supplementary.html

8.
The identification and annotation of protein domains provides a critical step in the accurate determination of molecular function. Both computational and experimental methods of protein structure determination may be deterred by large multi-domain proteins or flexible linker regions. Knowledge of domains and their boundaries may reduce the experimental cost of protein structure determination by allowing researchers to work on a set of smaller and possibly more successful alternatives. Current domain prediction methods often rely on sequence similarity to conserved domains and as such are poorly suited to detect domain structure in poorly conserved or orphan proteins. We present here a simple computational method to identify protein domain linkers and their boundaries from sequence information alone. Our domain predictor, Armadillo (http://armadillo.blueprint.org), uses any amino acid index to convert a protein sequence to a smoothed numeric profile from which domains and domain boundaries may be predicted. We derived an amino acid index called the domain linker propensity index (DLI) from the amino acid composition of domain linkers using a non-redundant structure dataset. The index indicates that Pro and Gly show a propensity for linker residues while small hydrophobic residues do not. Armadillo predicts domain linker boundaries from Z-score distributions and obtains 35% sensitivity with DLI in a two-domain, single-linker dataset (within +/-20 residues from linker). The combination of DLI and an entropy-based amino acid index increases the overall Armadillo sensitivity to 56% for two-domain proteins. Moreover, Armadillo achieves 37% sensitivity for multi-domain proteins, surpassing most other prediction methods. Armadillo provides a simple, but effective method by which prediction of domain boundaries can be obtained with reasonable sensitivity. Armadillo should prove to be a valuable tool for rapidly delineating protein domains in poorly conserved proteins or those with no sequence neighbors. Domain meta-predictors could yield improved results by using Armadillo as a first-line predictor.
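
The index-to-profile step described in the abstract above can be sketched in a few lines of Python. This is only an illustration of the general technique (amino acid index, smoothing, Z-scores), not Armadillo itself: the index values in `TOY_INDEX` are hypothetical and are not the published DLI, and the window size, the Z-score cut-off and all function names are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical index: only Pro and Gly score as linker-like.
# These are NOT the published DLI values.
TOY_INDEX = {"P": 1.0, "G": 1.0}


def smoothed_profile(seq, index, window=5):
    """Map residues to index values, then apply a moving average."""
    raw = [index.get(aa, 0.0) for aa in seq]
    half = window // 2
    return [mean(raw[max(0, k - half):k + half + 1]) for k in range(len(raw))]


def z_scores(profile):
    """Standardise a profile; a flat profile yields all zeros."""
    mu, sd = mean(profile), pstdev(profile)
    if sd == 0:
        return [0.0] * len(profile)
    return [(v - mu) / sd for v in profile]


def candidate_linker_positions(seq, index=TOY_INDEX, window=5, z_cut=1.5):
    """0-based positions whose smoothed score is at least z_cut SDs above the mean."""
    z = z_scores(smoothed_profile(seq, index, window))
    return [k for k, v in enumerate(z) if v >= z_cut]


if __name__ == "__main__":
    # Toy sequence: a Pro/Gly-rich stretch between two poly-Ala segments.
    seq = "A" * 20 + "PGPGP" + "A" * 20
    print(candidate_linker_positions(seq))
```

Because the profile is built from a plain dictionary lookup, any other amino acid index can be swapped in without changing the rest of the sketch, which matches the abstract's statement that the predictor accepts any index.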

9.
ABC transporters are a large superfamily of integral membrane proteins involved in ATP-dependent transport across biological membranes. Members of this superfamily play roles in a number of phenomena of biomedical interest, including cystic fibrosis (CFTR) and multidrug resistance (P-glycoprotein, MRP). Most ABC transporters are predicted to consist of four domains, two membrane-spanning domains and two cytoplasmic domains. The latter contain conserved nucleotide-binding motifs. Attempts to determine the structure of ABC transporters and of their separate domains are in progress but have not yet been successful. To aid structure determination and possibly learn more about the domain boundaries, we set out to model nucleotide-binding domains (NBDs) of ABC transporters based on a known structure. Previous attempts to predict the 3D structure of NBDs were based solely on sequence similarity with known nucleotide-binding folds. We have analyzed the sequences of a number of nucleotide-binding domains with the algorithm THREADER, developed by D.T. Jones, and a possible fold was found in the structure of aspartate aminotransferase. We present a model for the N-terminal NBD of CFTR, based on the large domain of the A chain of aspartate aminotransferase. The model is refined using multiple sequence alignment, secondary structure prediction, and 3D-1D profiles. Our model seems to be in good agreement with known properties of nucleotide-binding domains and has some appealing characteristics compared with the previous models. Proteins 30:275–286, 1998. © 1998 Wiley-Liss, Inc.

10.

Background

The proportion of conserved DNA sequences with no clear function is steadily growing in bioinformatics databases. Studies of sequence and structural homology have indicated that many uncharacterized protein domain sequences are variants of functionally described domains. If these variants promote an organism's ecological fitness, they are likely to be conserved in the genome of its progeny and the population at large. The genetic composition of microbial communities in their native ecosystems is accessible through metagenomics. We hypothesize that the co-variation of protein domain sequences across metagenomes from similar ecosystems will provide insights into their potential roles and aid further investigation.

Methodology/Principal findings

We calculated the correlation of Pfam protein domain sequences across the Global Ocean Sampling metagenome collection, employing conservative detection and correlation thresholds to limit results to well-supported hits and associations. We then examined intercorrelations between domains of unknown function (DUFs) and domains involved in known metabolic pathways using network visualization and cluster-detection tools. We used a cautious “guilty-by-association” approach, referencing knowledge-level resources to identify and discuss associations that offer insight into DUF function. We observed numerous DUFs associated with photobiologically active domains and prevalent in the Cyanobacteria. Other clusters included DUFs associated with DNA maintenance and repair, inorganic nutrient metabolism, and sodium-translocating transport domains. We also observed a number of clusters reflecting known metabolic associations and cases that predicted functional reclassification of DUFs.

Conclusion/Significance

Critically examining domain covariation across metagenomic datasets can grant new perspectives on the roles and associations of DUFs in an ecological setting. Targeted attempts at DUF characterization in the laboratory or in silico may draw from these insights, and opportunities to discover new associations and corroborate existing ones will arise as more large-scale metagenomic datasets emerge.
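
The covariation analysis summarised above amounts to correlating per-sample domain abundances and keeping only strongly correlated pairs. A minimal Python sketch follows, under stated assumptions: Pearson correlation is used here for illustration (the abstract does not say which correlation measure the authors applied), the threshold `min_r` is arbitrary, and the domain names and counts are invented toy data.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def correlated_pairs(abundance, min_r=0.9):
    """Return (domain_a, domain_b, r) for every pair with r >= min_r.

    `abundance` maps a domain name to its counts across the same ordered
    list of samples. The threshold is an assumption of this sketch.
    """
    pairs = []
    for a, b in combinations(sorted(abundance), 2):
        r = pearson(abundance[a], abundance[b])
        if r >= min_r:
            pairs.append((a, b, r))
    return pairs


if __name__ == "__main__":
    # Invented counts across four samples; names are placeholders.
    toy = {
        "DUF_x": [1, 2, 3, 4],
        "PF_photo": [2, 4, 6, 8],
        "PF_other": [4, 1, 3, 2],
    }
    for a, b, r in correlated_pairs(toy):
        print(a, b, round(r, 3))
```

A DUF that survives the threshold alongside a domain of known function becomes a candidate for the cautious association-based annotation the abstract describes.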

11.
The number of amino acid residues contained in the S1 ribosomal protein of various bacteria varies in a wide range: from 111 to 863 residues in Spiroplasma kunkelii and Treponema pallidum, respectively. The architecture of this protein is traditionally (in particular, because of unknown spatial structure) represented as repeated S1 domains, the copy number of which depends on the protein length. The data on the copy number and boundaries of these domains are available in specialized databases, such as SMART, Pfam, and PROSITE; however, these data can be rather different for the same object. In this work, we used the approach utilizing analysis of predicted secondary structure (PsiPred program). This allowed us to detect the structural domains in S1 protein sequences; their copy number varied from one to six. Alignment of the S1 proteins containing different numbers of domains with the S1 RNA-binding domain of Escherichia coli polynucleotide phosphorylase provided for discovering a domain within this family displaying the maximal homology to the E. coli domain. This conserved domain migrates along the chain, and its location in the proteins with different numbers of domains follows a certain pattern. Similar to the S1 domain of polynucleotide phosphorylase, residues Phe19, Phe22, His34, Asp64, and Arg68 in this conserved domain are clustered on the surface to form an RNA-binding site.

12.
Structure of the gene for human coagulation factor V.
L D Cripe, K D Moore, W H Kane. Biochemistry 1992, 31(15):3777-3785
Activated factor V (Va) serves as an essential protein cofactor for the conversion of prothrombin to thrombin by factor Xa. Analysis of the factor V cDNA indicates that the protein contains several types of internal repeats with the following domain structure: A1-A2-B-A3-C1-C2. In this report we describe the isolation and characterization of genomic DNA coding for human factor V. The factor V gene contains 25 exons which range in size from 72 to 2820 bp. The structure of the gene for factor V is similar to the previously characterized gene for factor VIII. Based on the aligned amino acid sequences of the two proteins, 21 of the 24 intron-exon boundaries in the factor V gene occur at the same location as in the factor VIII gene. In both genes, the junctions of the A1-A2 and A2-A3 domains are each encoded by a single exon. In contrast, the boundaries between domains A3-C1 and C1-C2 occur at intron-exon boundaries, which is consistent with evolution through domain duplication and exon shuffling. The connecting region or B domain of factor V is encoded by a single large exon of 2820 bp. The corresponding exon of the factor VIII gene contains 3106 bp. The 5' and 3' ends of both of these exons encode sequences homologous to the carboxyl-terminal end of domain A2 and the amino-terminal end of domain A3 in ceruloplasmin. There is otherwise no homology between the B domain exons. (ABSTRACT TRUNCATED AT 250 WORDS)

13.
Background

Protein domains display a range of structural diversity, with numerous additions and deletions of secondary structural elements between related domains. We have observed a small number of cases of surprising large-scale deletions of core elements of structural domains. We propose a new concept called domain atrophy, where protein domains lose a significant number of core structural elements.

Results

Here, we implement a new pipeline to systematically identify new cases of domain atrophy across all known protein sequences. The output of this pipeline was carefully checked by hand, which filtered out partial domain instances that were unlikely to represent true domain atrophy due to misannotations or un-annotated sequence fragments. We identify 75 cases of domain atrophy, of which eight cases are found in a three-dimensional protein structure and 67 cases have been inferred based on mapping to a known homologous structure. Domains with structural variations include ancient folds such as the TIM-barrel and Rossmann folds. Most of these domains are observed to show structural loss that does not affect their functional sites.

Conclusion

Our analysis has significantly increased the known cases of domain atrophy. We discuss specific instances of domain atrophy and see that there has often been a compensatory mechanism that helps to maintain the stability of the partial domain. Our study indicates that although domain atrophy is an extremely rare phenomenon, protein domains under certain circumstances can tolerate extreme mutations giving rise to partial, but functional, domains.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0655-8) contains supplementary material, which is available to authorized users.

14.
Folmer RH, Geschwindner S, Xue Y. Biochemistry 2002, 41(48):14176-14184
The protein kinase ZAP-70 is involved in T-cell activation, and interacts with tyrosine-phosphorylated peptide sequences known as immunoreceptor tyrosine activation motifs (ITAMs), which are present in three of the subunits of the T-cell receptor. We have studied the tandem SH2 (tSH2) domains of ZAP-70, by both X-ray and NMR. Here, we present the crystal structure of the apoprotein, i.e., the tSH2 domain in the absence of ITAM. Comparison with the previously reported complex structure reveals that binding to the ITAM peptide induces surprisingly large movements between the two SH2 domains and within the actual binding sites. The conformation of the ITAM-free protein is partly governed by a hydrophobic cluster between the linker region and the C-terminal SH2 domain. Our data suggest that the two SH2 domains are able to undergo large interdomain movements. The proposed relative flexibility of the SH2 domains is further supported by the finding that no NMR signals could be detected for the two helices connecting the SH2 domains; these are likely to be broadened beyond detection due to conformational exchange. It is likely that this conformational reorientation induced by ITAM binding is the main signaling event activating the kinase domain in ZAP-70. Another NMR observation was that the N-terminal SH2 domain could bind tetrapeptides derived from the ITAM sequence, apparently without the need to interact with the C-terminal domain. In contrast, the C-terminal domain has little affinity for tetrapeptides. The opposite situation is true for binding to plain phosphotyrosine, where the C-terminal domain has a higher affinity. Distinct features in the crystal structure, showing the interdependence of both domains, explain these binding data.

15.

Motivation

The precise prediction of protein domains, which are the structural, functional and evolutionary units of proteins, has been a research focus in recent years. Although many methods have been presented for predicting protein domains and boundaries, the accuracy of predictions could be improved.

Results

In this study we present a novel approach, DomHR, which is an accurate predictor of protein domain boundaries based on a creative hinge region strategy. A hinge region was defined as a segment of amino acids that covers part of a domain region and a boundary region. We developed a strategy to construct profiles of domain-hinge-boundary (DHB) features generated by sequence-domain/hinge/boundary alignment against a database of known domain structures. The DHB features had three elements: normalized domain, hinge, and boundary probabilities. The DHB features were used as input to identify domain boundaries in a sequence. DomHR used a nonredundant dataset as the training set, the DHB and predicted shape string as features, and a conditional random field as the classification algorithm. In predicted hinge regions, a residue was determined to be a domain or a boundary according to a decision threshold. After decision thresholds were optimized, DomHR was evaluated by cross-validation, large-scale prediction, independent test and CASP (Critical Assessment of Techniques for Protein Structure Prediction) tests. All results confirmed that DomHR outperformed other well-established, publicly available domain boundary predictors for prediction accuracy.

Availability

DomHR is available at http://cal.tongji.edu.cn/domain/.

16.
In this paper, we describe a neural network analysis of sequences connecting two protein domains (domain linkers). The neural network was trained to distinguish between domain linker sequences and non-linker sequences, using a SCOP-defined domain library. The analysis indicated that a significant difference existed between domain linkers and non-linker regions, including intra-domain loop regions. Moreover, the resulting Hinton diagram showed a position-dependent amino acid preference of the domain linker sequences, and implied their non-random nature. We then applied the neural network to predict domain linkers in multi-domain protein sequences. As the result of a Jack-knife test, 58% of the predicted regions matched actual linker regions (specificity), and 36% of the SCOP-derived domain linkers were predicted (sensitivity). This prediction efficiency is superior to simpler methods derived from secondary structure prediction that assume that long loop regions are putative domain linkers. Altogether, these results suggest that domain linkers possess local characteristics different from those of loop regions.

17.
SH3 domains are common structure, interaction, and regulation modules found in more than 200 human proteins. In this report, we studied the third SH3 domain from the human CIN85 adaptor protein, which plays an important role in both receptor tyrosine kinase downregulation and phosphatidylinositol 3 kinase inhibition. The structure of this domain includes an additional 90° kink after the last canonical β-strand and features unusual interactions between the termini well outside the boundaries of the standard SH3 domain definition. The extended portions of the domain are well-structured and held together entirely by side chain-side chain interactions. Extensive expression screening showed that these additional contacts provide significantly increased stability to the domain. A similar 90° kink is found in only one other SH3 domain structure, while side chain contacts linking the termini have never been described before. As a result of the increased size of CIN85 SH3 domain C, the proximal proline rich region is positioned such that a possible intramolecular interaction is structurally inhibited. Using the key interactions of the termini as the basis for sequence analysis allowed the identification of several SH3 domains with flanking sequences that could adopt similar structures. This work illustrates the importance of careful experimental analysis of domain boundaries even for a well-characterized fold such as the SH3 domain.

18.
The leukocyte-common antigen (L-CA or T200) includes a family of lymphoid and myeloid cell surface glycoproteins with apparent molecular weights from 180,000 to 240,000. We report a partial protein sequence for thymocyte L-CA containing 1073 amino acids predicted from cDNA clones isolated using an oligonucleotide probe. Only one segment (residues 347-368) is likely to cross the membrane, and peptide data suggest that sequences N-terminal to this are outside the cell, with residues 369-1073 inside. The cytoplasmic domain includes possible phosphorylation sites and an internal homology between residues 385-671 and 676-986. Analysis of B lymphocyte cDNA clones suggests that B cell and thymocyte mRNAs are identical in 3' sequences, but size differences in Northern blots suggest 5' sequences may differ.

19.
Sim J, Kim SY, Lee J. Proteins 2005, 59(3):627-632
Successful prediction of protein domain boundaries provides valuable information not only for the computational structure prediction of multidomain proteins but also for the experimental structure determination. Since protein sequences of multiple domains may contain much information regarding evolutionary processes such as gene-exon shuffling, this information can be detected by analyzing the position-specific scoring matrix (PSSM) generated by PSI-BLAST. We have presented a method, PPRODO (Prediction of PROtein DOmain boundaries), that predicts domain boundaries of proteins from sequence information by a neural network. The network is trained and tested using the values obtained from the PSSM generated by PSI-BLAST. A 10-fold cross-validation technique is performed to obtain the parameters of neural networks using a nonredundant set of 522 proteins containing 2 contiguous domains. PPRODO provides good and consistent results for the prediction of domain boundaries, with accuracy of about 66% using the +/-20 residue criterion. The PPRODO source code, as well as all data sets used in this work, are available from http://gene.kias.re.kr/~jlee/pprodo/.
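
Several abstracts in this list score a predicted boundary as correct when it falls within a fixed number of residues of the true boundary (for example +/-20 residues above). A minimal Python sketch of that scoring rule follows. It assumes one predicted and one true boundary per protein, as in a two-domain test set; the function names and the toy positions are illustrative only.

```python
def within_tolerance(predicted, true, tolerance=20):
    """True if the predicted boundary lies within `tolerance` residues of the true one."""
    return abs(predicted - true) <= tolerance


def boundary_accuracy(pairs, tolerance=20):
    """Fraction of (predicted, true) boundary pairs that satisfy the tolerance.

    Assumes exactly one boundary per protein; multi-boundary proteins
    would need a matching step that this sketch does not attempt.
    """
    if not pairs:
        return 0.0
    hits = sum(within_tolerance(p, t, tolerance) for p, t in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Toy (predicted, true) boundary positions for three proteins.
    toy_pairs = [(100, 110), (150, 200), (80, 60)]
    print(boundary_accuracy(toy_pairs))
```

Passing a different `tolerance` reproduces the other windows quoted in this list, such as the 15-residue window in abstract 2, so figures reported under different criteria are not directly comparable.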

20.
We present heuristic-based predictions of the secondary and tertiary structures of the cyclins A, B, and D, representatives of the cyclin superfamily. The list of suggested constraints for tertiary structure assembly was left unrefined in order to submit this report before an announced crystal structure for cyclin A becomes available. To predict these constraints, a master sequence alignment over 270 positions of cyclin types A, B, and D was adjusted based on individual secondary structure predictions for each type. We used new heuristics for predicting aromatic residues at protein-protein interfaces and to identify sequentially distinct regions in the protein chain that cluster in the folded structure. The boundaries of two conjectured domains in the cyclin fold were predicted based on experimental data in the literature. The domain that is important for interaction of the cyclins with cyclin-dependent kinases (CDKs) is predicted to contain six helices; the second domain in the consensus model contains both helices and a β-sheet that is formed by sequentially distant regions in the protein chain. A plausible phosphorylation site is identified. This work represents a blinded test of the method for prediction of secondary and, to a lesser extent, tertiary structure from a set of homologous protein sequences. Evaluation of our predictions will become possible with the publication of the announced crystal structure.
