Similar articles
 Found 20 similar articles; search took 78 ms
1.
Background: Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2). Results: We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts. Conclusions: Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein’s gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0656-7) contains supplementary material, which is available to authorized users.
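
The 50% criterion quoted in the abstract above can be illustrated with a minimal Python sketch. It assumes that each domain hit is described by 1-based, inclusive start and end coordinates on the family model; the function names and the coordinate convention are hypothetical and are not taken from Pfam or from the study.

```python
def model_coverage(model_start: int, model_end: int, model_length: int) -> float:
    """Fraction of the family model covered by one aligned domain region.

    Coordinates are assumed to be 1-based and inclusive (an assumption
    made for this sketch, not a documented Pfam convention).
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    if not (1 <= model_start <= model_end <= model_length):
        raise ValueError("expected 1 <= model_start <= model_end <= model_length")
    return (model_end - model_start + 1) / model_length


def is_partial_domain(model_start: int, model_end: int, model_length: int,
                      threshold: float = 0.5) -> bool:
    """True when the hit covers less than `threshold` of the model length."""
    return model_coverage(model_start, model_end, model_length) < threshold
```

For example, a hit spanning model positions 1 to 40 of a 100-position model covers 40% of the model and would be flagged as a candidate partial domain under this definition.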

2.
3.
Production of diffracting crystals is a critical step in determining the three-dimensional structure of a protein by X-ray crystallography. Computational techniques to rank proteins by their propensity to yield diffraction-quality crystals can improve efficiency in obtaining structural data by guiding both protein selection and construct design. XANNpred comprises a pair of artificial neural networks that each predict the propensity of a selected protein sequence to produce diffraction-quality crystals by current structural biology techniques. Blind tests show XANNpred has accuracy and Matthews correlation values ranging from 75% to 81% and 0.50 to 0.63 respectively; values of area under the receiver operating characteristic (ROC) curve range from 0.81 to 0.88. On blind test data XANNpred outperforms the other available algorithms XtalPred, PXS, OB-Score, and ParCrys. XANNpred also guides construct design by presenting graphs of predicted propensity for diffraction-quality crystals against residue sequence position. The XANNpred-SG algorithm is likely to be most useful to target selection in structural genomics consortia, while the XANNpred-PDB algorithm is more suited to the general structural biology community. XANNpred predictions that include sliding window graphs are freely available from http://www.compbio.dundee.ac.uk/xannpred
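
For readers unfamiliar with the two headline metrics in the abstract above, the following minimal Python sketch shows how accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix. It is not the XANNpred evaluation code, and the counts in the usage note are hypothetical.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

With hypothetical counts of 40 true positives, 40 true negatives, 10 false positives and 10 false negatives, `accuracy` gives 0.8 and `matthews_corrcoef` gives 0.6, values of the same order as the ranges reported in the abstract.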

4.
Structural classification of families of membrane proteins by bioinformatics techniques has become a critical aspect of membrane protein research. We have proposed hydropathy profile alignment to identify structural homology between families of membrane proteins. Here, we demonstrate experimentally that two families of secondary transporters, the ESS and 2HCT families, indeed share similar folds. Members of the two families show highly similar hydropathy profiles but cannot be shown to be homologous by sequence similarity. A structural model was predicted for the ESS family transporters based upon an existing model of the 2HCT family transporters. In the model, the transporters fold into two domains containing five transmembrane segments and a reentrant or pore-loop each. The two pore-loops enter the membrane-embedded part of the proteins from opposite sides of the membrane. The model was verified by accessibility studies of cysteine residues in single-Cys mutants of the Na+-glutamate transporter GltS of Escherichia coli, a member of the ESS family. Cysteine residues positioned in predicted periplasmic loops were accessible from the periplasm by a bulky, membrane-impermeable thiol reagent, while cysteine residues in cytoplasmic loops were not. Furthermore, two cysteine residues in the predicted pore-loop entering the membrane from the cytoplasmic side were shown to be accessible for small, membrane-impermeable thiol reagents from the periplasm, as was demonstrated before for the Na+-citrate transporter CitS of Klebsiella pneumoniae, a member of the 2HCT family. The data strongly suggest that GltS of the ESS family and CitS of the 2HCT family share the same fold, as was predicted by comparing the averaged hydropathy profiles of the two families.
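
As background to the hydropathy profiles mentioned in the abstract above, the sketch below computes a sliding-window hydropathy profile for a single sequence. It assumes the Kyte-Doolittle scale and a default window of 19 residues; the abstract does not state which scale or window the authors used, and the family-averaged profiles it refers to would additionally require aligning and averaging the profiles of many members, which is not shown here.

```python
from typing import List

# Kyte-Doolittle hydropathy values (assumed scale for this sketch).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 19) -> List[float]:
    """Mean hydropathy of every full window along the sequence.

    Raises KeyError for residues outside the 20 standard amino acids.
    Returns an empty list when the sequence is shorter than the window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]
```

Stretches of consistently high values that are roughly 20 residues long are the usual signature of a candidate transmembrane segment in such a profile.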

5.
Standley DM, Toh H, Nakamura H. Proteins, 2008, 72(4): 1333-1351
A method to functionally annotate structural genomics targets, based on a novel structural alignment scoring function, is proposed. In the proposed score, position-specific scoring matrices are used to weight structurally aligned residue pairs to highlight evolutionarily conserved motifs. The functional form of the score is first optimized for discriminating domains belonging to the same Pfam family from domains belonging to different families but the same CATH or SCOP superfamily. In the optimization stage, we consider four standard weighting functions as well as our own, the "maximum substitution probability," and combinations of these functions. The optimized score achieves an area of 0.87 under the receiver-operating characteristic curve with respect to identifying Pfam families within a sequence-unique benchmark set of domain pairs. Confidence measures are then derived from the benchmark distribution of true-positive scores. The alignment method is next applied to the task of functionally annotating 230 query proteins released to the public as part of the Protein 3000 structural genomics project in Japan. Of these queries, 78 were found to align to templates with the same Pfam family as the query or had sequence identities ≥ 30%. Another 49 queries were found to match more distantly related templates. Within this group, the template predicted by our method to be the closest functional relative was often not the most structurally similar. Several nontrivial cases are discussed in detail. Finally, 103 queries matched templates at the fold level, but not the family or superfamily level, and remain functionally uncharacterized.

6.
7.
Selection of protein targets for study is central to structural biology and may be influenced by numerous factors. A key aim is to maximise returns for effort invested by identifying proteins with the balance of biophysical properties that are conducive to success at all stages (e.g. solubility, crystallisation) in the route towards a high-resolution structural model. Selected targets can be optimised through construct design (e.g. to minimise protein disorder), switching to a homologous protein, and selection of experimental methodology (e.g. choice of expression system) to prime for efficient progress through the structural proteomics pipeline. Here we discuss computational techniques in target selection and optimisation, with more detailed focus on tools developed within the Scottish Structural Proteomics Facility (SSPF); namely XANNpred, ParCrys, OB-Score (target selection) and TarO (target optimisation). TarO runs a large number of algorithms, searching for homologues and annotating the pool of possible alternative targets. This pool of putative homologues is presented in a ranked, tabulated format and results are also visualised as an automatically generated and annotated multiple sequence alignment. The target selection algorithms each predict the propensity of a selected protein target to progress through the experimental stages leading to diffracting crystals. This single predictor approach has advantages for target selection, when compared with an approach using two or more predictors that each predict for success at a single experimental stage. The tools described here helped SSPF achieve a high (21%) success rate in progressing cloned targets to diffraction-quality crystals.

8.
MOTIVATION: A method for recognizing the three-dimensional fold from the protein amino acid sequence based on a combination of hidden Markov models (HMMs) and secondary structure prediction was recently developed for proteins in the Mainly-Alpha structural class. Here, this methodology is extended to Mainly-Beta and Alpha-Beta class proteins. Compared to other fold recognition methods based on HMMs, this approach is novel in that only secondary structure information is used. Each HMM is trained from known secondary structure sequences of proteins having a similar fold. Secondary structure prediction is performed for the amino acid sequence of a query protein. The predicted fold of a query protein is the fold described by the model that best fits the predicted sequence. RESULTS: After model cross-validation, the success rate on 44 test proteins covering the three structural classes was found to be 59%. On seven fold predictions performed prior to the publication of experimental structure, the success rate was 71%. In conclusion, this approach manages to capture important information about the fold of a protein embedded in the length and arrangement of the predicted helices, strands and coils along the polypeptide chain. When a more extensive library of HMMs representing the universe of known structural families is available (work in progress), the program will allow rapid screening of genomic databases and sequence annotation when fold similarity is not detectable from the amino acid sequence. AVAILABILITY: FORESST web server at http://absalpha.dcrt.nih.gov:8008/ for the library of HMMs of structural families used in this paper. FORESST web server at http://www.tigr.org/ for a more extensive library of HMMs (work in progress). CONTACT: valedf@tigr.org; munson@helix.nih.gov; garnier@helix.nih.gov

9.
GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.

10.
In Archaea, splicing endonuclease (EndA) recognizes and cleaves precursor RNAs to remove introns. Currently, EndAs are classified into three families according to their subunit structures: homotetramer, homodimer, and heterotetramer. The crenarchaeal heterotetrameric EndAs can be further classified into two subfamilies based on the size of the structural subunit. Subfamily A possesses a structural subunit similar in size to the catalytic subunit, whereas subfamily B possesses a structural subunit significantly smaller than the catalytic subunit. Previously, we solved the crystal structure of an EndA from Pyrobaculum aerophilum. The endonuclease was classified into subfamily B, and the structure revealed that the enzyme lacks an N-terminal subdomain in the structural subunit. However, no structural information is available for crenarchaeal heterotetrameric EndAs that are predicted to belong to subfamily A. Here, we report the crystal structure of the EndA from Aeropyrum pernix, which is predicted to belong to subfamily A. The enzyme possesses the N-terminal subdomain in the structural subunit, revealing that the two subfamilies of heterotetrameric EndAs are structurally distinct. EndA from A. pernix also possesses an extra loop region that is characteristic of crenarchaeal EndAs. Our mutational study revealed that the conserved lysine residue in the loop is important for endonuclease activity. Furthermore, the sequence characteristics of the loops and the positions towards the substrate RNA according to a docking model prompted us to propose that crenarchaea-specific loops and an extra amino acid sequence at the catalytic loop of nanoarchaeal EndA are derived by independent convergent evolution and function for recognizing noncanonical bulge-helix-bulge motif RNAs as substrates.

11.
The MemGen structural classification of membrane proteins groups families of proteins by hydropathy profile alignment. Class ST[3] of the MemGen classification contains 32 families of transporter proteins including the IT superfamily. Transporters from 19 different families in class ST[3] were evaluated by the TopScreen experimental topology screening method to verify the structural classification by MemGen. TopScreen involves the determination of the cellular disposition of three sites in the polypeptide chain of the proteins which allows for discrimination between different topology models. For nearly all transporters at least one of the predicted localizations is different in the models produced by MemGen and predictor TMHMM. Comparison to the experimental data showed that in all cases the prediction by MemGen was correct. It is concluded that the structural model available for transporters of the [st324]ESS and [st326]2HCT families is also valid for the other families in class ST[3]. The core structure of the model consists of two homologous domains, each containing 5 transmembrane segments, which have an opposite orientation in the membrane. A reentrant loop is present in between the 4th and 5th segments in each domain. Nearly all of the identified and experimentally confirmed structural variations involve additions of transmembrane segments at the boundaries of the core model, at the N- and C-termini or in between the two domains. Most remarkable is a domain swap in two subfamilies of the [st312]NHAC family that results in an inverted orientation of the proteins in the membrane.

12.
Rotavirus (RV) diarrhoea causes a huge number of deaths in children less than 5 years of age. In spite of available vaccines, it has been difficult to combat RV because of the large number of antigenically distinct genotypes, high mutation rates, and the generation of reassortant viruses enabled by its segmented genome. RV is a eukaryotic virus that utilizes host cell machinery for its propagation. Since RV encodes only 12 proteins, post-translational modification (PTM) is an important mechanism for modifying them, which consequently alters their function. A single protein that exhibits different functions in different locations or in different subcellular sites is said to be 'moonlighting'. There is therefore a possibility that viral proteins moonlight in separate locations and at different times to exhibit diverse cellular effects. Based on the primary sequence, the putative behaviour of proteins in the cellular environment can be predicted, which helps to classify them into different functional families with a high reliability score. In this study, sites for phosphorylation, glycosylation and SUMOylation of the six RV structural proteins (VP1, VP2, VP3, VP4, VP6 and VP7) and five non-structural proteins (NSP1, NSP2, NSP3, NSP4 and NSP5), together with their functional families, were predicted. As NSP6 is a very small protein and is not required for virus growth and replication, it was not included in the study. Classification of RV proteins revealed multiple putative functions of each structural protein and a varied number of PTM sites, indicating that RV proteins may also moonlight depending on requirements during the viral life cycle. Targeting the crucial PTM sites on RV structural proteins may have implications for developing future anti-rotaviral strategies.

13.
14.
A structural database of 11 families of chains differing by a single amino acid substitution has been built. Another structural dataset of 5 families with identical sequences has been used for comparison. The RMSD computed after a global superimposition of the mutated protein on each native one is smaller than the RMSD calculated among proteins of identical sequences. The effect of the perturbation is very local, and not necessarily the highest at the position of the mutation. An RMSD between mutated and native proteins is computed over a 3-residue or a 7-residue window at each position. To separate the effects of structural fluctuations due to point mutations from other sources, pairwise RMSD values have been translated into P values, which are in turn included in a score called P-RANK. This score highlights small backbone distortions by comparing the RMSD between mutated and native positions to the RMSD at the same positions in the absence of a mutation. According to P-RANK, 38% of all mutations produce a significant effect on the displacement. When compared with a random distribution of RMSD at un-mutated positions, we show that, even if the RMSD is greater when the mutation is in loops than in regular secondary structure, the relative effect is more important for regular secondary structures and for buried positions. We confirm the absence of correlation between RMSD and the predicted variation of free energy of folding, but we find a small correlation between high RMSD and the error in the prediction of ΔΔG.
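
The windowed RMSD described in the abstract above can be sketched as follows. This is a minimal illustration that assumes one representative atom per residue (for example Cα), equal-length chains, and coordinates that have already been superimposed; whether the authors re-superimpose each window locally is not stated in the abstract, and the P-RANK scoring step is not reproduced here.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float, float]


def window_rmsd(native: Sequence[Coord], mutant: Sequence[Coord],
                window: int = 3) -> List[float]:
    """RMSD over a centred window at each position that has a full window.

    The first returned value corresponds to residue index window // 2.
    """
    if len(native) != len(mutant):
        raise ValueError("both chains must have the same number of residues")
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    # Squared displacement of each residue between the two structures.
    squared = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(native, mutant)
    ]
    return [
        math.sqrt(sum(squared[i - half:i + half + 1]) / window)
        for i in range(half, len(squared) - half)
    ]
```

Calling the function with `window=3` and `window=7` gives the two window sizes mentioned in the abstract; positions too close to either terminus to have a full window are simply omitted in this sketch.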

15.
A portion of the U.S. National Toxicology Program (NTP) Salmonella typhimurium mutagenicity database was analyzed by CASE, an artificial intelligence SAR system. CASE identified 13 structural determinants which, with a high probability (p ≤ 0.05), predicted the likelihood of mutagenicity of the 243 chemicals in the database (sensitivity = 0.989; specificity = 0.950) as well as of chemicals not included in the database. CASE also identified an additional set of structures which were highly predictive of mutagenic potency (sensitivity = 0.949; specificity = 1.00). Even though there is little overlap among the chemicals included in the NTP and Gene-Tox Salmonella databases, CASE found significant similarities between the structural determinants of the mutagenicity in the two databases, thereby validating the analyses and indicating a commonality in the structural basis of mutagenicity.

16.
The problem of rational target selection for protein structure determination in structural genomics projects on microbes is addressed. A flexible computational procedure is described that directly incorporates the whole body of annotation available in the PEDANT genome database into the sequence clustering and selection process in order to identify proteins that are likely to possess currently unknown structural domains. Filtering out gene products based on predicted structural features, such as known three-dimensional structures and transmembrane regions, allows one to reduce the complexity of neighbor relationships between sequences and all but eliminates the need for further partitioning of single-linkage clusters into disjoint protein groups corresponding to homologous families. The results of a large-scale computation experiment in which exemplary target selection for 32 prokaryotic genomes was conducted are presented.
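
The single-linkage clusters mentioned in the abstract above can be illustrated with a short union-find sketch in Python. It assumes that the pairs of sequences judged similar (for example, those exceeding some similarity threshold) have already been computed; the function name and the input format are hypothetical and do not reproduce the PEDANT-based procedure.

```python
from typing import Dict, Hashable, Iterable, List, Tuple


def single_linkage_clusters(
    items: Iterable[Hashable],
    linked_pairs: Iterable[Tuple[Hashable, Hashable]],
) -> List[List[Hashable]]:
    """Group items so that any two linked items share a cluster.

    Under single linkage, one link between two groups is enough to
    merge them, which is exactly the connected components of the
    similarity graph.
    """
    parent: Dict[Hashable, Hashable] = {item: item for item in items}

    def find(x: Hashable) -> Hashable:
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in linked_pairs:
        # Items that appear only in a pair are added on the fly.
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups: Dict[Hashable, List[Hashable]] = {}
    for item in parent:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(group) for group in groups.values())
```

Because a single link suffices to merge two groups, unrelated families can be chained together through multidomain proteins, which is why the abstract discusses filtering sequences before clustering to avoid further partitioning.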

17.
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

18.

19.

20.
A novel method is presented for joint prediction of alignment and common secondary structures of two RNA sequences. The joint consideration of common secondary structures and alignment is accomplished by structural alignment over a search space defined by the newly introduced motif called matched helical regions. The matched helical region formulation generalizes previously employed constraints for structural alignment and thereby better accommodates the structural variability within RNA families. A probabilistic model based on pseudo free energies obtained from precomputed base pairing and alignment probabilities is utilized for scoring structural alignments. Maximum a posteriori (MAP) common secondary structures, sequence alignment and joint posterior probabilities of base pairing are obtained from the model via a dynamic programming algorithm called PARTS. The advantage of the more general structural alignment of PARTS is seen in secondary structure predictions for the RNase P family. For this family, the PARTS MAP predictions of secondary structures and alignment perform significantly better than prior methods that utilize a more restrictive structural alignment model. For the tRNA and 5S rRNA families, the richer structural alignment model of PARTS does not offer a benefit and the method therefore performs comparably with existing alternatives. For all RNA families studied, the posterior probability estimates obtained from PARTS offer an improvement over posterior probability estimates from a single sequence prediction. When considering the base pairings predicted over a threshold value of confidence, the combination of sensitivity and positive predictive value is better for PARTS than for the single sequence prediction. PARTS source code is available for download under the GNU public license at http://rna.urmc.rochester.edu.
