Similar articles
20 similar articles found.
1.

Background  

We propose a method for deriving enzymatic signatures from short read metagenomic data of unknown species. Each short read is converted to six pseudo-peptide candidates, which are then searched for occurrences of Specific Peptides (SPs). SPs are peptides that are indicative of enzymatic function as defined by the Enzyme Commission (EC) nomenclature. The number of SP hits on an ensemble of short reads is counted and then converted to estimates of numbers of enzymatic genes associated with different EC categories in the studied metagenome. Relative amounts of different EC categories define the enzymatic spectrum, without the need to perform genomic assemblies of short reads.
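The read-to-hit step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the six pseudo-peptides are the six reading-frame translations under the standard genetic code, and the function names (`six_frames`, `count_sp_hits`) and the SP-to-EC mapping are hypothetical. The conversion of hit counts into estimated gene numbers is not specified in the abstract and is left out.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: no
# alternative translation tables are needed for the reads in question).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(read):
    """Six pseudo-peptides: three frames on each strand."""
    read = read.upper()
    strands = (read, reverse_complement(read))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


def count_sp_hits(reads, sp_to_ec):
    """Count SP occurrences per EC label (hypothetical mapping SP -> EC)."""
    counts = Counter()
    for read in reads:
        for peptide in six_frames(read):
            for sp, ec in sp_to_ec.items():
                if sp in peptide:  # one hit per SP per frame
                    counts[ec] += 1
    return counts
```

A naive substring scan is enough for a sketch; with a large SP set, a multi-pattern matcher would be the more practical choice.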

2.
Predicting the function of a protein from its sequence is a long-standing goal of bioinformatic research. While sequence similarity is the most popular tool used for this purpose, sequence motifs may also serve this goal. Here we develop a motif-based method consisting of applying an unsupervised motif extraction algorithm (MEX) to all enzyme sequences, and filtering the results by the four-level classification hierarchy of the Enzyme Commission (EC). The resulting motifs serve as specific peptides (SPs), appearing on single branches of the EC. In contrast to previous motif-based methods, the new method does not require any preprocessing by multiple sequence alignment, nor does it rely on over-representation of motifs within EC branches. The SPs obtained comprise on average 8.4 ± 4.5 amino acids, and specify the functions of 93% of all enzymes, which is much higher than the coverage of 63% provided by ProSite motifs. The SP classification thus compares favorably with previous function annotation methods and successfully demonstrates an added value in extreme cases where sequence similarity fails. Interestingly, SPs cover most of the annotated active and binding site amino acids, and occur in active-site neighboring 3-D pockets in a highly statistically significant manner. The latter are assumed to have strong biological relevance to the activity of the enzyme. Further filtering of SPs by biological functional annotations results in reduced small subsets of SPs that possess very large enzyme coverage. Overall, SPs both form a very useful tool for enzyme functional classification and bear responsibility for the catalytic biological function carried out by enzymes.
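The filtering step, keeping only motifs that occur on a single EC branch, can be illustrated with a short sketch. It assumes that "single branch" means all enzymes containing the motif share a common EC prefix at some level of the four-level hierarchy; the motif list stands in for the output of MEX, which is not reproduced here, and the function names are hypothetical.

```python
def common_ec_prefix(ec_numbers):
    """Longest EC prefix shared by all given EC numbers ('' if none)."""
    levels = [ec.split(".") for ec in ec_numbers]
    prefix = []
    for parts in zip(*levels):
        if len(set(parts)) != 1:
            break
        prefix.append(parts[0])
    return ".".join(prefix)


def specific_peptides(motifs, enzymes):
    """Map each motif to its EC branch, dropping motifs that span branches.

    `enzymes` is a list of (sequence, ec_number) pairs; `motifs` is any
    iterable of candidate peptides (assumed to come from motif extraction).
    """
    sps = {}
    for motif in motifs:
        ecs = [ec for seq, ec in enzymes if motif in seq]
        if not ecs:
            continue  # motif never observed on an enzyme
        branch = common_ec_prefix(ecs)
        if branch:  # shared at level 1 or deeper
            sps[motif] = branch
    return sps
```

For example, a motif found only on enzymes labeled 2.7.1.1 and 2.7.1.2 would be assigned to branch 2.7.1, while a motif found on both a class 2 and a class 3 enzyme would be discarded.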

3.

Background  

Bacillus subtilis glucokinase (GlcK) (GenBank NP_390365) is an ATP-dependent kinase that phosphorylates glucose to glucose 6-phosphate. The GlcK protein has very low sequence identity (13.7%) to the Escherichia coli glucokinase (Glk) (GenBank P46880) and some other glucokinases (EC 2.7.1.2), yet glucose is its only substrate. Our lab has previously isolated and characterized the glcK gene.

4.

Background  

Understanding protein function from its structure is a challenging problem. Sequence based approaches for finding homology have broad use for annotation of both structure and function. 3D structural information on protein domains and their interactions provides a view of structure-function relationships that complements sequence information. We have developed a web site and an API of web services that enable users to submit protein structures and identify statistically significant neighbors and the underlying structural environments that produce each match, using a suite of sequence and structure analysis tools. To do this, we have integrated S-BLEST, PSI-BLAST and HMMer based superfamily predictions to give a unique integrated view of the prediction of SCOP superfamilies, EC numbers, and GO terms, as well as identification of the protein structural environments that are associated with each prediction. Additionally, we have extended UCSF Chimera and PyMOL to support our web services, so that users can characterize their own proteins of interest.

5.

Background  

Detection of DNA-binding sites in proteins is of enormous interest for technologies targeting gene regulation and manipulation. We have previously shown that a residue and its sequence neighbor information can be used to predict DNA-binding candidates in a protein sequence. This sequence-based prediction method is applicable even if no sequence homology with a previously known DNA-binding protein is observed. Here we implement a neural network based algorithm to utilize evolutionary information of amino acid sequences in terms of their position specific scoring matrices (PSSMs) for a better prediction of DNA-binding sites.

6.

Background  

Protein evolution and protein classification are usually inferred by comparing protein cores in their conserved aligned parts. Structurally aligned protein regions are separated by less conserved loop regions, where sequence and structure locally deviate from each other and do not superimpose well.

7.

Background

Individuals with serrated polyps (SP) are at higher risk for synchronous colorectal advanced neoplasms (AN) and cancers. However, it remains unclear whether there is a unique involvement of the serrated pathway and/or the classical adenoma-carcinoma sequence in this setting.

Methods

Colorectal ANs, which include tubular adenomas ≥10 mm, adenomas with villous histology, high-grade intraepithelial neoplasms, and cancers, were collected retrospectively. The groups included ANs with (AN+SP) or without (AN-only) coexisting SPs. Clinicopathological findings were compared between groups. BRAF and KRAS mutations in ANs and SPs, and methylation levels at long interspersed element-1 (LINE-1) in adjacent mucosa were determined by pyrosequencing.

Results

Seventy-five ANs from 40 patients in the AN+SP group, and 179 ANs from 119 patients in the AN-only group were analyzed. There were no significant differences in clinicopathological findings between the two groups, except that intraepithelial neoplasia in the AN+SP group was more likely to be located in the right colon (P = 0.018). BRAF mutations were significantly more frequent in the AN+SP group (P = 0.003), while KRAS mutations showed no significant differences between groups (P = 0.142). The majority of high-grade intraepithelial neoplasms in both groups showed a contiguous component of conventional adenoma. Individuals with large and right-sided SPs had significantly more conventional adenomas compared to those without such SPs (P = 0.027 and P = 0.031, respectively). Adjacent mucosa from individuals with multiple and large SPs showed significantly lower methylation levels at LINE-1 compared to individuals without such associated SPs (P = 0.049 and P = 0.015, respectively).

Conclusion

Our data suggest that both the adenoma-carcinoma sequence and the serrated pathway are operational in individuals with coexisting ANs and SPs. The reduced methylation levels at LINE-1 in the background mucosa suggest the possibility of an underlying ‘field defect’.

8.

Background  

We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) or heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment.

9.
Meroz Y, Horn D. Proteins 2008;72(2):606-612
It has recently been shown (Kunik et al., PLOS Comput Biol 2007;3(8):e167) that the occurrence of specific peptides (SPs) on sequences of enzymes allows for accurate EC classification of enzymes. We inquire whether these SPs play important roles in bringing about the enzymatic function. This is assessed by cross-checking the occurrence of SPs on enzymes with Swiss-Prot annotations and PDB spatial structures of enzymes. Analyzing the coverage of functional annotations of enzymes, we demonstrate that SPs contain major fractions of all annotated features. This result is statistically highly significant and associates over 10% of all SPs with important biological markers. Concentrating on DNA binding regions, relevant to LexA repressor enzymes, we find interesting coverage patterns. Moreover, for the same data, we demonstrate that SPs allow for subclassification of the relevant bacteria into phylogenetic classes. An analysis of mutagen annotations on SPs appearing on all enzymes leads to the conclusion that mutations on SPs tend to damage the enzymatic function much more than expected from a background model, hence SPs are of high importance to enzymatic functions. SPs that lie in 3D pockets shared by active and binding sites are shown to be significantly enriched in glycine, leading to the hypothesis that they are responsible for conformational plasticity. Finally, we show that SPs can partially resolve outstanding difficult problems of convergent evolution by correctly representing enzyme functions in spite of remote homologies in sequence and in structure.

10.

Background  

The relationship between divergence of amino-acid sequence and divergence of function among homologous proteins is complex. The assumption that homologs share function – the basis of transfer of annotations in databases – must therefore be regarded with caution. Here, we present a quantitative study of sequence and function divergence, based on the Gene Ontology classification of function. We determined the relationship between sequence divergence and function divergence in 6828 protein families from the PFAM database. Within families there is a broad range of sequence similarity from very closely related proteins – for instance, orthologs in different mammals – to very distantly-related proteins at the limit of reliable recognition of homology.

11.

Background  

The classification of protein domains in the CATH resource is primarily based on structural comparisons, sequence similarity and manual analysis. One of the main bottlenecks in the processing of new entries is the evaluation of 'borderline' cases by human curators with reference to the literature, and better tools for helping both expert and non-expert users quickly identify relevant functional information from text are urgently needed. A text based method for protein classification is presented, which complements the existing sequence and structure-based approaches, especially in cases exhibiting low similarity to existing members and requiring manual intervention. The method is based on the assumption that textual similarity between sets of documents relating to proteins reflects biological function similarities and can be exploited to make classification decisions.

12.

Background  

Fibronectin-binding protein A (FnBPA) mediates adhesion of Staphylococcus aureus to fibronectin, fibrinogen and elastin. We previously reported that S. aureus strain P1 encodes an FnBPA protein where the fibrinogen/elastin-binding domain (A domain) is substantially divergent in amino acid sequence from the archetypal FnBPA of S. aureus NCTC8325, and that these variations created differences in antigenicity. In this study strains from multilocus sequence types (MLST) that spanned the genetic diversity of S. aureus were examined to determine the extent of FnBPA A domain variation within the S. aureus population and its effect on ligand binding and immuno-crossreactivity.

13.

Background

Type III secretion systems (T3SSs) are central to pathogenesis and specifically deliver their secreted substrates (type III secreted proteins, T3SPs) into host cells. Since T3SPs play a crucial role in pathogen-host interactions, identifying them is crucial to our understanding of the pathogenic mechanisms of T3SSs. This study reports a novel and effective method for identifying the distinctive residues of T3SPs, i.e. those conserved differently from other SPs, for T3SP prediction. Moreover, the importance of several sequence features was evaluated, and a promising prediction model was constructed.

Results

Based on the conservation profiles constructed by a position-specific scoring matrix (PSSM), 52 distinctive residues were identified. To our knowledge, this is the first attempt to identify the distinctive residues of T3SPs. The 52 distinctive residues include all of the first 30 amino acid residues, which is consistent with previous studies reporting that the secretion signal generally occurs within the first 30 residue positions. However, the remaining 22 positions, spanning residues 30–100, were also shown by our method to contain important signal information for T3SP secretion, because the translocation of many effectors also depends on the chaperone-binding residues that follow the secretion signal. For further feature optimisation and compression, permutation importance analysis was conducted to select 62 optimal sequence features. A prediction model across 16 species was developed using random forest to classify T3SPs and non-T3SPs, with a high area under the receiver operating characteristic curve (0.93) in 10-fold cross validation and an accuracy of 94.29% on the test set. Moreover, when evaluated on a common independent dataset, our method outperforms all others published to date. Finally, novel, experimentally confirmed T3 effectors were used to further demonstrate the model’s correct application. The model and all data used in this paper are freely available at http://cic.scu.edu.cn/bioinformatics/T3SPs.zip.

14.

Background  

Protein fold recognition is a key step in protein three-dimensional (3D) structure discovery. There are multiple fold discriminatory data sources which use physicochemical and structural properties as well as further data sources derived from local sequence alignments. This raises the issue of finding the most efficient method for combining these different informative data sources and exploring their relative significance for protein fold classification. Kernel methods have been extensively used for biological data analysis. They can incorporate separate fold discriminatory features into kernel matrices which encode the similarity between samples in their respective data sources.

15.

Background  

Despite the current availability of several hundreds of thousands of amino acid sequences, more than 36% of the enzyme activities (EC numbers) defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) are not associated with any amino acid sequence in major public databases. This wide gap separating knowledge of biochemical function and sequence information is found for nearly all classes of enzymes. Thus, there is an urgent need to explore these sequence-less EC numbers, in order to progressively close this gap.

16.

Background  

Comparative genomics methods such as phylogenetic profiling can mine powerful inferences from inherently noisy biological data sets. We introduce Sites Inferred by Metabolic Background Assertion Labeling (SIMBAL), a method that applies the Partial Phylogenetic Profiling (PPP) approach locally within a protein sequence to discover short sequence signatures associated with functional sites. The approach is based on the basic scoring mechanism employed by PPP, namely the use of binomial distribution statistics to optimize sequence similarity cutoffs during searches of partitioned training sets.
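One way to picture the binomial scoring of similarity cutoffs is the sketch below. It is an assumption-laden illustration rather than the SIMBAL or PPP implementation: it supposes that search hits are ranked by similarity, that each hit is labeled as belonging to the positive partition or not, and that the best cutoff is the depth whose count of positives is least likely under a binomial background. The names `binom_upper_tail` and `best_cutoff` are hypothetical.

```python
from math import comb


def binom_upper_tail(successes, trials, p):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, i) * p**i * (1 - p) ** (trials - i)
        for i in range(successes, trials + 1)
    )


def best_cutoff(ranked_labels, background_rate):
    """Scan cutoff depths over a ranked hit list.

    `ranked_labels` holds True for hits from the positive partition, ordered
    from most to least similar. `background_rate` is the fraction of positives
    expected by chance (assumed known). Returns (p_value, depth) for the depth
    with the smallest binomial tail probability.
    """
    best = (1.0, 0)
    positives = 0
    for depth, is_positive in enumerate(ranked_labels, start=1):
        positives += is_positive
        p_value = binom_upper_tail(positives, depth, background_rate)
        if p_value < best[0]:
            best = (p_value, depth)
    return best
```

With three positives followed by two negatives and a background rate of 0.2, the scan selects depth 3, where the tail probability is 0.2 cubed.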

17.

Background  

The reliable prediction of protein tertiary structure from the amino acid sequence remains challenging even for small proteins. We have developed an all-atom free-energy protein forcefield (PFF01) that we could use to fold several small proteins from completely extended conformations. Because the computational cost of de-novo folding studies rises steeply with system size, this approach is unsuitable for structure prediction purposes. We therefore investigate here a low-cost free-energy relaxation protocol for protein structure prediction that combines heuristic methods for model generation with all-atom free-energy relaxation in PFF01.

18.

Background  

The identification and study of proteins from metagenomic datasets can shed light on the roles and interactions of the source organisms in their communities. However, metagenomic datasets contain organisms that differ in GC composition, codon usage bias and other properties, and consequently gene identification is challenging. The vast amount of sequence data also requires faster protein family classification tools.

19.

Background  

Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage.

20.

Background  

We wished to compare two databases based on sequence similarity: one that aims to be comprehensive in its coverage of known sequences, and one that specialises in a relatively small subset of known sequences. One of the motivations behind this study was quality control. Pfam is a comprehensive collection of alignments and hidden Markov models representing families of proteins and domains. MEROPS is a catalogue and classification of enzymes with proteolytic activity (peptidases or proteases). These secondary databases are used by researchers worldwide, yet their contents are not peer reviewed. Therefore, we hoped that a systematic comparison of the contents of Pfam and MEROPS would highlight missing members and false-positives leading to improvements in quality of both databases. An additional reason for carrying out this study was to explore the extent of consensus in the definition of a protein family.
