Similar articles
1.
A substantial percentage of the putative protein-encoding open reading frames (ORFs) in bacterial genomes have no homolog of known function, and their function cannot be confidently assigned on the basis of sequence similarity. Methods not based on sequence similarity are needed and are being developed. One method, SVMProt (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi), predicts protein functional family irrespective of sequence similarity (Nucleic Acids Res. 2003;31:3692-3697). While it has been tested on a large number of proteins, its capability for non-homologous proteins has so far been evaluated for a relatively small number of proteins, and additional tests are needed to more fully assess SVMProt. In this work, 90 novel bacterial proteins (non-homologous to known proteins) are used to evaluate the capability of SVMProt. These proteins are such that none of their homologs are in the Swiss-Prot database, their functions are not clearly described in the literature, and they themselves and their homologs are not included in the training sets of SVMProt. They represent proteins whose function cannot be confidently predicted by sequence similarity methods at present. The predicted functional class of 76.7% of these proteins shows various levels of consistency with the literature-described function, compared to the overall accuracy of 87% for the SVMProt functional class assignment of 34,582 proteins that have at least one homolog of known function. Our study suggests that SVMProt is capable of assigning functional class for novel bacterial proteins at a level not too much lower than that of sequence alignment methods for homologous proteins.

2.
Lipid binding proteins play important roles in signaling, regulation, membrane trafficking, immune response, lipid metabolism, and transport. Because of their functional and sequence diversity, it is desirable to explore additional methods for predicting lipid binding proteins irrespective of sequence similarity. This work explores the use of support vector machines (SVMs) as such a method. SVM prediction systems are developed using 14,776 lipid binding and 133,441 nonlipid binding proteins and are evaluated by an independent set of 6,768 lipid binding and 64,761 nonlipid binding proteins. The computed prediction accuracy is 78.9, 79.5, 82.2, 79.5, 84.4, 76.6, 90.6, 79.0, and 89.9% for lipid degradation, lipid metabolism, lipid synthesis, lipid transport, lipid binding, lipopolysaccharide biosynthesis, lipoprotein, lipoyl, and all lipid binding proteins, respectively. The accuracy for the nonmember proteins of each class is 99.9, 99.2, 99.6, 99.8, 99.9, 99.8, 98.5, 99.9, and 97.0%, respectively. Comparable accuracies are obtained when homologous proteins are considered as one, or by using a different SVM kernel function. Our method predicts 86.8% of the 76 lipid binding proteins nonhomologous to any protein in the Swiss-Prot database and 89.0% of the 73 known lipid binding domains as lipid binding. These findings suggest the usefulness of SVMs for facilitating the prediction of lipid binding proteins. Our software can be accessed at the SVMProt server (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi).

3.
Han L, Cui J, Lin H, Ji Z, Cao Z, Li Y, Chen Y. Proteomics 2006;6(14):4023-4037
Protein sequence contains clues to its function. Functional prediction from sequence presents a challenge, particularly for proteins that have low or no sequence similarity to proteins of known function. Recently, machine learning methods have been explored for predicting the functional class of proteins from sequence-derived properties independent of sequence similarity, and these have shown promising potential for low- and non-homologous proteins. These methods can thus be explored as potential tools to complement alignment- and clustering-based methods for predicting protein function. This article reviews the strategies, current progress, and underlying difficulties in using machine learning methods for predicting the functional class of proteins. The relevant software and web-servers are described. The reported prediction performances in the application of these methods are also presented; these need to be interpreted with caution as they depend on such factors as the datasets used and the choice of parameters.

4.
In plant genomes, the function of a substantial percentage of the putative protein-coding open reading frames (ORFs) is unknown. These ORFs have no significant sequence similarity to known proteins, which complicates the task of functional study of these proteins. Efforts are being made to explore methods that are complementary to, or may be used in combination with, sequence alignment and clustering methods. A web-based protein functional class prediction software, SVMProt, has shown some capability for predicting functional class of distantly related proteins. Here the usefulness of SVMProt for functional study of novel plant proteins is evaluated. To test SVMProt, 49 plant proteins (without a sequence homolog in the Swiss-Prot protein database, not in the SVMProt training set, and with functional indications provided in the literature) were selected from a comprehensive search of MEDLINE abstracts and Swiss-Prot databases in 1999-2004. These represent unique proteins the function of which, at present, cannot be confidently predicted by sequence alignment and clustering methods. The predicted functional class of 31 proteins was consistent, and that of four other proteins was weakly consistent, with published functions. Overall, the functional class of 71.4% of these proteins was consistent, or weakly consistent, with functional indications described in the literature. SVMProt shows a certain level of ability to provide useful hints about the functions of novel plant proteins with no similarity to known proteins.

5.
Historically, biotechnology has missed up to 99% of existing microbial resources by using traditional screening techniques. Strategies of directly cloning 'environmental DNA' comprising the genetic blueprints of entire microbial consortia (the so-called 'metagenome') provide molecular sequence space that, along with ingenious in vitro evolution technologies, will act synergistically to bring a maximum of available sequence space into biocatalytic application.

6.
Enzymatic substrate promiscuity is more ubiquitous than previously thought, with significant consequences for understanding metabolism and its application to biocatalysis. This realization has given rise to the need for efficient characterization of enzyme promiscuity. Enzyme promiscuity is currently characterized with a limited number of human-selected compounds that may not be representative of the enzyme's versatility. While testing large numbers of compounds may be impractical, computational approaches can exploit existing data to determine the most informative substrates to test next, thereby more thoroughly exploring an enzyme's versatility. To demonstrate this, we used existing studies and tested compounds for four different enzymes, developed support vector machine (SVM) models using these datasets, and selected additional compounds for experiments using an active learning approach. SVMs trained on a chemically diverse set of compounds were discovered to achieve maximum accuracies of ~80% using ~33% fewer compounds than datasets based on all compounds tested in existing studies. Active learning-selected compounds for testing resolved apparent conflicts in the existing training data, while adding diversity to the dataset. The application of these algorithms to wide arrays of metabolic enzymes would result in a library of SVMs that can predict high-probability promiscuous enzymatic reactions and could prove a valuable resource for the design of novel metabolic pathways.

7.
Figaro: a novel statistical method for vector sequence removal
MOTIVATION: Sequences produced by automated Sanger sequencing machines frequently contain fragments of the cloning vector on their ends. Software tools currently available for identifying and removing the vector sequence require knowledge of the vector sequence, specific splice sites and any adapter sequences used in the experiment, information that is often omitted from public databases. Furthermore, the clipping coordinates themselves are often missing or incorrectly reported. As an example, within the approximately 1.24 billion shotgun sequences deposited in the NCBI Trace Archive, as many as approximately 735 million (approximately 60%) lack vector clipping information. Correct clipping information is essential to scientists attempting to validate, improve and even finish the increasingly large number of genomes released at a 'draft' quality level. RESULTS: We present here Figaro, a novel software tool for identifying and removing the vector from raw sequence data without prior knowledge of the vector sequence. The vector sequence is automatically inferred by analyzing the frequency of occurrence of short oligo-nucleotides using Poisson statistics. We show that Figaro achieves 99.98% sensitivity when tested on approximately 1.5 million shotgun reads from Drosophila pseudoobscura. We further explore the impact of accurate vector trimming on the quality of whole-genome assemblies by re-assembling two bacterial genomes from shotgun sequences deposited in the Trace Archive. Designed as a module in large computational pipelines, Figaro is fast, lightweight and flexible. AVAILABILITY: Figaro is released under an open-source license through the AMOS package (http://amos.sourceforge.net/Figaro).

8.
9.

Background  

Modelling the interaction between potentially antigenic peptides and Major Histocompatibility Complex (MHC) molecules is a key step in identifying potential T-cell epitopes. For Class II MHC alleles, the binding groove is open at both ends, causing ambiguity in the positional alignment between the groove and peptide, as well as creating uncertainty as to what parts of the peptide interact with the MHC. Moreover, the antigenic peptides have variable lengths, making naive modelling methods difficult to apply. This paper introduces a kernel method that can handle variable length peptides effectively by quantifying similarities between peptide sequences and integrating these into the kernel.

10.

Background  

In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences.
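
A Monte-Carlo Z-score of the kind mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the procedure used by CluSTr or Protein World: the toy scoring scheme (match, mismatch, linear gap), the number of shuffles, and the function names `smith_waterman_score` and `z_score` are all hypothetical choices made for the example.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a toy scheme with a linear gap penalty."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def z_score(a, b, n_shuffles=200, seed=0):
    """Z = (observed - mean of shuffled scores) / std of shuffled scores.

    The second sequence is shuffled so that its composition is preserved
    while any positional signal is destroyed (an assumed null model).
    """
    rng = random.Random(seed)
    observed = smith_waterman_score(a, b)
    residues = list(b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)
        shuffled_scores.append(smith_waterman_score(a, "".join(residues)))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.pstdev(shuffled_scores)
    if sd == 0:
        return 0.0  # degenerate case: every shuffle scored the same
    return (observed - mean) / sd
```

Unlike an e-value, this statistic depends on the specific pair being compared, because the null distribution is rebuilt from shuffles of one of the two sequences; the price is many extra alignments per comparison.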

11.
12.
An important task in functional genomics is to cluster homologous proteins, which may share common functions. Annotating proteins of unknown function by transferring annotations from their homologues of known annotations is one of the most efficient ways to predict protein function. In this paper, we use a modularity-based method called CD for grouping together homologous proteins. The method employs a global heuristic search strategy to find the partitioning of the weighted adjacency graph with the largest modularity. The weighted adjacency graph is constructed by the sigmoidal transformation of all pairwise sequence similarities between all protein sequences in a given dataset. The method has been extensively tested on several subsets from the superfamily level of the SCOP (Structural Classification of Proteins) database, where some homologous proteins have very low sequence similarity. Compared with the widely used method MCL, we observe that the number of clusters obtained by CD is closer to the number of superfamilies in the dataset, the value of the F-measure given by CD is 10% better than MCL on average, and CD is more tolerant to noise in the sequence similarity. The experimental results indicate that CD is ideally suited for clustering homologous proteins when sequence similarity is low.
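
The two ingredients named in this abstract, a sigmoidal transformation of similarities into edge weights and the modularity of a partition, can be sketched as below. This is a minimal illustration, not the CD method itself: the `midpoint` and `steepness` parameters are hypothetical, the modularity formula is the standard weighted form Q = (1/2m) Σ [A_ij − k_i k_j / 2m] over same-cluster pairs, and the global heuristic search over partitions is not shown.

```python
import math


def sigmoid_weight(similarity, midpoint=50.0, steepness=0.1):
    """Map a raw similarity score to an edge weight in (0, 1).

    midpoint and steepness are assumed values chosen for illustration.
    """
    return 1.0 / (1.0 + math.exp(-steepness * (similarity - midpoint)))


def modularity(weights, labels):
    """Weighted modularity of a partition.

    weights: symmetric n x n list of lists with a zero diagonal.
    labels:  cluster identifier for each of the n nodes.
    """
    n = len(weights)
    degree = [sum(row) for row in weights]
    two_m = sum(degree)  # twice the total edge weight
    if two_m == 0:
        return 0.0
    q = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                q += weights[i][j] - degree[i] * degree[j] / two_m
    return q / two_m
```

For example, a graph made of two disconnected pairs of nodes scores 0.5 when each pair is its own cluster and 0.0 when all four nodes are placed in one cluster, which is the kind of difference a modularity-maximizing search exploits.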

13.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.

14.
A method is described for estimating the distribution and hence testing the statistical significance of sequence similarity scores obtained during a data-bank search. Maximum-likelihood is used to fit a model to the scores, avoiding any costly simulation of random sequences. The method is applied in detail to the Smith-Waterman algorithm when gaps are allowed, and is shown to give results very similar to those obtained by simulation.

15.

Background  

The inference of homology between proteins is a key problem in molecular biology. The current best approaches only identify ~50% of homologies (with a false positive rate set at 1/1000).

16.
17.
The rapid increase in genomic sequences provides new opportunities for comparative genomics. In this report, we describe a novel family of repeat sequences that is present in Bacteria and Archaea but not in Eukarya. The repeat loci typically consisted of repetitive stretches of nucleotides with a length of 25 to 37 bp alternating with nonrepetitive DNA spacers of approximately the same size as the repeats. The nucleotide sequences and the size of the repeats were highly conserved within a species, but between species the sequences showed no similarity. Due to their characteristic structure, we have designated this family of repeat loci as SPacers Interspersed Direct Repeats (SPIDR). The SPIDR loci were identified in more than forty different prokaryotic species. Individual species such as Mycobacterium tuberculosis contain one SPIDR locus, while other species such as Methanococcus jannaschii contain up to 20 different loci. The number of repeats in a locus varies greatly, from two repeats to several dozen. The SPIDR loci were flanked by a common 300-500-bp leader sequence, which appeared to be conserved within a species but not between species. The SPIDR locus of M. tuberculosis is extensively used for strain typing. The finding of SPIDR loci in other prokaryotes, including the pathogens Salmonella, Campylobacter, and Pasteurella, may extend this surveillance to other species.

18.
The crystal structure of a pepstatin-insensitive carboxyl proteinase from Pseudomonas sp. 101 (PSCP) has been solved by single-wavelength anomalous diffraction using the absorption peak of bromide anions. Structures of the uninhibited enzyme and of complexes with an inhibitor that was either covalently or noncovalently bound were refined at 1.0-1.4 Å resolution. The structure of PSCP comprises a single compact domain with a diameter of approximately 55 Å, consisting of a seven-stranded parallel beta-sheet flanked on both sides by a number of helices. The fold of PSCP is a superset of the subtilisin fold, and the covalently bound inhibitor is linked to the enzyme through a serine residue. Thus, the structure of PSCP defines a novel family of serine-carboxyl proteinases (defined as MEROPS S53) with a unique catalytic triad consisting of Glu 80, Asp 84 and Ser 287.

19.
Raffinose and stachyose are ubiquitous galactosyl-sucrose oligosaccharides in the plant kingdom which play major roles, second only to sucrose, in photoassimilate translocation and seed carbohydrate storage. These sugars are initially metabolised by alpha-galactosidases (alpha-gal). We report the cloning and functional expression of the first genes, CmAGA1 and CmAGA2, encoding plant alpha-gals with alkaline pH optima from melon fruit (Cucumis melo L.), a raffinose and stachyose translocating species. The alkaline alpha-gal genes show very high sequence homology with a family of undefined 'seed imbibition proteins' (SIPs) which are present in a wide range of plant families. In order to confirm the function of SIP proteins, a representative SIP gene, from tomato, was expressed and shown to have alkaline alpha-gal activity. Phylogenetic analysis based on amino acid sequences shows that the family of alkaline alpha-gals shares little homology with the known prokaryotic and eukaryotic alpha-gals of glycosyl hydrolase families 27 and 36, with the exception of two cross-family conserved sequences containing aspartates which probably function in the catalytic step. This previously uncharacterised, plant-specific alpha-gal family of glycosyl hydrolases, with optimal activity at neutral-alkaline pH, likely functions in key processes of galactosyl-oligosaccharide metabolism, such as during seed germination and translocation of RFO photosynthate.

20.
A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. A typical invariant function of this kind is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach, an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented, which already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems such as: assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project.
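
The 20 x 20 pair histogram described in this abstract can be sketched as follows. The abstract does not say which pairs are counted, so the example assumes adjacent residue pairs; the function name `pair_histogram` and the handling of non-standard symbols are likewise assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pair_histogram(sequence):
    """Return a 20 x 20 matrix counting adjacent amino acid pairs.

    Entry [i][j] is the number of times residue i is immediately
    followed by residue j (adjacency is an assumption of this sketch).
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    counts = [[0] * 20 for _ in range(20)]
    for first, second in zip(sequence, sequence[1:]):
        # Skip pairs that contain a non-standard symbol such as X or B.
        if first in index and second in index:
            counts[index[first]][index[second]] += 1
    return counts
```

Because every sequence maps to a matrix of the same fixed size regardless of its length, sequences can be compared as vectors without first aligning them, which is the property the abstract relies on.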

