Similar articles
 20 similar articles found (search time: 8 ms)
1.
The annotation of protein function at genomic scale is essential for day-to-day work in biology and for any systematic approach to the modeling of biological systems. Currently, functional annotation is essentially based on the expansion of the relatively small number of experimentally determined functions to large collections of proteins. The task of systematic annotation faces formidable practical problems related to the accuracy of the input experimental information, the reliability of current systems for transferring information between related sequences, and the reproducibility of the links between database information and the original experiments reported in publications. These technical difficulties merely lie on the surface of the deeper problem of the evolution of protein function in the context of protein sequences and structures. Given the mixture of technical and scientific challenges, it is not surprising that errors are introduced, and expanded, in database annotations. In this situation, a more realistic option is the development of a reliability index for database annotations, instead of depending exclusively on efforts to correct databases. Several groups have attempted to compare the database annotations of similar proteins, which constitutes the first steps toward the calibration of the relationship between sequence and annotation space.

2.
A procedure that automatically provides an evaluation of the diagnostic ability of a protein sequence functional pattern is described. The procedure relies on the identification of the closest definable set in terms of a (protein sequence) database functional annotation to the set of database instances containing a given pattern. Assuming annotation correctness and completeness in the protein sequence database, the degree of statistical association between these sets provides an appropriate measure of the diagnostic ability of the pattern. An experimental implementation of the procedure, using the NBRF/PIR protein database, has been applied to a diverse collection of published sequence patterns. Results obtained reveal that frequently it is not possible to define (in NBRF/PIR database terminology) the set of database instances containing a given pattern, suggesting either lack of pattern diagnostic ability or protein database annotation incompleteness and/or inconsistencies. Received on November 30, 1989; accepted on July 20, 1990
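
Illustrative note: the abstract above does not name the association measure it uses, so the following is only a minimal sketch of the general idea, with the Jaccard index swapped in as a stand-in. The function names, entry identifiers, and annotation terms are all hypothetical and do not come from the NBRF/PIR database.

```python
def jaccard(a, b):
    """Jaccard index of two collections of entry IDs (stand-in association measure)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def closest_annotation(pattern_hits, annotation_sets):
    """Return (term, score) for the annotation set that best matches pattern_hits."""
    if not annotation_sets:
        raise ValueError("annotation_sets must not be empty")
    scored = ((term, jaccard(pattern_hits, entries))
              for term, entries in annotation_sets.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: entries matching a pattern, and entries per annotation term.
hits = {"P1", "P2", "P3", "P4"}
annotations = {
    "kinase": {"P1", "P2", "P3", "P5"},
    "protease": {"P4", "P6"},
}

term, score = closest_annotation(hits, annotations)
print(term, score)
```

A low best score in such a sketch would correspond to the situation the abstract reports: no annotation-defined set matches the pattern-containing set well.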

3.
4.
Teng S, Luo H, Wang L. Amino Acids, 2012, 43(1): 447-455
Protein sumoylation is a post-translational modification that plays an important role in a wide range of cellular processes. Small ubiquitin-related modifier (SUMO) can be covalently and reversibly conjugated to the sumoylation sites of target proteins, many of which are implicated in various human genetic disorders. The accurate prediction of protein sumoylation sites may help biomedical researchers to design their experiments and understand the molecular mechanism of protein sumoylation. In this study, a new machine learning approach has been developed for predicting sumoylation sites from protein sequence information. Random forests (RFs) and support vector machines (SVMs) were trained with the data collected from the literature. Domain-specific knowledge in terms of relevant biological features was used for input vector encoding. It was shown that RF classifier performance was affected by the sequence context of sumoylation sites, and 20 residues with the core motif ΨKXE in the middle appeared to provide enough context information for sumoylation site prediction. The RF classifiers were also found to outperform SVM models for predicting protein sumoylation sites from sequence features. The results suggest that the machine learning approach gives rise to more accurate prediction of protein sumoylation sites than the other existing methods. The accurate classifiers have been used to develop a new web server, called seeSUMO (http://bioinfo.ggc.org/seesumo/), for sequence-based prediction of protein sumoylation sites.

5.
6.

Background  

Conserved protein sequence motifs are short stretches of amino acid sequence patterns that potentially encode the function of proteins. Several sequence pattern searching algorithms and programs exist for identifying candidate protein motifs at the whole genome level. However, a much needed and important task is to determine the functions of the newly identified protein motifs. The Gene Ontology (GO) project is an endeavor to annotate the function of genes or protein sequences with terms from a dynamic, controlled vocabulary and these annotations serve well as a knowledge base.

7.
There are two main reasons to try to predict an enzyme's function from its sequence. The first is to identify the components and thus the functional capabilities of an organism, the second is to create enzymes with specific properties. Genomics, expression analysis, proteomics and metabonomics are largely directed towards understanding how information flows from DNA sequence to protein functions within an organism. This review focuses on information flow in the opposite direction: the applicability of what is being learned from natural enzymes to improve methods for catalyst design.

8.
Prediction of protein function from protein sequence and structure   (total citations: 1; self-citations: 0; citations by others: 1)
The sequence of a genome contains the plans of the possible life of an organism, but implementation of genetic information depends on the functions of the proteins and nucleic acids that it encodes. Many individual proteins of known sequence and structure present challenges to the understanding of their function. In particular, a number of genes responsible for diseases have been identified but their specific functions are unknown. Whole-genome sequencing projects are a major source of proteins of unknown function. Annotation of a genome involves assignment of functions to gene products, in most cases on the basis of amino-acid sequence alone. 3D structure can aid the assignment of function, motivating the challenge of structural genomics projects to make structural information available for novel uncharacterized proteins. Structure-based identification of homologues often succeeds where sequence-alone-based methods fail, because in many cases evolution retains the folding pattern long after sequence similarity becomes undetectable. Nevertheless, prediction of protein function from sequence and structure is a difficult problem, because homologous proteins often have different functions. Many methods of function prediction rely on identifying similarity in sequence and/or structure between a protein of unknown function and one or more well-understood proteins. Alternative methods include inferring conservation patterns in members of a functionally uncharacterized family for which many sequences and structures are known. However, these inferences are tenuous. Such methods provide reasonable guesses at function, but are far from foolproof. It is therefore fortunate that the development of whole-organism approaches and comparative genomics permits other approaches to function prediction when the data are available. 
These include the use of protein-protein interaction patterns, and correlations between occurrences of related proteins in different organisms, as indicators of functional properties. Even if it is possible to ascribe a particular function to a gene product, the protein may have multiple functions. A fundamental problem is that function is in many cases an ill-defined concept. In this article we review the state of the art in function prediction and describe some of the underlying difficulties and successes.

9.
As genome sequences and protein structures are deciphered, we wish to predict their corresponding functions. Many functions cannot be told from the sequence, however, although there has been progress in this quest for an impossible Grail. Furthermore, a structure and its corresponding sequence become most interesting when one knows the function. Inductive reasoning, based on the integration of biological and sequence knowledge, should enable sequence and functional data to be combined in a productive way.

10.
Secretins are a large family of proteins associated with membrane translocation of macromolecular complexes, and a subset of this family, termed PilQ proteins, is required for type IV pilus biogenesis. We analysed the status of PilQ expression in Neisseria meningitidis (Mc) and found that PilQ− mutants were non-piliated and deficient in the expression of pilus-associated phenotypes. Sequence analysis of the 5′ portion of the pilQ ORF of the serogroup B Mc strain 44/76 showed the presence of seven copies of a repetitive sequence element, in contrast to the situation in N. gonorrhoeae (Gc) strains, which carry either two or three copies of the repeat. The derived amino acid sequence of the consensus nucleotide repeat was an octapeptide PAKQQAAA, designated as the small basic repeat (SBR). This gene segment was studied in more detail in a collection of 52 Mc strains of diverse origin by screening for variability in the size of the PCR-generated DNA fragments spanning the SBRs. These strains were found to harbour from four to seven copies of the repetitive element. No association between the number of copies and the serogroup, geographic origin or multilocus genotype of the strains was evident. The presence of polymorphic repeat elements in Mc PilQ is unprecedented within the secretin family. To address the potential function of the repeat-containing domain, Mc strains were constructed so as to express chimeric PilQ molecules in which the number of SBR repeats was increased or in which the repeat-containing domain was replaced in toto by the corresponding region of the Pseudomonas aeruginosa (Pa) PilQ protein. Although the strain expressing PilQ with an increased number of SBRs was identical to the parent strain in pilus phenotypes, a strain expressing PilQ with the equivalent Pa domain had an eightfold reduction in pilus expression level. The findings suggest that the repeat-containing domain of PilQ influences Mc pilus expression quantitatively but not qualitatively.

11.
When amino acid residues are represented by parameters describing their side chain lengths and polarities, a sequence function defined as the sum of the first two sequence autocorrelation functions is found to be negatively and linearly correlated with the logarithms of folding rates of beta-proteins. The new function reveals new features in beta-protein folding: larger residues slow down the folding while alternative distribution of polar-non-polar residues accelerates the folding.
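
Illustrative note: the abstract above does not give the residue parameters or the normalization of its autocorrelation functions, so the following is a minimal sketch under stated assumptions. It uses one numeric value per residue (the abstract combines side chain length and polarity), a mean-of-products autocorrelation, and a made-up two-level scale; `TOY_SCALE`, `encode`, `autocorrelation`, and `sequence_function` are hypothetical names.

```python
# Hypothetical two-level scale: +1 for non-polar, -1 for polar residues.
TOY_SCALE = {"A": 1.0, "V": 1.0, "L": 1.0, "S": -1.0, "T": -1.0, "K": -1.0}


def encode(sequence, scale):
    """Map a one-letter amino acid sequence to per-residue parameter values."""
    return [scale[residue] for residue in sequence]


def autocorrelation(values, lag):
    """Mean of products values[i] * values[i + lag] (assumed normalization)."""
    n = len(values)
    if lag < 1 or lag >= n:
        raise ValueError("lag must satisfy 1 <= lag < len(values)")
    return sum(values[i] * values[i + lag] for i in range(n - lag)) / (n - lag)


def sequence_function(values):
    """Sum of the first two autocorrelation functions (lags 1 and 2)."""
    return autocorrelation(values, 1) + autocorrelation(values, 2)


alternating = encode("VKVKVK", TOY_SCALE)  # polar and non-polar residues alternate
print(autocorrelation(alternating, 1))     # -1.0
print(autocorrelation(alternating, 2))     # 1.0
```

In this toy setting, a strictly alternating polar/non-polar sequence gives a lag-1 value of -1 and a lag-2 value of +1, which shows how such a function picks up periodic residue patterns.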

12.
There are constraints on a protein sequence/structure for it to adopt a particular fold. These constraints could be either a local signature involving particular sequences or arrangements of secondary structure or a global signature involving features along the entire chain. To search systematically for protein fold signatures, we have explored the use of Inductive Logic Programming (ILP). ILP is a machine learning technique which derives rules from observation and encoded principles. The derived rules are readily interpreted in terms of concepts used by experts. For 20 populated folds in SCOP, 59 rules were found automatically. The accuracy of these rules, which is defined as the number of true positives plus true negatives over the total number of examples, is 74% (cross-validated value). Further analysis was carried out for 23 signatures covering 30% or more positive examples of a particular fold. The work showed that signatures of protein folds exist; about half of the rules discovered automatically coincide with the level of fold in the SCOP classification. Other signatures correspond to homologous family and may be the consequence of a functional requirement. Examination of the rules shows that many correspond to established principles published in specific literature. However, in general, the list of signatures is not part of standard biological databases of protein patterns. We find that the length of the loops makes an important contribution to the signatures, suggesting that this is an important determinant of the identity of protein folds. With the expansion in the number of determined protein structures, stimulated by structural genomics initiatives, there will be an increased need for automated methods to extract principles of protein folding from coordinates.
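
Illustrative note: the accuracy definition quoted in the abstract above (true positives plus true negatives over the total number of examples) can be sketched as follows. The counts are hypothetical, chosen only so that the result equals 74%; they are not taken from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Accuracy = (true positives + true negatives) / total number of examples."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one example is required")
    return (tp + tn) / total


# Hypothetical confusion-matrix counts for one rule set.
print(accuracy(tp=30, tn=44, fp=10, fn=16))  # 0.74
```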

13.
While the number of sequenced genomes continues to grow, experimentally verified functional annotation of whole genomes remains patchy. Structural genomics projects are yielding many protein structures that have unknown function. Nevertheless, subsequent experimental investigation is costly and time-consuming, which makes computational methods for predicting protein function very attractive. There is an increasing number of noteworthy methods for predicting protein function from sequence and structural data alone, many of which are readily available to cell biologists who are aware of the strengths and pitfalls of each available technique.

14.
Issues in predicting protein function from sequence   (total citations: 1; self-citations: 0; citations by others: 1)
Identifying homologues, defined as genes that arose from a common evolutionary ancestor, is often a relatively straightforward task, thanks to recent advances made in estimating the statistical significance of sequence similarities found from database searches. The extent to which homologues possess similarities in function, however, is less amenable to statistical analysis. Consequently, predicting function by homology is a qualitative, rather than quantitative, process and requires particular care to be taken. This review focuses on the various approaches that have been developed to predict function from the scale of the atom to that of the organism. Similarities in homologues' functions differ considerably at each of these different scales and also vary for different domain families. It is argued that due attention should be paid to all available clues to function, including orthologue identification, conservation of particular residue types, and the co-occurrence of domains in proteins. Pitfalls in database searching methods arising from amino acid compositional bias and database size effects are also discussed.

15.
16.
Liang S, Grishin NV. Proteins, 2004, 54(2): 271-281
We have developed an effective scoring function for protein design. The atomic solvation parameters, together with the weights of energy terms, were optimized so that residues corresponding to the native sequence were predicted with low energy in the training set of 28 protein structures. The solvation energy of non-hydrogen-bonded hydrophilic atoms was considered separately and expressed in a nonlinear way. As a result, our scoring function predicted native residues as the most favorable in 59% of the total positions in 28 proteins. We then tested the scoring function by comparing the predicted stability changes for 103 T4 lysozyme mutants with the experimental values. The correlation coefficients were 0.77 for surface mutations and 0.71 for all mutations. Finally, the scoring function combined with Monte Carlo simulation was used to predict favorable sequences on a fixed backbone. The designed sequences were similar to the natural sequences of the family to which the template structure belonged. The profile of the designed sequences was helpful for identification of remote homologues of the native sequence.

17.
18.
Automated discovery of 3D motifs for protein function annotation   (total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: Function inference from structure is facilitated by the use of patterns of residues (3D motifs), normally identified by expert knowledge, that correlate with function. As an alternative to often limited expert knowledge, we use machine-learning techniques to identify patterns of 3-10 residues that maximize function prediction. This approach allows us to test the assumption that residues that provide function are the most informative for predicting function. RESULTS: We apply our method, GASPS, to the haloacid dehalogenase, enolase, amidohydrolase and crotonase superfamilies and to the serine proteases. The motifs found by GASPS are as good at function prediction as 3D motifs based on expert knowledge. The GASPS motifs with the greatest ability to predict protein function consist mainly of known functional residues. However, several residues with no known functional role are equally predictive. For four groups, we show that the predictive power of our 3D motifs is comparable with or better than approaches that use the entire fold (Combinatorial-Extension) or sequence profiles (PSI-BLAST). AVAILABILITY: Source code is freely available for academic use by contacting the authors. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

19.
20.
Gupta M. Biometrics, 2007, 63(3): 797-805
A generalized hierarchical Markov model for sequences that contain length-restricted features is introduced. This model is motivated by the recent development of high-density tiling array data for determining genomic elements of functional importance. Due to length constraints on certain features of interest, as well as variability in probe behavior, usual hidden Markov-type models are not always applicable. A robust Bayesian framework that can incorporate length constraints, probe variability, and bias is developed. Moreover, a novel recursion-based Monte Carlo algorithm is proposed to estimate the parameters and impute hidden states under length constraints. Application of this methodology to yeast chromosomal arrays demonstrates substantial improvement over currently existing methods in terms of sensitivity as well as biological interpretability.
