Similar articles
 Found 20 similar articles; search took 9 ms
1.
2.
3.
The KB-Rank tool was developed to help determine the functions of proteins. A user provides a text query, and protein structures are retrieved together with their functional annotation categories. Structures and annotation categories are ranked according to their estimated relevance to the queried text. The ranking algorithm first retrieves matches between the query text and the text fields associated with the structures. The structures are then ordered by their relative content of annotations that are found to be prevalent across all the structures retrieved. An interactive web interface was implemented to navigate and interpret the relevance of the structures and annotation categories retrieved by a given search. The aim of the KB-Rank tool is to provide a means to quickly identify protein structures of interest and the annotations most relevant to the queries posed by a user. Informational and navigational searches regarding disease topics are described to illustrate the tool's utility. The tool is available at the URL http://protein.tcmedc.org/KB-Rank.
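The two-stage ranking described in the abstract above can be sketched roughly as follows. This is only an illustration, not the KB-Rank implementation: the function name `rank_structures`, the whitespace-based term matching, and the scoring rule (mean prevalence of a structure's annotations among the retrieved hits) are all assumptions made for the example.

```python
from collections import Counter


def rank_structures(query, structures):
    """Rank structures and annotations for a text query (illustrative only).

    structures: dict mapping a structure id to a dict with
    "text" (str) and "annotations" (set of str). Hypothetical layout.
    """
    terms = set(query.lower().split())

    # Stage 1: keep structures whose text shares at least one query term.
    hits = {
        sid: s for sid, s in structures.items()
        if terms & set(s["text"].lower().split())
    }
    if not hits:
        return [], []

    # Prevalence of each annotation across all retrieved structures.
    counts = Counter(a for s in hits.values() for a in s["annotations"])
    prevalence = {a: c / len(hits) for a, c in counts.items()}

    # Stage 2: score a structure by the mean prevalence of its annotations
    # (an assumed stand-in for "relative content of prevalent annotations").
    def score(s):
        anns = s["annotations"]
        return sum(prevalence[a] for a in anns) / len(anns) if anns else 0.0

    ranked_structures = sorted(hits, key=lambda sid: (-score(hits[sid]), sid))
    ranked_annotations = sorted(prevalence, key=lambda a: (-prevalence[a], a))
    return ranked_structures, ranked_annotations
```

Ties are broken alphabetically so that the output is deterministic; a real system would likely use a proper text-retrieval score in stage 1.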

4.
GoFigure: automated Gene Ontology annotation   Total citations: 4; self-citations: 0; citations by others: 4
SUMMARY: We have developed a web tool to predict Gene Ontology (GO) terms. The tool accepts an input DNA or protein sequence, and uses BLAST to identify homologous sequences in GO annotated databases. A graph is returned to the user via email. AVAILABILITY: The tool is freely available at: http://udgenome.ags.udel.edu/frm_go.html/

5.
6.
7.
A fully automated method is presented for determining NMR solution structures of proteins using exclusively NOESY spectra as input, obviating the need to measure additional spectra that serve only to obtain resonance assignments and carry no structural information. Applied to two small proteins, the approach yielded structures that coincided closely with conventionally determined structures.

8.
MOTIVATION: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation by literature curation and sequence analysis tools without the use of automated annotation systems is not able to keep up with the ever-increasing quantity of data that is submitted. Automated supplements to manually curated databases such as TrEMBL or GenPept cover raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information and help to detect inconsistencies in manually generated annotations. RESULTS: A standard data mining algorithm was successfully applied to gain knowledge about the Keyword annotation in SWISS-PROT. 11 306 rules were generated, which are provided in a database and can be applied to as yet unannotated protein sequences and viewed using a web browser. They rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. The statistical evaluation of the generated rules by cross-validation suggests that by applying them to arbitrary proteins 33% of their keyword annotation can be generated with an error rate of 1.5%. The coverage rate of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%. AVAILABILITY: The results of the automatic data mining process can be browsed on http://golgi.ebi.ac.uk:8080/Spearmint/ Source code is available upon request. CONTACT: kretsch@ebi.ac.uk.
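The trade-off between coverage and error rate reported above can be made concrete with a small sketch. The rule representation (a set of required features plus a predicted keyword), the feature strings, and the exact definitions of coverage (correctly predicted keywords over all true keywords) and error rate (wrong predictions over all predictions) are assumptions for illustration; the abstract does not spell them out.

```python
def apply_rules(rules, features):
    """Return the keywords predicted for one protein.

    rules: iterable of (conditions, keyword), where conditions is a
    frozenset of features that must all be present (hypothetical format).
    """
    return {keyword for conditions, keyword in rules if conditions <= features}


def evaluate(rules, proteins):
    """Compute (coverage, error_rate) over proteins with known keywords.

    proteins: iterable of (features, true_keywords) pairs.
    """
    predicted_total = correct = true_total = 0
    for features, true_keywords in proteins:
        predicted = apply_rules(rules, features)
        predicted_total += len(predicted)
        correct += len(predicted & true_keywords)
        true_total += len(true_keywords)
    coverage = correct / true_total if true_total else 0.0
    error_rate = (
        (predicted_total - correct) / predicted_total if predicted_total else 0.0
    )
    return coverage, error_rate
```

Keeping only rules with higher confidence would typically lower both numbers, which matches the reported move from 60% coverage at 5% error to 33% coverage at 1.5% error.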

9.
Error-tolerant backbone resonance assignment is the cornerstone of the NMR structure determination process. Although a variety of assignment approaches have been developed, none works sufficiently well on noisy, fully automatically picked peaks to enable the subsequent automatic structure determination steps. We have designed an integer linear programming (ILP) based assignment system (IPASS) that has enabled fully automatic protein structure determination for four test proteins. IPASS employs probabilistic spin system typing based on chemical shifts and secondary structure predictions. Furthermore, IPASS extracts connectivity information from the inter-residue information and the (automatically picked) 15N-edited NOESY peaks, which are then used to fix reliable fragments. When applied to automatically picked peaks for real proteins, IPASS achieves an average precision and recall of 82% and 63%, respectively. In contrast, the next best method, MARS, achieves an average precision and recall of 77% and 36%, respectively. The assignments generated by IPASS are then fed into our protein structure calculation system, FALCON-NMR, to determine the 3D structures without human intervention. The final models have backbone RMSDs of 1.25 Å, 0.88 Å, 1.49 Å, and 0.67 Å to the reference native structures for proteins TM1112, CASKIN, VRAR, and HACS1, respectively. The web server is publicly available at http://monod.uwaterloo.ca/nmr/ipass.
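For readers unfamiliar with the precision and recall figures quoted above, the following sketch shows one common way to compute them for an assignment. The dictionary layout (spin system id mapped to residue number) and the function name are hypothetical; the abstract does not state how IPASS counts correct assignments.

```python
def precision_recall(predicted, reference):
    """Precision and recall of a predicted assignment (illustrative).

    predicted, reference: dicts mapping a spin system id to the residue
    number it is assigned to. A prediction is correct when the reference
    assigns the same spin system to the same residue.
    """
    correct = sum(1 for key, res in predicted.items() if reference.get(key) == res)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall
```

A method that assigns few spin systems but gets them right scores high precision and low recall, which is the pattern reported for MARS (77% and 36%).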

10.
Modern structural genomics projects demand integrated methods for the interpretation and storage of nuclear magnetic resonance (NMR) data. Here we present version 2.1 of our program ARIA (Ambiguous Restraints for Iterative Assignment) for automated assignment of nuclear Overhauser enhancement (NOE) data and NMR structure calculation. We report on recent developments, most notably a graphical user interface, and the incorporation of the object-oriented data model of the Collaborative Computing Project for NMR (CCPN). The CCPN data model defines a storage model for NMR data, which greatly facilitates the transfer of data between different NMR software packages. Availability: A distribution with the source code of ARIA 2.1 is freely available at http://www.pasteur.fr/recherche/unites/Binfs/aria2.

11.
Protein structure determination by NMR can in principle be sped up both by reducing the measurement time on the NMR spectrometer and by a more efficient analysis of the spectra. Here we study the reliability of protein structure determination based on a single type of spectra, namely nuclear Overhauser effect spectroscopy (NOESY), using a fully automated procedure for the sequence-specific resonance assignment with the recently introduced FLYA algorithm, followed by combined automated NOE distance restraint assignment and structure calculation with CYANA. This NOESY-FLYA method was applied to eight proteins with 63–160 residues for which resonance assignments and solution structures had previously been determined by the Northeast Structural Genomics Consortium (NESG), and unrefined and refined NOESY data sets have been made available for the Critical Assessment of Automated Structure Determination of Proteins by NMR project. Using only peak lists from three-dimensional 13C- or 15N-resolved NOESY spectra as input, the FLYA algorithm yielded for the eight proteins 91–98 % correct backbone and side-chain assignments if manually refined peak lists are used, and 64–96 % correct assignments based on raw peak lists. Subsequent structure calculations with CYANA then produced structures with root-mean-square deviation (RMSD) values to the manually determined reference structures of 0.8–2.0 Å if refined peak lists are used. With raw peak lists, calculations for four proteins converged, resulting in RMSDs to the reference structure of 0.8–2.8 Å, whereas no convergence was obtained for the four other proteins (two of which already failed to converge with the correct manual resonance assignments given as input). These results show that, given high-quality experimental NOESY peak lists, the chemical shift assignments can be uncovered, without any recourse to traditional through-bond type assignment experiments, to an extent that is sufficient for calculating accurate three-dimensional structures.
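The RMSD values quoted above compare a calculated structure with a reference. A minimal sketch of the quantity itself is given below, assuming the two coordinate sets are already optimally superimposed and list the same atoms in the same order; the superposition step and the choice of atoms are not covered, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) tuples, in the same units as the input (e.g. angstroms).

    Assumes the structures were superimposed beforehand.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```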

12.
MOTIVATION: Position-specific scoring matrices (PSSMs) corresponding to aligned sequences of homologous proteins are commonly used in homology detection. A PSSM is generated on the basis of one of the homologues as a reference sequence, which is the query in the case of PSI-BLAST searches. The reference sequence is chosen arbitrarily while generating PSSMs for reverse BLAST searches. In this work we demonstrate that the use of multiple PSSMs corresponding to a given alignment and variable reference sequences is more effective than using traditional single PSSMs and hidden Markov models. RESULTS: Searches for proteins with known 3-D structures have been made against three databases of protein family profiles corresponding to known structures: (1) one PSSM per family; (2) multiple PSSMs corresponding to an alignment and variable reference sequences for every family; and (3) hidden Markov models. A comparison of the performances of these three approaches suggests that the use of multiple PSSMs is most effective. CONTACT: ns@mbu.iisc.ernet.in.
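To illustrate why the choice of reference sequence matters, the sketch below builds a simple log-odds PSSM from one alignment relative to a chosen reference: columns where the reference has a gap are dropped, so different references give matrices of different length and content. The uniform background frequencies, the additive pseudocount, and the function name are simplifying assumptions; PSI-BLAST uses a more elaborate weighting scheme.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, ref_index, pseudocount=1.0):
    """Build a log-odds PSSM from equally long aligned sequences.

    Only columns where the reference sequence has a residue are kept.
    Returns a list of dicts, one per kept column, mapping amino acid
    to a log2 odds score against a uniform background (assumption).
    """
    reference = alignment[ref_index]
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for col, ref_residue in enumerate(reference):
        if ref_residue == "-":
            continue  # column absent from this reference's coordinate system
        column = [seq[col] for seq in alignment if seq[col] != "-"]
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm
```

Calling `build_pssm(alignment, i)` for every `i` yields the set of "multiple PSSMs corresponding to an alignment and variable reference sequences" in the sense of this sketch.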

13.
Profile matching methods are commonly used in searches in protein sequence databases to detect evolutionary relationships. We describe here a sensitive protocol, which detects remote similarities by searching in a specialized database of sequences belonging to a fold. We have assessed this protocol by exploring the relationships we detect among sequences known to belong to specific folds. We find that searches within sequences adopting a fold are more effective in detecting remote similarities and evolutionary connections than searches in a database of all sequences. We also discuss the implications of using this strategy to link sequence and structure space.

14.
MOTIVATION: The function of an unknown biological sequence can often be accurately inferred if we are able to map this unknown sequence to its corresponding homologous family. At present, discriminative methods such as SVM-Fisher and SVM-pairwise, which combine support vector machine (SVM) and sequence similarity, are recognized as the most accurate methods, with SVM-pairwise being the most accurate. However, these methods typically encode sequence information into their feature vectors and ignore the structure information. They are also computationally inefficient. Based on these observations, we present an alternative method for SVM-based protein classification. Our proposed method, SVM-I-sites, utilizes structure similarity for remote homology detection. RESULT: We run experiments on the Structural Classification of Proteins 1.53 data set. The results show that SVM-I-sites is more efficient than SVM-pairwise. Further, we find that SVM-I-sites outperforms sequence-based methods such as PSI-BLAST, SAM, and SVM-Fisher while achieving a comparable performance with SVM-pairwise. AVAILABILITY: I-sites server is accessible through the web at http://www.bioinfo.rpi.edu. Programs are available upon request for academics. Licensing agreements are available for commercial interests. The framework for encoding local structure into a feature vector is available upon request.

15.
16.
TRAP, the Tandem Repeats Analysis Program, is a Perl program that provides a unified set of analyses for the selection, classification, quantification and automated annotation of tandemly repeated sequences. TRAP uses the results of the Tandem Repeats Finder program to perform a global analysis of the satellite content of DNA sequences, permitting researchers to easily assess the tandem repeat content for both individual sequences and whole genomes. The results can be generated in convenient formats such as HTML and comma-separated values. TRAP can also be used to automatically generate annotation data in the format of feature table and GFF files.

17.
Organisms with a high density of transposable elements (TEs) exhibit nesting, with subsequent repeats found inside previously inserted elements. Nesting splits the sequence structure of TEs and makes annotation of repetitive areas challenging. We present TEnest, a repeat identification and display tool made specifically for highly repetitive genomes. TEnest identifies repetitive sequences and reconstructs separated sections to provide full-length repeats and, for long-terminal repeat (LTR) retrotransposons, calculates age since insertion based on LTR divergence. TEnest provides a chronological insertion display to give an accurate visual representation of TE integration history showing timeline, location, and families of each TE identified, thus creating a framework from which evolutionary comparisons can be made among various regions of the genome. A database of repeats has been developed for maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare) to illustrate the potential of TEnest software. All currently finished maize bacterial artificial chromosomes totaling 29.3 Mb were analyzed with TEnest to provide a characterization of the repeat insertions. Sixty-seven percent of the maize genome was found to be made up of TEs; of these, 95% are LTR retrotransposons. The rate of solo LTR formation is shown to be dissimilar across retrotransposon families. Phylogenetic analysis of TE families reveals specific events of extreme TE proliferation, which may explain the high quantities of certain TE families found throughout the maize genome. The TEnest software package is available for use on PlantGDB under the tools section (http://www.plantgdb.org/prj/TE_nest/TE_nest.html); the source code is available from (http://wiselab.org).
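Dating an LTR retrotransposon insertion from LTR divergence rests on the fact that the two LTRs are identical at insertion and then accumulate substitutions independently, so a commonly used estimate is T = K / (2r), with K the divergence between the LTRs and r the substitution rate per site per year. The sketch below uses an uncorrected p-distance for K and takes the rate as a caller-supplied parameter; both choices, and the function names, are assumptions for illustration and not a description of how TEnest computes ages.

```python
def ltr_divergence(ltr5, ltr3):
    """Uncorrected p-distance between two aligned LTR sequences.

    Columns with a gap ('-') in either sequence are ignored.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("aligned LTRs must have equal length")
    pairs = [(a, b) for a, b in zip(ltr5.upper(), ltr3.upper())
             if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in pairs) / len(pairs)


def insertion_age(ltr5, ltr3, rate_per_site_per_year):
    """Estimate years since insertion as K / (2 * r)."""
    return ltr_divergence(ltr5, ltr3) / (2.0 * rate_per_site_per_year)
```

The factor of two reflects that both LTRs diverge from their common state at the time of insertion.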

18.
An extensive effort of the International Rice Genome Sequencing Project (IRGSP) has resulted in rapid accumulation of genome sequence, and >137 Mb has already been made available to the public domain as of August 2001. This requires a high-throughput annotation scheme to extract biologically useful and timely information from the sequence data on a regular basis. A new automated annotation system and database called Rice Genome Automated Annotation System (RiceGAAS) has been developed to execute a reliable and up-to-date analysis of the genome sequence as well as to store and retrieve the results of annotation. The system has the following functional features: (i) collection of rice genome sequences from GenBank; (ii) execution of gene prediction and homology search programs; (iii) integration of results from various analyses and automatic interpretation of coding regions; (iv) re-execution of analysis, integration and automatic interpretation with the latest entries in reference databases; (v) integrated visualization of the stored data using a web-based graphical view. RiceGAAS also has a data submission mechanism that allows public users to perform fully automated annotation of their own sequences. The system can be accessed at http://RiceGAAS.dna.affrc.go.jp/.

19.
MOTIVATION: Much current research in biomedical text mining is concerned with serving biologists by extracting certain information from scientific text. We note that there is no 'average biologist' client; different users have distinct needs. For instance, as noted in past evaluation efforts (BioCreative, TREC, KDD) database curators are often interested in sentences showing experimental evidence and methods. Conversely, lab scientists searching for known information about a protein may seek facts, typically stated with high confidence. Text-mining systems can target specific end-users and become more effective, if the system can first identify text regions rich in the type of scientific content that is of interest to the user, retrieve documents that have many such regions, and focus on fact extraction from these regions. Here, we study the ability to characterize and classify such text automatically. We have recently introduced a multi-dimensional categorization and annotation scheme, developed to be applicable to a wide variety of biomedical documents and scientific statements, while intended to support specific biomedical retrieval and extraction tasks. RESULTS: The annotation scheme was applied to a large corpus in a controlled effort by eight independent annotators, where three individual annotators independently tagged each sentence. We then trained and tested machine learning classifiers to automatically categorize sentence fragments based on the annotation. We discuss here the issues involved in this task, and present an overview of the results. The latter strongly suggest that automatic annotation along most of the dimensions is highly feasible, and that this new framework for scientific sentence categorization is applicable in practice.

20.
MOTIVATION: Detecting genes in viral genomes is a complex task. Due to the biological necessity of them being constrained in length, RNA viruses in particular tend to code in overlapping reading frames. Since one amino acid is encoded by a triplet of nucleotides, up to three genes may be coded for simultaneously in one direction. Conventional hidden Markov model (HMM)-based gene-finding algorithms may typically find it difficult to identify multiple coding regions, since in general their topologies do not allow for the presence of overlapping or nested genes. Comparative methods have therefore been restricted to likelihood ratio tests on potential regions as to being double or single coding, using the fact that the constrictions forced upon multiple-coding nucleotides will result in atypical sequence evolution. Exploiting these same constraints, we present an HMM-based gene-finding program, which allows for coding in unidirectional nested and overlapping reading frames, to annotate two homologous aligned viral genomes. Our method does not insist on conserved gene structure between the two sequences, thus making it applicable for the pairwise comparison of more distantly related sequences. RESULTS: We apply our method to 15 pairwise alignments of six different HIV2 genomes. Given sufficient evolutionary distance between the two sequences, we achieve sensitivity of approximately 84-89% and specificity of approximately 97-99.9%. We additionally annotate three pairwise alignments of the more distantly related HIV1 and HIV2, as well as of two different hepatitis viruses, attaining results of approximately 87% sensitivity and approximately 98.5% specificity. We subsequently incorporate prior knowledge by 'knowing' the gene structure of one sequence and annotating the other conditional on it. Boosting accuracy close to perfect, we demonstrate that conservation of gene structure on top of nucleotide sequence is a valuable source of information, especially in distantly related genomes. AVAILABILITY: The Java code is available from the authors.
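The statement that up to three genes can be encoded simultaneously in one direction follows from the three possible codon offsets. The short sketch below splits a nucleotide string into its three forward reading frames; the function name is hypothetical and the example has nothing to do with the authors' HMM, it only shows how one nucleotide takes part in a different codon in each frame.

```python
def forward_frames(sequence):
    """Return the codons of the three forward reading frames.

    The result is a list of three lists; frame k starts at offset k.
    Trailing nucleotides that do not fill a complete codon are dropped.
    """
    return [
        [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]
        for offset in range(3)
    ]
```

For the input "ATGGCCTAA", the fourth nucleotide (G) is the first position of GCC in frame 0, the third position of TGG in frame 1, and the second position of GGC in frame 2, which is why a substitution there is constrained by every overlapping gene at once.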
