首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 46 毫秒
1.
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/ bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes. Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.  相似文献   

2.
3.
4.
5.
An extensive effort of the International Rice Genome Sequencing Project (IRGSP) has resulted in rapid accumulation of genome sequence, and >137 Mb has already been made available to the public domain as of August 2001. This requires a high-throughput annotation scheme to extract biologically useful and timely information from the sequence data on a regular basis. A new automated annotation system and database called Rice Genome Automated Annotation System (RiceGAAS) has been developed to execute a reliable and up-to-date analysis of the genome sequence as well as to store and retrieve the results of annotation. The system has the following functional features: (i) collection of rice genome sequences from GenBank; (ii) execution of gene prediction and homology search programs; (iii) integration of results from various analyses and automatic interpretation of coding regions; (iv) re-execution of analysis, integration and automatic interpretation with the latest entries in reference databases; (v) integrated visualization of the stored data using web-based graphical view. RiceGAAS also has a data submission mechanism that allows public users to perform fully automated annotation of their own sequences. The system can be accessed at http://RiceGAAS.dna.affrc.go.jp/.  相似文献   

6.
7.
When preparing data sets of amino acid or nucleotide sequences it is necessary to exclude redundant or homologous sequences in order to avoid overestimating the predictive performance of an algorithm. For some time methods for doing this have been available in the area of protein structure prediction. We have developed a similar procedure based on pair-wise alignments for sequences with functional sites. We show how a correlation coefficient between sequence similarity and functional homology can be used to compare the efficiency of different similarity measures and choose a nonarbitrary threshold value for excluding redundant sequences. The impact of the choice of scoring matrix used in the alignments is examined. We demonstrate that the parameter determining the quality of the correlation is the relative entropy of the matrix, rather than the assumed (PAM or identity) substitution model. Results are presented for the case of prediction of cleavage sites in signal peptides. By inspection of the false positives, several errors in the database were found. The procedure presented may be used as a general outline for finding a problem-specific similarity measure and threshold value for analysis of other functional amino acid or nucleotide sequence patterns.  相似文献   

8.
MOTIVATION: The availability of increasing amounts of sequence data about completely sequenced genomes spurs the development of new methods in the fields of automated annotation, and of comparative genomics. Tools allowing the visualization of results produced by analysis methods, superimposed on possibly annotated sequence data, and enabling synchronized navigation in multiple genomes, provide new means for interactive genome exploration. This kind of visual inspection can be used as a basis to assess the quality of new analysis algorithms, or to discover genome portions to be subjected to in-depth studies. RESULTS: We propose a software package, MuGeN, built for navigating through multiple annotated genomes. It is capable of retrieving annotated sequences in several formats, stored in local files, or available in databases over the network. From these, it then generates an interactive display, or an image file, in most common formats suitable for printing, further editing or integrating in Web pages. Genome maps may be mixed with computer analysis results loaded from XML files, whose format is generic enough to be adapted to a majority of sequence oriented analysis methods. AVAILABILITY: MuGeN is available at http://www-mig.jouy.inra.fr/bdsi/MuGeN.  相似文献   

9.
10.
Personal access to sequence databases on personal computers.   总被引:6,自引:0,他引:6       下载免费PDF全文
A comprehensive package of software has been developed to access nucleic acid and protein sequence databases on stand-alone IBM personal computers. The software combines keyword search on the annotation fields of the data with pattern matching algorithms on the biological sequences. Sequences containing complex sites like promoters or kink sites can be identified as well as sequences that are similar to a query sequence. Protein sequences with particular patterns of amino acids such as hydrophobic regions can be identified as well. Considering the relatively inexpensive hard disks now available, personal computers have become a cost-effective alternative to mainframe processing for sequence databases.  相似文献   

11.
SUMMARY: SQUINT is a sequence alignment tool, and combines both automated progressive sequence alignment with facilities for manual editing. The program imports nucleotide or amino acid sequence multiple alignment files in standard formats, and permits users to view two translations of the same multiple alignment simultaneously. Edits in one view are instantaneously reflected in the other, and the scoring cost of the changes are shown in real-time. Progressive multiple alignments, using a variety of alignment parameters, can be performed on any block of sequences, including blocks embedded in the existing alignment. AVAILABILITY: The software is freely available for download at http://www.cebl.auckland.ac.nz  相似文献   

12.
13.
MOTIVATION: Motif detection is an important component of the classification and annotation of protein sequences. A method for aligning motifs with an amino acid sequence is introduced. The motifs can be described by the secondary (i.e. functional, biophysical, etc.) characteristics of a signal or pattern to be detected. The results produced are based on the statistical relevance of the alignment. The method was targeted to avoid the problems (i.e. over-fitting, biological interpretation and mathematical soundness) encountered in other methods currently available. RESULTS: The method was tested on lipoprotein signals in B. subtilis yielding stable results. The results of signal prediction were consistent with other methods where literature was available. AVAILABILITY: An implementation of the motif alignment, refining and bootstrapping is available for public use online at http://www.expasy.org/tools/patoseq/  相似文献   

14.
We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of these 43,719 non-transposable element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org.  相似文献   

15.
Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.  相似文献   

16.
Myristoylation by the myristoyl-CoA:protein N-myristoyltransferase (NMT) is an important lipid anchor modification of eukaryotic and viral proteins. Automated prediction of N-terminal N-myristoylation from the substrate protein sequence alone is necessary for large-scale sequence annotation projects but it requires a low rate of false positive hits in addition to a sufficient sensitivity.Our previous analysis of substrate protein sequence variability, NMT sequences and 3D structures has revealed motif properties in addition to the known PROSITE motif that are utilized in a new predictor described here. The composite prediction function (with separate ad hoc parameterization (a) for queries from non-fungal eukaryotes and their viruses and (b) for sequences from fungal species) consists of terms evaluating amino acid type preferences at sequences positions close to the N terminus as well as terms penalizing deviations from the physical property pattern of amino acid side-chains encoded in multi-residue correlation within the motif sequence. The algorithm has been validated with a self-consistency and two jack-knife tests for the learning set as well as with kinetic data for model substrates. The sensitivity in recognizing documented NMT substrates is above 95 % for both taxon-specific versions. The corresponding rate of false positive prediction (for sequences with an N-terminal glycine residue) is close to 0.5 %; thus, the technique is applicable for large-scale automated sequence database annotation. The predictor is available as public WWW-server with the URL http://mendel.imp.univie.ac.at/myristate/. Additionally, we propose a version of the predictor that identifies a number of proteolytic protein processing sites at internal glycine residues and that evaluates possible N-terminal myristoylation of the protein fragments.A scan of public protein databases revealed new potential NMT targets for which the myristoyl modification may be of critical importance for biological function. Among others, the list includes kinases, phosphatases, proteasomal regulatory subunit 4, kinase interacting proteins KIP1/KIP2, protozoan flagellar proteins, homologues of mitochondrial translocase TOM40, of the neuronal calcium sensor NCS-1 and of the cytochrome c-type heme lyase CCHL. Analyses of complete eukaryote genomes indicate that about 0.5 % of all encoded proteins are apparent NMT substrates except for a higher fraction in Arabidopsis thaliana ( approximately 0.8 %).  相似文献   

17.
Next‐generation technologies generate an overwhelming amount of gene sequence data. Efficient annotation tools are required to make these data amenable to functional genomics analyses. The Mercator pipeline automatically assigns functional terms to protein or nucleotide sequences. It uses the MapMan ‘BIN’ ontology, which is tailored for functional annotation of plant ‘omics’ data. The classification procedure performs parallel sequence searches against reference databases, compiles the results and computes the most likely MapMan BINs for each query. In the current version, the pipeline relies on manually curated reference classifications originating from the three reference organisms (Arabidopsis, Chlamydomonas, rice), various other plant species that have a reviewed SwissProt annotation, and more than 2000 protein domain and family profiles at InterPro, CDD and KOG. Functional annotations predicted by Mercator achieve accuracies above 90% when benchmarked against manual annotation. In addition to mapping files for direct use in the visualization software MapMan, Mercator provides graphical overview charts, detailed annotation information in a convenient web browser interface and a MapMan‐to‐GO translation table to export results as GO terms. Mercator is available free of charge via http://mapman.gabipd.org/web/guest/app/Mercator .  相似文献   

18.
19.
A wide range of web based prediction and annotation tools are frequently used for determining protein function from sequence. However, parallel processing of sequences for annotation through web tools is not possible due to several constraints in functional programming for multiple queries. Here, we propose the development of APAF as an automated protein annotation filter to overcome some of these difficulties through an integrated approach.  相似文献   

20.
UniRef: comprehensive and non-redundant UniProt reference clusters   总被引:2,自引:0,他引:2  
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号