Similar articles
 20 similar articles found (search time: 0 ms)
1.
Parallel BLAST on split databases   Cited by: 1 (self-citations: 0; citations by others: 1)
SUMMARY: BLAST programs often run on large SMP machines where multiple threads can work simultaneously and there is enough memory to cache the databases between program runs. A group of programs is described which allows comparable performance to be achieved with a Beowulf configuration in which no node has enough memory to cache a database but the cluster as an aggregate does. To achieve this result, databases are split into equal-sized pieces and stored locally on each node. Each query is run on all nodes in parallel, and the resultant BLAST output files from all nodes are merged to yield the final output. AVAILABILITY: Source code is available from ftp://saf.bio.caltech.edu/
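
The split-and-merge idea in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `split_records` and `merge_hits`, the greedy balancing strategy, and the `(subject_id, bit_score)` hit format are all assumptions made for illustration. Hits are ranked by bit score because, unlike E-values, bit scores do not depend on the size of the database piece that was searched.

```python
import heapq

def split_records(records, n_pieces):
    """Greedily assign (name, sequence) records to n_pieces bins of similar total length."""
    bins = [[] for _ in range(n_pieces)]
    # Heap of (residues assigned so far, bin index); the lightest bin is popped first.
    heap = [(0, i) for i in range(n_pieces)]
    heapq.heapify(heap)
    # Placing the longest sequences first keeps the bins better balanced.
    for name, seq in sorted(records, key=lambda r: len(r[1]), reverse=True):
        size, i = heapq.heappop(heap)
        bins[i].append((name, seq))
        heapq.heappush(heap, (size + len(seq), i))
    return bins

def merge_hits(per_node_hits, max_hits=None):
    """Merge per-node lists of (subject_id, bit_score) hits, best score first."""
    merged = sorted(
        (hit for hits in per_node_hits for hit in hits),
        key=lambda hit: hit[1],
        reverse=True,
    )
    return merged if max_hits is None else merged[:max_hits]

# Hypothetical example data: four sequences split across two nodes.
records = [("a", "A" * 10), ("b", "C" * 7), ("c", "G" * 5), ("d", "T" * 2)]
bins = split_records(records, 2)
merged = merge_hits([[("x", 50.0), ("y", 30.0)], [("z", 40.0)]])
```

In a real deployment each bin would be written to a FASTA file, formatted as a local BLAST database on one node, and searched there; only the merge step runs centrally.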

2.
16S rRNA gene analysis is the most convenient and robust method for microbiome studies. Inaccurate taxonomic assignment of bacterial strains could have deleterious effects, as all downstream analyses rely heavily on the accurate assessment of microbial taxonomy. The use of mock communities to check the reliability of the results has been suggested. However, the mock communities used in most studies represent only a small fraction of taxa and serve mostly to validate the sequencing run and estimate sequencing artifacts. Moreover, the large number of databases and tools available for classification and taxonomic assignment of the 16S rRNA gene makes it challenging to select the best-suited method for a particular dataset. In the present study, we used authentic and validly published 16S rRNA gene type strain sequences (full length, V3-V4 region) and analyzed them using the widely used QIIME pipeline along with different parameters of OTU clustering and QIIME-compatible databases. Data Analysis Measures (DAM) revealed a high discrepancy in ratifying the taxonomy at different taxonomic hierarchies. Beta diversity analysis showed clear segregation of different DAMs. Limited differences were observed in reference data set analysis using partial (V3-V4) and full-length 16S rRNA gene sequences, which signifies the reliability of partial 16S rRNA gene sequences in microbiome studies. Our analysis also highlights common discrepancies observed at various taxonomic levels using various methods and databases.

3.
Genomic sequence data are often available well before the annotated sequence is published. We present a method for analysis of genomic DNA to identify coding sequences using the GeneScan algorithm and characterize these resultant sequences by BLAST. The routines are used to develop a system for automated annotation of genome DNA sequences.

4.

Background  

With the exponential increase in genomic sequence data there is a need to develop automated approaches to deducing the biological functions of novel sequences with high accuracy. Our aim is to demonstrate how accuracy benchmarking can be used in a decision-making process evaluating competing designs of biological function predictors. We utilise the Gene Ontology (GO), a directed acyclic graph of functional terms, to annotate sequences with functional information describing their biological context. Initially we examine the effect on accuracy scores of increasing the allowed distance between predicted terms and a test set of curator-assigned terms. Next we evaluate several annotator methods using accuracy benchmarking. Given an unannotated sequence, we use the Basic Local Alignment Search Tool (BLAST) to find similar sequences that have already been assigned GO terms by curators. A number of methods were developed that utilise terms associated with the best five matching sequences. These methods were compared against a benchmark method of simply using terms associated with the best BLAST-matched sequence (best BLAST approach).

5.
SRS (Sequence Retrieval System) is a widely used keyword search engine for querying biological databases. BLAST2 is the most widely used tool to query databases by sequence similarity search. These tools allow users to retrieve sequences by shared keyword or by shared similarity, with many public web servers available. However, with the increasingly large datasets available it is now quite common that a user is interested in some subset of homologous sequences but has no efficient way to restrict retrieval to that set. By allowing the user to control SRS from the BLAST output, BLAST2SRS (http://blast2srs.embl.de/) aims to meet this need. This server therefore combines the two ways to search sequence databases: similarity and keyword.

6.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.

7.
8.
We developed nearly 700 non-redundant 2- or 3-base simple sequence repeat (SSR) markers from tomato using sequence data obtained from open genome databases. Among various types of core motifs, AT was most abundant in SSRs derived from cDNAs (~53%) and bacterial artificial chromosome (BAC) ends (~72%). There was a positive correlation between the rate of detection of polymorphic alleles (heterozygosity value; Hv) and the repeat number of the core motif in all markers showing polymorphisms among at least one pair of six cultivars or lines tested (r = 0.566**). The average Hv of BAC-end-derived SSR markers (~0.5) was higher than that of cDNA-derived markers (~0.3). These characteristics of BAC-end-derived SSRs are useful for genetic studies using closely related cultivars and lines. However, BAC-end-derived SSRs tended to cluster in centromeric regions (~80%). A scheme for the construction of a high-density linkage map of tomato is discussed.

9.
10.

Background  

High throughput technologies often require the retrieval of large data sets of sequences. Retrieval of EMBL or GenBank entries using keywords is easy using tools such as ACNUC, Entrez or SRS, but has some limitations, in particular when querying with complex keywords.

11.
Xu H, Yang L, Xu P, Tao Y, Ma Z. Proteomics, 2007, 7(2): 177-179
cTrans is a comprehensive utility used to generate polypeptide databases from cDNA sequences. It integrates four main functions: retrieving sequences of the species of interest from packages downloaded from dbEST of GenBank; format conversion; checking for and deleting vector and adaptor contamination; and translating the cDNA sequences in all six frames and selecting specific translations for database construction according to a user-defined length threshold. In addition, the utility is also applicable to cDNA sequences produced by users themselves.
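
The six-frame translation step mentioned in this abstract can be sketched as follows. This is an illustrative sketch only, not cTrans itself: the function names, the use of the standard genetic code, and the rule of keeping stop-free stretches of at least `min_length` residues are assumptions, since the abstract does not say how translations are selected.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]

def translate(seq):
    # Codons containing ambiguity codes (e.g. N) are reported as X.
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame_translations(cdna):
    """Return translations for frames +1..+3 and -1..-3, keyed by frame label."""
    seq = cdna.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames

def select_peptides(frames, min_length):
    """Assumed selection rule: keep stop-free stretches of at least min_length residues."""
    return [
        (frame, peptide)
        for frame, protein in frames.items()
        for peptide in protein.split("*")
        if len(peptide) >= min_length
    ]

frames = six_frame_translations("ATGGCCTAA")
peptides = select_peptides(frames, min_length=2)
```

For real data, a maintained library such as Biopython would normally be preferred over a hand-written codon table.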

12.
PhyloBLAST is an internet-accessed application based on CGI/Perl programming that compares a user's protein sequence to a SwissProt/TREMBL database using BLAST2 and then allows phylogenetic analyses to be performed on selected sequences from the BLAST output. Flexible features, such as the ability to input your own multiple sequence alignment and to use PHYLIP program options, provide additional web-based phylogenetic analysis functionality beyond the analysis of a BLAST result.

13.
The proteins in blood were all first expressed as mRNAs from genes within cells. There are databases of human proteins that are known to be expressed as mRNA in human cells and tissues. Proteins identified from human blood by the correlation of mass spectra that fail to match human mRNA expression products may not be correct. We compared the proteins identified in human blood by mass spectrometry by 10 different groups by correlation to human and nonhuman nucleic acid sequences. We determined whether the peptides or proteins identified by the different groups mapped to the known human proteins of the Reference Sequence (RefSeq) database. We used Structured Query Language database searches of the peptide sequences correlated to tandem mass spectrometry spectra and basic local alignment search tool (BLAST) analysis of the identified full-length proteins to control for correlation to the wrong peptide sequence or the existence of the same or a very similar peptide sequence shared by more than one protein. Mass spectra were correlated against large protein databases that contain many sequences that may not be expressed in human beings, yet the search returned a very high percentage of peptides or proteins that are known to be found in humans. Only about 5% of proteins mapped to hypothetical sequences, which is in agreement with the reported false-positive rate for the search algorithm conditions. The results were highly enriched in secreted and soluble proteins and diminished in insoluble or membrane proteins. Most of the proteins identified were relatively short and showed a similar size distribution compared to the RefSeq database. At least three groups agree on a nonredundant set of 1671 types of proteins, and a nonredundant set of 3151 proteins was identified by at least three peptides.

14.
15.
We present a tool suited for searching for many short nucleotide sequences in large databases, allowing for a predefined number of gaps and mismatches. The command-line-driven program implements a non-deterministic automata matching algorithm on a keyword tree of the search strings. Queries both with and without ambiguity codes can be searched. Search time is short for perfect matches, and retrieval time rises exponentially with the number of edits allowed. AVAILABILITY: The C++ source code for PatMaN is distributed under the GNU General Public License and has been tested on the GNU/Linux operating system. It is available from http://bioinf.eva.mpg.de/patman. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
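
The idea of matching many short patterns against a keyword tree while tolerating a bounded number of edits can be illustrated with a simplified sketch. It is not PatMaN's implementation: it handles mismatches only (no gaps, no ambiguity codes), and the names `build_trie` and `search` are hypothetical. Each stack entry stands for one active state of the non-deterministic walk through the tree; every permitted mismatch lets the walk branch into more children, which is consistent with the rapid growth in search time that the abstract reports.

```python
def build_trie(patterns):
    """Build a keyword tree; the '$' key marks the end of a pattern and stores it."""
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            node = node.setdefault(ch, {})
        node["$"] = pattern
    return root

def search(text, patterns, max_mismatches):
    """Return sorted (start, pattern, mismatches) for every approximate occurrence."""
    root = build_trie(patterns)
    hits = []
    for start in range(len(text)):
        # Active states: (trie node, next text position, mismatches used so far).
        stack = [(root, start, 0)]
        while stack:
            node, pos, used = stack.pop()
            if "$" in node:
                hits.append((start, node["$"], used))
            if pos == len(text):
                continue
            for ch, child in node.items():
                if ch == "$":
                    continue
                cost = used + (ch != text[pos])
                if cost <= max_mismatches:
                    stack.append((child, pos + 1, cost))
    return sorted(hits)

exact = search("ACGTACGA", ["ACGT", "CGA"], max_mismatches=0)
fuzzy = search("ACGTACGA", ["ACGT", "CGA"], max_mismatches=1)
```

With `max_mismatches=0` only the child matching the current text character is followed, so the walk degenerates to an exact keyword-tree lookup.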

16.
We have determined the nucleotide sequences of zein cDNA clones ZG14, ZG15, and ZG35. The three clones have 95 to 98% homology to the previously published sequence of clone A20, and 84% homology to sequences of the zein subfamily A30. Comparison of all sequences of the A30 and A20 subfamilies highlights the following features: the 5' nontranslated regions are 68 and 57 nucleotides in length for the A20- and A30-like mRNAs, respectively, and contain at least three repeats of the consensus sequence ACGAACAAta/gG; the majority of these genes are highly clustered as judged from pulsed-field gel electrophoresis of high molecular weight maize DNA. Furthermore, we discuss a model for the evolution of the multigene family which stresses the special importance of unequal crossing-over and gene conversion in this system.

17.
ViroBLAST is a stand-alone BLAST web interface for nucleotide and amino acid sequence similarity searches. It extends the utility of BLAST to query against multiple sequence databases and user sequence datasets, and provides a friendly output to easily parse and navigate BLAST results. ViroBLAST is readily useful for all research areas that require BLAST functions and is available online and as a downloadable archive for independent installation. Availability: http://indra.mullins.microbiol.washington.edu/blast/viroblast.php.

18.
MOTIVATION: Sequence databases represent an enormous resource of phylogenetic information, but there is a lack of tools for accessing that information in order to assess the amount of evolutionary information in these databases that may be suitable for phylogenetic reconstruction and for identifying areas of the taxonomy that are under-represented for specific gene sequences. RESULTS: We have developed TreeGeneBrowser which allows inspection and evaluation of gene sequence data for phylogenetic reconstruction. This program improves the efficiency of identification of genes that may be useful for particular phylogenetic studies and identifies taxa and taxonomic branches that are under-represented in sequence databases.

19.
Multiplexed tandem mass spectrometry (MS/MS) has recently been demonstrated as a means to increase the throughput of peptide identification in liquid chromatography (LC) MS/MS experiments. In this approach, a set of parent species is dissociated simultaneously and measured in a single spectrum (in the same manner that a single parent ion is conventionally studied), providing a gain in sensitivity and throughput proportional to the number of species that can be simultaneously addressed. In the present work, simulations performed using the Caenorhabditis elegans predicted proteins database show that multiplexed MS/MS data allow the identification of tryptic peptides from mixtures of up to ten peptides from a single dataset with only three "y" or "b" fragments per peptide and a mass accuracy of 2.5 to 5 ppm. At this level of database and data complexity, 98% of the 500 peptides considered in the simulation were correctly identified. This compares favorably with the rates obtained for classical MS/MS at more modest mass measurement accuracy. LC multiplexed Fourier transform-ion cyclotron resonance MS/MS data obtained from a 66 kDa protein (bovine serum albumin) tryptic digest sample are presented to illustrate the approach, and confirm that peptides can be effectively identified from the C. elegans database to which the protein sequence had been appended.

20.
Some DNA sequences in the International Nucleotide Sequence Databases (INSD) are erroneously annotated, which has led to misleading conclusions in publications. Ophiocordyceps sinensis (syn. Cordyceps sinensis) is a fungus endemic to the Tibetan Plateau, and we have examined more than 100 populations covering almost its entire distribution area over recent years. In this study, using the data from authentic materials, we have evaluated the reliability of nucleotide sequences annotated as O. sinensis in the INSD. As of October 15, 2012, the INSD contained 874 records annotated as O. sinensis, including 555 records representing nuclear ribosomal DNA (63.5 %), 197 representing protein-coding genes (22.5 %), 92 representing random markers with unknown functions (10.5 %), and 30 representing microsatellite loci (3.5 %). Our analysis indicated that 39 of the 397 internal transcribed spacer entries, 27 of the 105 small subunit entries, and five of the 53 large subunit entries were incorrectly annotated as belonging to O. sinensis. For protein-coding sequences, all records of serine protease genes, the mating-type gene MAT1-2-1, the DNA lyase gene, the two largest subunits of RNA polymerase II, and the elongation factor-1α gene were correct, while 14 of the 73 β-tubulin entries were indeterminate. Genetic diversity analyses using those sequences correctly identified as O. sinensis revealed significant genetic differentiation in the fungus, although the extent of genetic differentiation varied with the gene. The relationship between O. sinensis and some other related fungal taxa is also discussed.
