Similar literature
 20 similar articles found (search time: 46 ms)
1.
The measurement of biodiversity is an integral aspect of life science research. With the establishment of second- and third-generation sequencing technologies, an increasing amount of metabarcoding data is being generated as we seek to describe the extent and patterns of biodiversity in multiple contexts. The reliability and accuracy of taxonomically assigning metabarcoding sequencing data have been shown to be critically influenced by the quality and completeness of reference databases. Custom, curated, eukaryotic reference databases, however, are scarce, as are the software programs for generating them. Here, we present crabs (Creating Reference databases for Amplicon-Based Sequencing), a software package to create custom reference databases for metabarcoding studies. crabs includes tools to download sequences from multiple online repositories (i.e., NCBI, BOLD, EMBL, MitoFish), retrieve amplicon regions through in silico PCR analysis and pairwise global alignments, curate the database through multiple filtering parameters (e.g., dereplication, sequence length, sequence quality, unresolved taxonomy, inclusion/exclusion filter), export the reference database in multiple formats for immediate use in taxonomy assignment software, and investigate the reference database through implemented visualizations for diversity, primer efficiency, reference sequence length, database completeness and taxonomic resolution. crabs is a versatile tool for generating curated reference databases of user-specified genetic markers to aid taxonomy assignment from metabarcoding sequencing data. crabs can be installed via docker and is available for download as a conda package and via GitHub (https://github.com/gjeunen/reference_database_creator).
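The curation filters named in this abstract (dereplication, sequence length, sequence quality) can be illustrated with a minimal Python sketch. This is not the crabs implementation or its command-line interface; the function name `curate` and its parameters `min_len`, `max_len` and `max_n` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Tuple


def curate(records: Iterable[Tuple[str, str]],
           min_len: int,
           max_len: int,
           max_n: int = 0) -> Dict[str, str]:
    """Toy curation pass (hypothetical, not the crabs code).

    records: (sequence_id, sequence) pairs.
    Returns {sequence: id of the first record carrying that sequence}.
    """
    kept: Dict[str, str] = {}
    for seq_id, seq in records:
        seq = seq.upper()
        # Length filter: drop sequences outside the expected amplicon range.
        if not (min_len <= len(seq) <= max_len):
            continue
        # Quality filter: drop sequences with too many ambiguous bases.
        if seq.count("N") > max_n:
            continue
        # Dereplication: keep only the first record for each exact sequence.
        kept.setdefault(seq, seq_id)
    return kept


if __name__ == "__main__":
    demo = [("a", "ACGT"), ("b", "acgt"), ("c", "ACNT"), ("d", "AC")]
    print(curate(demo, min_len=3, max_len=10))
```

Keying the result on the sequence itself makes exact-duplicate removal a single dictionary lookup; a real pipeline would usually also compare the taxonomic labels of duplicates before discarding them.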

2.
The typical wet lab user often annotates smaller sequences in the GenBank format, but the resulting files are not accepted for database submission by NCBI. This makes submission of such annotations a cumbersome task. Here we present “GB2sequin”, an easy-to-use web application that converts custom annotations in the GenBank format into the NCBI direct submission format Sequin. Additionally, the program generates a “five-column, tab-delimited feature table” and a FASTA file. These are required for submission through BankIt or for updating an existing GenBank entry. We specifically developed “GB2sequin” for the regular wet lab researcher, with a strong focus on user-friendliness and flexibility. The application is equipped with an intuitive graphical interface and comprehensive documentation. It can be employed to prepare any GenBank file for database submission and is freely available online at https://chlorobox.mpimp-golm.mpg.de/GenBank2Sequin.html.

3.
4.
MOTIVATION: Numerous database management systems have been developed for processing various taxonomic databases on biological classification or phylogenetic information. In this paper, we present an integrated system to deal with interacting classifications and phylogenies concerning particular taxonomic groups. RESULTS: An information-theoretic view (taxon view) has been applied to capture taxonomic concepts as taxonomic data entities. A data model which is suitable for supporting semantically interacting dynamic views of hierarchic classifications and a query method for interacting classifications have been developed. The concept of taxonomic view and the data model can also be expanded to carry phylogenetic information in phylogenetic trees. We have designed a prototype taxonomic database system called HICLAS (HIerarchical CLAssification System) based on the concept of taxon view, and the data models and query methods have been designed and implemented. This system can be effectively used in the taxonomic revisionary process, especially when databases are being constructed by specialists in particular groups, and the system can be used to compare classifications and phylogenetic trees. AVAILABILITY: Freely available at the WWW URL: http://aims.cps.msu.edu/hiclas/ CONTACT: pramanik@cps.msu.edu; lotus@wipm.whcnc.ac.cn

5.
Histone and histone fold sequences and structures: a database.   Total citations: 4 (self-citations: 3; citations by others: 1)
A database of aligned histone protein sequences has been constructed based on the results of homology searches of the major public sequence databases. In addition, sequences of proteins identified as containing the histone fold motif and structures of all known histone and histone fold proteins have been included in the current release. Database resources include information on conflicts between similar sequence entries in different source databases, multiple sequence alignments, and links to the Entrez integrated information retrieval system at the National Center for Biotechnology Information (NCBI). The database currently contains over 1000 protein sequences. All sequences and alignments in this database are available through the World Wide Web at: http://www.ncbi.nlm.nih.gov/Baxevani/HISTONES/.

6.
Nucleotide sequence and taxonomy reference databases are critical resources for widespread applications including marker-gene and metagenome sequencing for microbiome analysis, diet metabarcoding, and environmental DNA (eDNA) surveys. Reproducibly generating, managing, using, and evaluating nucleotide sequence and taxonomy reference databases creates a significant bottleneck for researchers aiming to generate custom sequence databases. Furthermore, database composition drastically influences results, and lack of standardization limits cross-study comparisons. To address these challenges, we developed RESCRIPt, a Python 3 software package and QIIME 2 plugin for reproducible generation and management of reference sequence taxonomy databases, including dedicated functions that streamline creating databases from popular sources, and functions for evaluating, comparing, and interactively exploring qualitative and quantitative characteristics across reference databases. To highlight the breadth and capabilities of RESCRIPt, we provide several examples for working with popular databases for microbiome profiling (SILVA, Greengenes, NCBI-RefSeq, GTDB), eDNA and diet metabarcoding surveys (BOLD, GenBank), as well as for genome comparison. We show that bigger is not always better, and reference databases with standardized taxonomies and those that focus on type strains have quantitative advantages, though may not be appropriate for all use cases. Most databases appear to benefit from some curation (quality filtering), though sequence clustering appears detrimental to database quality. Finally, we demonstrate the breadth and extensibility of RESCRIPt for reproducible workflows with a comparison of global hepatitis genomes. RESCRIPt provides tools to democratize the process of reference database acquisition and management, enabling researchers to reproducibly and transparently create reference materials for diverse research applications. 
RESCRIPt is released under a permissive BSD-3 license at https://github.com/bokulich-lab/RESCRIPt.

7.
Comparing bacterial 16S rDNA sequences to GenBank and other large public databases via BLAST often provides results of little use for identification and taxonomic assignment of the organisms of interest. The human microbiome, and in particular the oral microbiome, includes many taxa, and accurate identification of sequence data is essential for studies of these communities. For this purpose, a phylogenetically curated 16S rDNA database of the core oral microbiome, CORE, was developed. The goal was to include a comprehensive and minimally redundant representation of the bacteria that regularly reside in the human oral cavity with computationally robust classification at the level of species and genus. Clades of cultivated and uncultivated taxa were formed based on sequence analyses using multiple criteria, including maximum-likelihood-based topology and bootstrap support, genetic distance, and previous naming. A number of classification inconsistencies for previously named species, especially at the level of genus, were resolved. The performance of the CORE database for identifying clinical sequences was compared to that of three publicly available databases, GenBank nr/nt, RDP and HOMD, using a set of sequencing reads that had not been used in creation of the database. CORE offered improved performance compared to other public databases for identification of human oral bacterial 16S sequences by a number of criteria. In addition, the CORE database and phylogenetic tree provide a framework for measures of community divergence, and the focused size of the database offers advantages of efficiency for BLAST searching of large datasets. The CORE database is available as a searchable interface and for download at http://microbiome.osu.edu.

8.
Background: A main goal of metagenomics is taxonomic characterization of microbial communities. Although sequence comparison has been the main method for taxonomic classification, there is no clear agreement on similarity calculation and similarity thresholds, especially at higher taxonomic levels such as phylum and class. Thus taxonomic classification of novel metagenomic sequences without close homologs in the biological databases poses a challenge. Methods: In this study, we propose to use the co-abundant associations between taxa/operational taxonomic units (OTU) across complex and diverse communities to assist taxonomic classification. We developed a Markov Random Field model to predict taxa of unknown microorganisms using co-abundant associations. Results: Although such associations are intrinsically functional associations, we demonstrate that they are strongly correlated with taxonomic associations and can be combined with sequence comparison methods to predict taxonomic origins of unknown microorganisms at phylum and class levels. Conclusions: With the ever-increasing accumulation of sequence data from microbial communities, we now take the first step to explore these associations for taxonomic identification beyond sequence similarity. Availability and Implementation: Source code of TACO, implemented in C++ and supported on Linux and MS Windows, is freely available at https://github.com/baharvand/OTU-Taxonomy-Identification.

9.
Internal Ribosome Entry Sites (IRES) are cis-acting RNA sequences able to mediate internal entry of the 40S ribosomal subunit on some eukaryotic and viral messenger RNAs upstream of a translation initiation codon. These sequences are very diverse and are present in a growing list of mRNAs. Novel IRES sequences continue to be added to public databases every year and the list of unknown IRESes is certainly still very large. The IRES database is a comprehensive WWW resource for internal ribosome entry sites and presents currently available general information as well as detailed data for each IRES. It is a searchable, periodically updated collection of IRES RNA sequences. Sequences are presented in FASTA form and hotlinked to NCBI GenBank files. Several subsets of data are classified according to the viral taxon (for viral IRESes), to the gene product function (for cellular IRESes), to the possible cellular regulation or to the trans-acting factor that mediates IRES function. This database is accessible at http://ifr31w3.toulouse.inserm.fr/IRESdatabase/.

10.
Identification of North Sea molluscs with DNA barcoding   Total citations: 1 (self-citations: 0; citations by others: 1)
Sequence‐based specimen identification, known as DNA barcoding, is a common method complementing traditional morphology‐based taxonomic assignments. The fundamental resource in DNA barcoding is the availability of a taxonomically reliable sequence database to use as a reference for sequence comparisons. Here, we provide a reference library including 579 sequences of the mitochondrial cytochrome c oxidase subunit I for 113 North Sea mollusc species. We tested the efficacy of this library by simulating a sequence‐based specimen identification scenario using Best Match, Best Close Match (BCM) and All Species Barcode (ASB) criteria with three different threshold values. Each identification result was compared with our prior morphology‐based taxonomic assignments. Our simulation resulted in 87.7% congruent identifications (93.8% when excluding singletons). The highest number of congruent identifications was obtained with BCM and ASB and a 0.05 threshold. We also compared identifications with genetic clustering (Barcode Index Numbers, BINs) computed by the Barcode of Life Datasystem (BOLD). About 68% of our morphological identifications were congruent with BINs created by BOLD. Forty‐nine sequences were clustered in 16 discordant BINs, and these were divided into two classes: sequences from different species clustered in a single BIN, and conspecific sequences divided among several BINs. Whereas the former incongruences were probably caused by BOLD entries in need of a taxonomic update, the latter involved taxa requiring further investigation. These include species with amphi‐Atlantic distribution, whose genetic structure should be evaluated over their entire range to produce a reliable sequence‐based identification system.
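As a rough illustration of a Best Close Match style criterion, the Python sketch below assigns a query to the species of its nearest reference barcode only when the distance falls within a threshold. It reflects a common reading of the criterion and is not the authors' pipeline; the uncorrected p-distance on pre-aligned, equal-length sequences, the function names and the return labels are all assumptions made for this example.

```python
from typing import List, Tuple


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty, equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def best_close_match(query: str,
                     library: List[Tuple[str, str]],
                     threshold: float) -> str:
    """library holds (species, aligned_sequence) pairs (hypothetical layout).

    Returns a species name, "ambiguous" when equally close best hits
    belong to different species, or "no match" when the nearest hit
    lies beyond the threshold.
    """
    if not library:
        raise ValueError("reference library is empty")
    scored = [(p_distance(query, seq), species) for species, seq in library]
    best = min(dist for dist, _ in scored)
    if best > threshold:
        return "no match"
    species_at_best = {sp for dist, sp in scored if dist == best}
    if len(species_at_best) == 1:
        return species_at_best.pop()
    return "ambiguous"


if __name__ == "__main__":
    refs = [("species_A", "ACGTACGTAC"), ("species_B", "ACGTTTTTAC")]
    print(best_close_match("ACGTACGTAA", refs, threshold=0.15))
```

Dropping the threshold test turns this into a plain Best Match rule, which always returns the nearest species and therefore cannot flag queries that have no close relative in the library.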

11.
GSTaxClassifier (Genomic Signature based Taxonomic Classifier) is a program for metagenomics analysis of shotgun DNA sequences. The program includes
  1. a simple but effective algorithm, a modification of the Bayesian method, to predict the most probable genomic origins of sequences at different taxonomical ranks, on the basis of genome databases;
  2. a function to generate genomic profiles of reference sequences with tri-, tetra-, penta-, and hexa-nucleotide motifs for setting a user-defined database;
  3. two different formats (tabular- and tree-based summaries) to display taxonomic predictions with improved analytical methods; and
  4. effective ways to retrieve, search, and summarize results by integrating the predictions into the NCBI tree-based taxonomic information.
GSTaxClassifier takes nucleotide sequences as input and, using a modified Bayesian model, evaluates the genomic signatures shared between metagenomic query sequences and reference genome databases. Simulation studies on numerical data sets showed that GSTaxClassifier could serve as a useful program for metagenomics studies. It is freely available at http://helix2.biotech.ufl.edu:26878/metagenomics/.
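The genomic profiles of tri- to hexa-nucleotide motifs mentioned in the list above can be pictured with a short Python sketch. It only computes relative k-mer frequencies and does not reproduce GSTaxClassifier's modified Bayesian scoring; the function name `kmer_profile` and the choice to skip windows containing ambiguous bases are assumptions for illustration.

```python
from collections import Counter
from typing import Dict

VALID_BASES = set("ACGT")


def kmer_profile(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each k-mer in seq (hypothetical helper).

    Windows containing characters other than A, C, G, T are skipped.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= VALID_BASES
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


if __name__ == "__main__":
    # Profiles for k = 3..6 correspond to tri- to hexa-nucleotide motifs.
    for k in range(3, 7):
        print(k, kmer_profile("ACGACGTTGACA", k))
```

A classifier built on such profiles would compare the query's frequencies with those stored for each reference genome; how that comparison is scored is specific to each tool and is not shown here.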

12.
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.

13.
BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences   Total citations: 49 (self-citations: 0; citations by others: 49)
'BLAST 2 Sequences', a new BLAST-based tool for aligning two protein or nucleotide sequences, is described. While the standard BLAST program is widely used to search for homologous sequences in nucleotide and protein databases, one often needs to compare only two sequences that are already known to be homologous, coming from related species or, e.g., different isolates of the same virus. In such cases searching the entire database would be unnecessarily time-consuming. 'BLAST 2 Sequences' utilizes the BLAST algorithm for pairwise DNA-DNA or protein-protein sequence comparison. A World Wide Web version of the program can be used interactively at the NCBI WWW site (http://www.ncbi.nlm.nih.gov/gorf/bl2.html). The resulting alignments are presented in both graphical and text form. The variants of the program for PC (Windows), Mac and several UNIX-based platforms can be downloaded from the NCBI FTP site (ftp://ncbi.nlm.nih.gov).

14.
Photosynthetic eukaryotes have a critical role as the main producers in most ecosystems of the biosphere. The ongoing environmental metabarcoding revolution opens the perspective for holistic ecosystems biological studies of these organisms, in particular the unicellular microalgae that often lack distinctive morphological characters and have complex life cycles. To interpret environmental sequences, metabarcoding necessarily relies on taxonomically curated databases containing reference sequences of the targeted gene (or barcode) from identified organisms. To date, no such reference framework exists for photosynthetic eukaryotes. In this study, we built the PhytoREF database that contains 6490 plastidial 16S rDNA reference sequences that originate from a large diversity of eukaryotes representing all known major photosynthetic lineages. We compiled 3333 amplicon sequences available from public databases and 879 sequences extracted from plastidial genomes, and generated 411 novel sequences from cultured marine microalgal strains belonging to different eukaryotic lineages. A total of 1867 environmental Sanger 16S rDNA sequences were also included in the database. Stringent quality filtering and a phylogeny‐based taxonomic classification were applied for each 16S rDNA sequence. The database mainly focuses on marine microalgae, but sequences from land plants (representing half of the PhytoREF sequences) and freshwater taxa were also included to broaden the applicability of PhytoREF to different aquatic and terrestrial habitats. PhytoREF, accessible via a web interface (http://phytoref.fr), is a new resource in molecular ecology to foster the discovery, assessment and monitoring of the diversity of photosynthetic eukaryotes using high‐throughput sequencing.

15.
Cyanobacteria are photosynthetic bacteria that occupy various habitats across the globe, playing critical roles in many of Earth's biogeochemical cycles in both aquatic and terrestrial systems. Despite their well-known significance, their taxonomy remains problematic and is the subject of much research. Taxonomic issues of Cyanobacteria have consequently led to inaccurate curation within known reference databases, ultimately leading to problematic taxonomic assignment during diversity studies. Recent advances in sequencing technologies have increased our ability to characterize and understand microbial communities, leading to the generation of thousands of sequences that require taxonomic assignment. We herein propose CyanoSeq (https://zenodo.org/record/7569105), a database of cyanobacterial 16S rRNA gene sequences with curated taxonomy. The taxonomy of CyanoSeq is based on the current state of cyanobacterial taxonomy, with ranks from the domain to genus level. Files are provided for use with common naive Bayes taxonomic classifiers, such as those included in DADA2 or the QIIME2 platform. Additionally, FASTA files are provided for creation of de novo phylogenetic trees with (near) full-length 16S rRNA gene sequences to determine the phylogenetic relationship of cyanobacterial strains and/or ASV/OTUs. The database currently consists of 5410 cyanobacterial 16S rRNA gene sequences along with 123 Chloroplast, Bacterial, and Vampirovibrionia (formerly Melainabacteria) sequences.

16.
The first step of any molecular phylogenetic analysis is the selection of the species and sequences to be included, the taxon sampling. Pitfalls exist even at this stage. Sequences can contain errors, annotations in databases can be inaccurate and even the taxonomic classification of a species can be wrong. Usually, these artefacts become evident only after calculation of the phylogenetic tree. The taxon sampling then has to be corrected iteratively. This can become tedious and time consuming, as in most cases the taxon sampling is de-coupled from the further steps of the phylogenetic analysis. Here, we present the ITS2 Workbench (http://its2.bioapps.biozentrum.uni-wuerzburg.de/), which eliminates this problem by a tight integration of taxon sampling, secondary structure prediction, multiple alignment and phylogenetic tree calculation. The ITS2 Workbench has access to more than 280,000 ITS2 sequences and their structures provided by the ITS2 database, enabling sequence-structure based alignment and tree reconstruction. This allows the interactive improvement of the taxon sampling throughout the whole phylogenetic tree reconstruction process. Thus, the ITS2 Workbench enables a fast, interactive and iterative taxon sampling leading to more accurate ITS2 based phylogenies.

17.
18.
Around 27,000 prokaryote genomes are presently deposited in the Genome database of GenBank at the National Center for Biotechnology Information (NCBI) and this number is exponentially growing. However, it is not known how many of these genomes correspond correctly to their designated taxon. The taxonomic affiliation of 44 Aeromonas genomes (only five of these are type strains) deposited at the NCBI was determined by a multilocus phylogenetic analysis (MLPA) and by pairwise average nucleotide identity (ANI). Discordant results in relation to taxa assignation were found for 14 (35.9%) of the 39 non-type strain genomes on the basis of both the MLPA and ANI results. Data presented in this study also demonstrated that if the genome of the type strain is not available, a genome of the same species correctly identified can be used as a reference for ANI calculations. Of the three ANI calculating tools compared (ANI calculator, EzGenome and JSpecies), EzGenome and JSpecies provided very similar results. However, the ANI calculator provided higher intra- and inter-species values than the other two tools (differences within the ranges 0.06–0.82% and 0.92–3.38%, respectively). Nevertheless each of these tools produced the same species classification for the studied Aeromonas genomes. To avoid possible misinterpretations with the ANI calculator, particularly when values are at the borderline of the 95% cutoff, one of the other calculation tools (EzGenome or JSpecies) should be used in combination. It is recommended that once a genome sequence is obtained the correct taxonomic affiliation is verified using ANI or a MLPA before it is submitted to the NCBI and that researchers should amend the existing taxonomic errors present in databases.

19.
MOTIVATION: Biological sequence databases are highly redundant for two main reasons: (1) various databanks keep redundant sequences, with many identical and nearly identical sequences, and (2) natural sequences often have high sequence identities due to gene duplication. We wanted to know how many sequences can be removed before the databases start losing homology information. Can a database of sequences with mutual sequence identity of 50% or less provide us with the same amount of biological information as the original full database? RESULTS: Comparisons of nine representative sequence databases (RSDB) derived from full protein databanks showed that the information content of sequence databases is not linearly proportional to their size. An RSDB reduced to mutual sequence identity of around 50% (RSDB50) was equivalent to the original full database in terms of the effectiveness of homology searching. It was a third of the size of the full database, which made iterative profile searching six times faster. The RSDBs are produced at different granularities for efficient homology searching. AVAILABILITY: All the RSDB files generated and the full analysis results are available via the internet at ftp://ftp.ebi.ac.uk/pub/contrib/jong/RSDB/ and http://cyrah.ebi.ac.uk:1111/Proj/Bio/RSDB
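The idea of reducing a database to representatives with bounded mutual identity can be sketched as a greedy filter in Python. This is not the procedure used to build the RSDB files; in particular, `difflib.SequenceMatcher.ratio()` is used here only as a crude stand-in for alignment-based sequence identity, and the function name and the `max_identity` parameter are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def greedy_representatives(seqs: Iterable[str],
                           max_identity: float = 0.5) -> List[str]:
    """Keep a sequence only if it is at most max_identity similar
    to every representative already kept (toy similarity measure).
    """
    representatives: List[str] = []
    # Visit longer sequences first so they become the representatives.
    for seq in sorted(seqs, key=len, reverse=True):
        is_novel = all(
            SequenceMatcher(None, seq, rep, autojunk=False).ratio()
            <= max_identity
            for rep in representatives
        )
        if is_novel:
            representatives.append(seq)
    return representatives


if __name__ == "__main__":
    proteins = ["MKVLAAGIVG", "MKVLAAGIVG", "WWPPHHCCDD"]
    print(greedy_representatives(proteins, max_identity=0.5))
```

The all-against-representatives comparison is quadratic in the worst case, which is why production tools rely on word filters or indexing before computing any alignment; the sketch ignores that concern for clarity.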

20.
Vertebrate MitBASE is a specialized database where all the vertebrate mitochondrial DNA entries from primary databases are collected, revised and integrated with new information emerging from the literature. Variant sequences are also analyzed, aligned and linked to reference sequences. Data related to the same species and fragment can be viewed over the WWW. The database has a flexible interface and a retrieval system to help non-expert users and contains information not currently available in the primary databases. Vertebrate MitBASE is now available through the MitBASE home page at URL: http://www.ebi.ac.uk/htbin/Mitbase/mitbase.pl. This work is part of a larger project, MitBASE, which is a network of databases covering the full panorama of knowledge on mitochondrial DNA from protists to human sequences.
