Similar articles
 20 similar articles found
1.
Unbiased high-throughput sequencing of whole metagenome shotgun DNA libraries is a promising new approach to identifying microbes in clinical specimens, which, unlike other techniques, is not limited to known sequences. Unlike most sequencing applications, it is highly sensitive to laboratory contaminants as these will appear to originate from the clinical specimens. To assess the extent and diversity of sequence contaminants, we aligned 57 “1000 Genomes Project” sequencing runs from six centers against the four largest NCBI BLAST databases, detecting reads of diverse contaminant species in all runs and identifying the most common of these contaminant genera (Bradyrhizobium) in assembled genomes from the NCBI Genome database. Many of these microorganisms have been reported as contaminants of ultrapure water systems. Studies aiming to identify novel microbes in clinical specimens will greatly benefit from not only preventive measures such as extensive UV irradiation of water and cross-validation using independent techniques, but also a concerted effort to sequence the complete genomes of common contaminants so that they may be subtracted computationally.

2.
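This study examined the 12S rRNA gene locus, which is commonly used for animal species identification. Using partial 12S rRNA gene sequences determined from 17 mammal species commonly encountered in criminal cases, together with DNA sequences of these species and their close relatives downloaded from the NCBI database, we constructed phylogenetic trees. Based on the clustering in the trees, we assessed whether the relevant gene sequences or species names in the NCBI database are correct and flagged the accession numbers of erroneous sequences, so that they do not compromise the accurate identification of animals in later cases. Mitochondrial DNA was extracted from the 17 mammal species (26 samples in total), and a partial fragment of the mitochondrial 12S rRNA gene was amplified with universal primers and sequenced. Using the BLAST alignment function of the NCBI database, we screened for species ranked from high to low homology with the study species and downloaded a total of 351 12S rRNA gene sequences of these closely related species from the NCBI gene database. MEGA 7.0 was used to construct phylogenetic trees of each species and its close relatives. The comparison showed that the Latin species names assigned to three NCBI sequences, including accession number KP202279, are incorrect. The Latin names assigned to 11 sequences, including accession number AY184436, may be questionable. Some species in GenBank have synonymous Latin names. The reliability of NCBI database data therefore requires further verification; the data can serve only as one reference source for identifying species in criminal cases, and methods such as phylogenetic tree construction can be used to confirm the accuracy of the results.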

3.
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes including those of many plants. A specialized plant query page allows maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.

4.
Molecular identification of mixed‐species pollen samples has a range of applications in various fields of research. To date, such molecular identification has primarily been carried out via amplicon sequencing, but whole‐genome shotgun (WGS) sequencing of pollen DNA has potential advantages, including (1) more genetic information per sample and (2) the potential for better quantitative matching. In this study, we tested the performance of WGS sequencing methodology and publicly available reference sequences in identifying species and quantifying their relative abundance in pollen mock communities. Using mock communities previously analyzed with DNA metabarcoding, we sequenced approximately 200 Mbp for each sample using Illumina HiSeq and MiSeq. Taxonomic identifications were based on the Kraken k‐mer identification method with reference libraries constructed from full‐genome and short read archive data from the NCBI database. We found WGS to be a reliable method for taxonomic identification of pollen with near 100% identification of species in mixtures but generating higher rates of false positives (reads not identified to the correct taxon at the required taxonomic level) relative to rbcL and ITS2 amplicon sequencing. For quantification of relative species abundance, WGS data provided a stronger correlation between pollen grain proportion and sequence read proportion, but diverged more from a 1:1 relationship, likely due to the higher rate of false positives. Currently, a limitation of WGS‐based pollen identification is the lack of representation of plant diversity in publicly available genome databases. As databases improve and costs drop, we expect that eventually genomics methods will become the methods of choice for species identification and quantification of mixed‐species pollen samples.

5.
6.
Huntley MA, Golding GB. Proteins, 2002, 48(1):134-140
A simple sequence is abundant in the proteins that have been sequenced to date. But unusual protein features, such as a simple sequence, are not present in the same high frequency within structural databases. A subset of these simple sequences, a group with a highly repetitive nature has been shown to be abundant in eukaryotes but not in prokaryotes. In this study, an examination of the eukaryotic proteins in the Protein Data Bank (PDB) has revealed a large deficiency of low complexity, highly repetitive protein repeats. Through simulated databases of similar samples of eukaryotic proteins taken from the National Center for Biotechnology Information (NCBI) database, it is shown that the PDB contains a significantly less highly repetitive, simple sequence than artificial databases of similar composition randomly derived from NCBI. When the structural data for those few PDB sequences that did contain a highly repetitive simple sequence is examined in detail, it is found that in most cases the tertiary structure is unknown for the regions consisting of a simple sequence. This lack of a simple sequence both in the PDB database and in the structural information suggests that this type of simple sequence may produce disordered structures that make structural characterization difficult.

7.
8.
Low-biomass samples from nitrate and heavy metal contaminated soils yield DNA amounts that have limited use for direct, native analysis and screening. Multiple displacement amplification (MDA) using phi29 DNA polymerase was used to amplify whole genomes from environmental, contaminated, subsurface sediments. By first amplifying the genomic DNA (gDNA), biodiversity analysis and gDNA library construction of microbes found in contaminated soils were made possible. The MDA method was validated by analyzing amplified genome coverage from approximately five Escherichia coli cells, resulting in 99.2% genome coverage. The method was further validated by confirming overall representative species coverage and also an amplification bias when amplifying from a mix of eight known bacterial strains. We extracted DNA from samples with extremely low cell densities from a U.S. Department of Energy contaminated site. After amplification, small-subunit rRNA analysis revealed relatively even distribution of species across several major phyla. Clone libraries were constructed from the amplified gDNA, and a small subset of clones was used for shotgun sequencing. BLAST analysis of the library clone sequences showed that 64.9% of the sequences had significant similarities to known proteins, and "clusters of orthologous groups" (COG) analysis revealed that more than half of the sequences from each library contained sequence similarity to known proteins. The libraries can be readily screened for native genes or any target of interest. Whole-genome amplification of metagenomic DNA from very minute microbial sources, while introducing an amplification bias, will allow access to genomic information that was not previously accessible. The reported SSU rRNA sequences and library clone end sequences are listed with their respective GenBank accession numbers, DQ 404590 to DQ 404652, DQ 404654 to DQ 404938, and DX 385314 to DX 389173.

9.
Sequence-based species identification relies on the extent and integrity of sequence data available in online databases such as GenBank. When identifying species from a sample of unknown origin, partial DNA sequences obtained from the sample are aligned against existing sequences in databases. When the sequence from the matching species is not present in the database, high-scoring alignments with closely related sequences might produce unreliable results on species identity. For species identification in mammals, the cytochrome b (cyt b) gene has been identified to be highly informative; thus, large amounts of reference sequence data from the cyt b gene are much needed. To enhance availability of cyt b gene sequence data on a large number of mammalian species in GenBank and other such publicly accessible online databases, we identified a primer pair for complete cyt b gene sequencing in mammals. Using this primer pair, we successfully PCR amplified and sequenced the complete cyt b gene from 40 of 44 mammalian species representing 10 orders of mammals. We submitted 40 complete, correctly annotated, cyt b protein coding sequences to GenBank. To our knowledge, this is the first single primer pair to amplify the complete cyt b gene in a broad range of mammalian species. This primer pair can be used for the addition of new cyt b gene sequences and to enhance data available on species represented in GenBank. The availability of novel and complete gene sequences as high-quality reference data can improve the reliability of sequence-based species identification.

10.
Low-biomass samples from nitrate and heavy metal contaminated soils yield DNA amounts that have limited use for direct, native analysis and screening. Multiple displacement amplification (MDA) using phi29 DNA polymerase was used to amplify whole genomes from environmental, contaminated, subsurface sediments. By first amplifying the genomic DNA (gDNA), biodiversity analysis and gDNA library construction of microbes found in contaminated soils were made possible. The MDA method was validated by analyzing amplified genome coverage from approximately five Escherichia coli cells, resulting in 99.2% genome coverage. The method was further validated by confirming overall representative species coverage and also an amplification bias when amplifying from a mix of eight known bacterial strains. We extracted DNA from samples with extremely low cell densities from a U.S. Department of Energy contaminated site. After amplification, small-subunit rRNA analysis revealed relatively even distribution of species across several major phyla. Clone libraries were constructed from the amplified gDNA, and a small subset of clones was used for shotgun sequencing. BLAST analysis of the library clone sequences showed that 64.9% of the sequences had significant similarities to known proteins, and “clusters of orthologous groups” (COG) analysis revealed that more than half of the sequences from each library contained sequence similarity to known proteins. The libraries can be readily screened for native genes or any target of interest. Whole-genome amplification of metagenomic DNA from very minute microbial sources, while introducing an amplification bias, will allow access to genomic information that was not previously accessible.

11.
A substantial proportion of infections caused by drug-resistant Gram-negative bacteria (GNB) in community and health care settings are recognized to be caused by evolutionarily related GNB strains. Their global spread has been suggested to occur due to human activities, such as food trade and travel. These multidrug-resistant GNB pathogens often harbor mobile drug resistance genes that are highly conserved in their sequences. Because they appear across different GNB species, these genes may have origins other than human pathogens. We hypothesized that saprophytes in common human food products may serve as a reservoir for such genes. Between July 2007 and April 2008, we examined 25 batches of prepackaged retail spinach for cultivatable GNB population structure by 16S rRNA gene sequencing and for antimicrobial drug susceptibility testing and the presence of extended-spectrum beta-lactamase (ESBL) genes. We found 20 recognized GNB species among 165 (71%) of 231 randomly selected colonies cultured from spinach. Twelve strains suspected to express ESBLs based on resistance to cefotaxime and ceftazidime were further examined for bla(CTX-M) and bla(TEM) genes. We found a 712-bp sequence in Pseudomonas teessidea that was 100% identical to positions 10 to 722 of an 876-bp bla(CTX-M-15) gene of an E. coli strain. Additionally, we identified newly recognized ESBL bla(RAHN-2) sequences from Rahnella aquatilis. These observations demonstrate that saprophytes in common fresh produce can harbor drug resistance genes that are also found in internationally circulating strains of GNB pathogens; such a source may thus serve as a reservoir for drug resistance genes that ultimately enter pathogens to affect human health.

12.
The exponential growth of sequence data has become a challenge to database curators and end-users alike and biologists seeking to utilize the data effectively are faced with numerous analysis methods. Here, with practical examples from our bioinformatics analysis of the protein tyrosine phosphatases (PTPs), we show how computational analysis can be exploited to fuel hypothesis-driven experimental research through the exploration of online databases. We cover the following elements: (i) similarity searches and strategies to collect a non-redundant database of tyrosine-specific PTP domains; (ii) utilization of this database to classify human, fly, and worm PTPs (based on alignments and phylogenetic analysis); (iii) three-dimensional structural analysis to identify conserved regions (structure-function) and non-conserved selectivity-determining regions (substrate specificity); and (iv) genomic analysis, including mapping of exon structure, identification of pseudogenes, and exploration of disease databases. We discuss the importance of manual curation, illustrating examples in which pseudogenes give rise to predicted proteins in GenBank and note that domain servers, such as PFAM and SMART, erroneously include dual-specificity and lipid phosphatases in their collection of tyrosine-specific PTPs. To capitalize on our annotated set of 402 PTP domains (from 47 species and five phyla), we identify sequence conservation across taxonomic categories and explore structure-function relationships among tandem domain receptor-like PTPs. We define three Src homology 2 domain-containing PTP genes in stingray, zebrafish, and fugu and speculate on their evolutionary relationship with human pseudogenes. Our annotated sequences, along with a web service for phylogenetic classification of PTP domains, are available online (http://ptp.cshl.edu and http://science.novonordisk.com/ptp).

13.
EST sequencing of Onychophora and phylogenomic analysis of Metazoa
Onychophora (velvet worms) represent a small animal taxon considered to be related to Euarthropoda. We have obtained 1873 5' cDNA sequences (expressed sequence tags, ESTs) from the velvet worm Epiperipatus sp., which were assembled into 833 contigs. BLAST similarity searches revealed that 51.9% of the contigs had matches in the protein databases with expectation values lower than 10^-4. Most ESTs had the best hit with proteins from either Chordata or Arthropoda (approximately 40% each). The ESTs included sequences of 27 ribosomal proteins. The orthologous sequences from 28 other species of a broad range of phyla were obtained from the databases, including other EST projects. A concatenated amino acid alignment comprising 5021 positions was constructed, which covered 4259 positions after problematic regions were removed. Bayesian and maximum likelihood methods place Epiperipatus within the monophyletic Ecdysozoa (Onychophora, Arthropoda, Tardigrada and Nematoda), but its exact relation to the Euarthropoda remained unresolved. The "Articulata" concept was not supported. Tardigrada and Nematoda formed a well-supported monophylum, suggesting that Tardigrada are actually Cycloneuralia. In agreement with previous studies, we have demonstrated that random sequencing of cDNAs results in sequence information suitable for phylogenomic approaches to resolve metazoan relationships.

14.
15.
Simple Sequence Repeats (SSRs) developed from Expressed Sequence Tags (ESTs), known as EST-SSRs, are a widely used and potentially valuable source of gene-based markers because of their high cross-taxon portability and their rapid, less expensive development. The EST sequence information in publicly available databases is increasing at a faster rate. The emerging computational approach provides a better alternative to conventional methods for developing SSR markers from ESTs. In the present study, 12,851 EST sequences of Camellia sinensis, downloaded from the National Center for Biotechnology Information (NCBI), were mined for the development of microsatellites. After preprocessing and assembly of these sequences using various computational tools, 6148 non-redundant EST sequences (4779 singletons and 1369 contigs) were found. Out of a total of 3822.68 kb of sequence examined, 1636 (26.61%) EST sequences containing 2371 SSRs were detected, a density of 1 SSR per 1.61 kb, leading to the development of 245 primer pairs. These mined EST-SSR markers will help further the study of variability, mapping, and evolutionary relationships in Camellia sinensis. In addition, the developed SSRs can also be applied in various studies across species.
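
The mining step and the reported density can be illustrated with a minimal Python sketch. It is not the pipeline used in the study, which is not named in the abstract: the regular-expression search for perfect repeats, the function names, and the minimum repeat counts in MIN_REPEATS are all assumptions chosen for illustration. Only the totals passed to the two arithmetic helpers come from the abstract.

```python
import re

# Minimum number of repeat units per motif length (hypothetical thresholds,
# not taken from the abstract).
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"(([ACGT]{%d})\2{%d,})" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(2)
            # Skip motifs that are themselves repeats of a shorter unit
            # (for example "AA" or "ATAT"), which shorter motif lengths cover.
            if any(motif == motif[:d] * (unit_len // d)
                   for d in range(1, unit_len) if unit_len % d == 0):
                continue
            hits.append((match.start(), motif, len(match.group(1)) // unit_len))
    return sorted(hits)


def ssr_density_kb(total_kb, n_ssrs):
    """Average sequence length (kb) examined per SSR found."""
    return total_kb / n_ssrs


def percent_with_ssr(n_with_ssr, n_total):
    """Percentage of sequences that contain at least one SSR."""
    return 100.0 * n_with_ssr / n_total


if __name__ == "__main__":
    print(find_ssrs("GGC" + "AT" * 7 + "CCG"))
    # Totals reported in the abstract.
    print(round(ssr_density_kb(3822.68, 2371), 2))
    print(round(percent_with_ssr(1636, 4779 + 1369), 2))
```

Run against the abstract's totals, the two helpers reproduce the reported 1.61 kb per SSR and 26.61% figures, which confirms that the 6148 non-redundant sequences are the sum of the singletons and contigs.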

16.
A comprehensive complementary DNA (cDNA) library is a valuable resource for functional genomics. In this study, we set up a normalized cDNA library of Mo17 (MONL) by saturation hybridization with genomic DNA, which contained expressed genes of eight tissues and organs from inbred Mo17 of maize (Zea mays L.). In this library, the insert sizes range from 0.4 kb to 4 kb and the average size is 1.18 kb. 10,830 clones were spotted on nylon membrane to make a cDNA microarray. Three hundred randomly picked clones from the cDNA library were sequenced. The cDNA microarray was hybridized with pooled tissue mRNA probes or housekeeping gene cDNA probes. The results showed the normalized cDNA library comprehensively includes tissue-specific genes in which 71% are unique ESTs (expressed sequence tags) based on the 300 sequences analyzed. Using the BLAST program to compare the sequences against online nucleotide databases, 88% of the sequences were found in ZmDB or NCBI, and 12% were not found in existing nucleotide databases. More than 73% of the sequences are of unknown function. The library could be extensively used in developing DNA markers, sequencing ESTs, mining new genes, identifying positional cloning and candidate genes, and developing microarrays in maize genomics research.

17.
Current microbial source tracking (MST) methods for water depend on testing for fecal indicator bacterial counts or specific marker gene sequences to identify fecal contamination where potential human pathogenic bacteria could be present. In this study, we applied 454 high-throughput pyrosequencing to identify bacterial pathogen DNA sequences, including those not traditionally monitored by MST, and correlated their abundances to specific sources of contamination such as urban runoff and agricultural runoff from concentrated animal feeding operations (CAFOs), recreation park area, waste-water treatment plants, and natural sites with little or no human activities. Samples for pyrosequencing were surface water and sediment collected from 19 sites. A total of 12,959 16S rRNA gene sequences with average length of ≤400 bp were obtained and were assigned to corresponding taxonomic ranks using the Ribosomal Database Project (RDP) Classifier and Greengenes databases. The percentages of total potential pathogens were highest in urban runoff water (7.94%), agricultural runoff sediment (6.52%), and Prado Park sediment (6.00%), respectively. Although the numbers of DNA sequence tags from pyrosequencing were very high for the natural site, corresponding percent potential pathogens were very low (3.78–4.08%). Most of the potential pathogenic bacterial sequences identified were from three major phyla, namely, Proteobacteria, Bacteroidetes, and Firmicutes. The use of deep sequencing may provide improved and faster methods for the identification of pathogen sources in most watersheds so that better risk assessment methods may be developed to enhance public health.

18.
Palindromati, the massive host-edited synthetic palindromic contamination found in GenBank, is illustrated and exemplified. Millions of contaminated sequences with portions or tandems of such portions derived from the ZAP adaptor or related linkers are shown (1) by the 12-bp sequence reported elsewhere, exon Xb, 5' CCCGAATTCGGG 3', (2) by a 22-bp related sequence 5' CTCGTGCCGAATTCGGCACGAG 3', and (3) by a longer 44-bp related sequence: 5' CTCGTGCCGAATTCGGCACGAGCTCGTGCCGAATTCGGCACGAG 3'. Possible reasons why those long contaminating sequences persist in the databases are presented here: (1) the recognition site for the plus strand (+) is single-strand self-annealed; (2) the recognition site for the minus strand (-) is not only single-strand self-annealed but also located far away from the single-strand self-annealed plus strand, rendering impossible the formation of the active EcoRI enzyme dimer to cut on 5' G/AATTC 3', its target sequence. As a possible solution, it is suggested to rely on at least two or three independent results, such as sequences obtained by independent laboratories with the use, preferably, of independent sequencing methodologies. This information may help to develop bioinformatics tools capable of detecting and removing these contaminants and to infer why some damaged sequences which cause genetic diseases escape detection by the molecular quality control mechanism of cells and organisms, being undesirably transferred unchecked through the generations.
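
The three contaminant sequences quoted in this abstract can be checked and searched for with a short Python sketch. It is only an illustration of the kind of screening tool the abstract calls for, not a tool described by the authors: the dictionary keys, function names, and the exact-match search strategy are assumptions, while the three motifs are copied from the abstract.

```python
# Complement table for unambiguous DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# Motifs quoted in the abstract; the dictionary keys are hypothetical labels.
LINKERS = {
    "linker_12bp": "CCCGAATTCGGG",
    "linker_22bp": "CTCGTGCCGAATTCGGCACGAG",
    "linker_44bp": "CTCGTGCCGAATTCGGCACGAGCTCGTGCCGAATTCGGCACGAG",
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_palindromic(seq):
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def find_linker_hits(seq):
    """Return (label, position) for every exact occurrence of a motif.

    Overlapping occurrences are reported, so a 44-bp tandem also yields
    two hits for the 22-bp motif it is built from.
    """
    seq = seq.upper()
    hits = []
    for label, motif in LINKERS.items():
        start = seq.find(motif)
        while start != -1:
            hits.append((label, start))
            start = seq.find(motif, start + 1)
    return hits


if __name__ == "__main__":
    for label, motif in LINKERS.items():
        print(label, len(motif), is_palindromic(motif))
    print(find_linker_hits("AAAA" + LINKERS["linker_22bp"] + "TTTT"))
```

All three motifs equal their own reverse complement, and the 44-bp sequence is a tandem of the 22-bp one, which is consistent with the self-annealing argument in the abstract. Exact matching is the simplest choice here; a practical screen would probably also need to tolerate sequencing errors.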

19.
BLAST (Basic Local Alignment Search Tool) searches against DNA and protein sequence databases have become an indispensable tool for biomedical research. The proliferation of the genome sequencing projects is steadily increasing the fraction of genome-derived sequences in the public databases and their importance as a public resource. We report here the availability of Genomic BLAST, a novel graphical tool for simplifying BLAST searches against complete and unfinished genome sequences. This tool allows the user to compare the query sequence against a virtual database of DNA and/or protein sequences from a selected group of organisms with finished or unfinished genomes. The organisms for such a database can be selected using either a graphic taxonomy-based tree or an alphabetical list of organism-specific sequences. The first option is designed to help explore the evolutionary relationships among organisms within a certain taxonomy group when performing BLAST searches. The use of an alphabetical list allows the user to perform a more elaborate set of selections, assembling any given number of organism-specific databases from unfinished or complete genomes. This tool, available at the NCBI web site http://www.ncbi.nlm.nih.gov/cgi-bin/Entrez/genom_table_cgi, currently provides access to over 170 bacterial and archaeal genomes and over 40 eukaryotic genomes.

20.
The heat shock RNA-1 (HSR1) is a noncoding RNA (ncRNA) reported to be involved in mammalian heat shock response. HSR1 was shown to significantly stimulate the heat-shock factor 1 (HSF1) trimerization and DNA binding. The hamster HSR1 sequence was reported to consist of 604 nucleotides (nt) plus a poly(A) tail and to have only a 4-nt difference with the human HSR1. In this study, we present highly convincing evidence for bacterial origin of the HSR1. No HSR1 sequence was found by exhaustive sequence similarity searches of the publicly available eukaryotic nucleotide sequence databases at the NCBI, including the expressed sequence tags, genome survey sequences, and high-throughput genomic sequences divisions of GenBank, as well as the Trace Archive database of whole genome shotgun sequences, and genome assemblies. Instead, a putative open reading frame (ORF) of HSR1 revealed strong similarity to the amino-terminal region of bacterial chloride channel proteins. Furthermore, the 5′ flanking region of the putative HSR1 ORF showed similarity to the 5′ upstream regions of the bacterial protein genes. We propose that the HSR1 was derived from a bacterial genome fragment either by horizontal gene transfer or by bacterial infection of the cells. The most probable source organism of the HSR1 is a species belonging to the order Burkholderiales.
