Similar literature
 20 similar documents found
1.

Background  

Whole genome shotgun sequencing produces increasingly higher coverage of a genome with random sequence reads. Progressive whole genome assembly and eventual finishing sequencing is a process that typically takes several years for large eukaryotic genomes. In the interim, all sequence reads of public sequencing projects are made available in repositories such as the NCBI Trace Archive. For a particular locus, sequencing coverage may be high enough early on to produce a reliable local genome assembly. We have developed software, Tracembler, that facilitates in silico chromosome walking by recursively assembling reads of a selected species from the NCBI Trace Archive starting with reads that significantly match sequence seeds supplied by the user.
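The recursive recruitment idea in this abstract can be illustrated with a minimal sketch. It is not Tracembler's implementation: the function names `kmers` and `recruit_reads` are hypothetical, an in-memory list of reads stands in for the NCBI Trace Archive, and sharing one exact k-mer is assumed here as a crude substitute for a "significant match".

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(seed, reads, k=8, max_rounds=10):
    """Collect reads that share at least one k-mer with the seed or with
    any read recruited in an earlier round (a toy 'chromosome walk')."""
    known = kmers(seed, k)      # k-mers reachable from the seed so far
    recruited = []              # reads pulled in, in order of discovery
    remaining = list(reads)
    for _ in range(max_rounds):
        hits = [r for r in remaining if kmers(r, k) & known]
        if not hits:
            break               # the walk has stalled: nothing new matches
        for r in hits:
            known |= kmers(r, k)
        recruited.extend(hits)
        remaining = [r for r in remaining if r not in hits]
    return recruited
```

Each round widens the set of known k-mers, so reads that do not touch the seed directly can still be reached through intermediate reads. A real pipeline would replace the k-mer test with an alignment search and pass the recruited reads to an assembler.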

2.
Next‐generation sequencing continues to revolutionize biodiversity studies by generating unprecedented amounts of DNA sequence data for comparative genomic analysis. However, these data are produced as millions or billions of short reads of variable quality that cannot be directly applied in comparative analyses, creating a demand for methods to facilitate assembly. We optimized an in silico strategy to efficiently reconstruct high‐quality mitochondrial genomes directly from genomic reads. We tested this strategy using sequences from five species of frogs: Hylodes meridionalis (Hylodidae), Hyloxalus yasuni (Dendrobatidae), Pristimantis fenestratus (Craugastoridae), and Melanophryniscus simplex and Rhinella sp. (Bufonidae). These are the first mitogenomes published for these species, the genera Hylodes, Hyloxalus, Pristimantis, Melanophryniscus and Rhinella, and the families Craugastoridae and Hylodidae. Sequences were generated using only half of one lane of a standard Illumina HiSeq 2000 flow cell, resulting in fewer than eight million reads. We analysed the reads of Hylodes meridionalis using three different assembly strategies: (1) reference‐based (using bowtie2); (2) de novo (using abyss, soapdenovo2 and velvet); and (3) baiting and iterative mapping (using mira and mitobim). Mitogenomes were assembled exclusively with strategy 3, which we employed to assemble the remaining mitogenomes. Annotations were performed with mitos and confirmed by comparison with published amphibian mitochondria. In most cases, we recovered all 13 coding genes, 22 tRNAs, and two ribosomal subunit genes, with minor gene rearrangements. Our results show that few raw reads can be sufficient to generate high‐quality scaffolds, making any Illumina machine run using genomic multiplex libraries a potential source of data for organelle assemblies as by‐catch.

3.
Traditional approaches for sequencing insertion ends of bacterial artificial chromosome (BAC) libraries are laborious and expensive, and are currently among the bottlenecks limiting a better understanding of the genomic features of auto‐ or allopolyploid species. Here, we developed a highly efficient and low‐cost BAC end analysis protocol, named BAC‐anchor, to identify paired‐end reads containing large internal gaps. Our approach mainly focused on the identification of high‐throughput sequencing reads carrying restriction enzyme cutting sites and searching for large internal gaps based on the mapping locations of both ends of the reads. We sequenced and analysed eight libraries containing over 3 200 000 BAC end clones derived from the BAC library of the tetraploid potato cultivar C88 digested with two restriction enzymes, Cla I and Mlu I. About 25% of the BAC end reads carrying cutting sites generated a 60–100 kb internal gap in the potato DM reference genome, which was consistent with the mapping results of Sanger sequencing of the BAC end clones and indicated large differences between autotetraploid and haploid genotypes in potato. A total of 5341 Cla I‐ and 165 Mlu I‐derived unique reads were distributed on different chromosomes of the DM reference genome and could be used to establish a physical map of target regions and assemble the C88 genome. The reads that matched different chromosomes are especially significant for the further assembly of complex polyploid genomes. Our study provides an example of analysing high‐coverage BAC end libraries with low sequencing cost and is a resource for further genome sequencing studies.

4.
5.
Predicting whether a predator is capable of affecting the dynamics of a prey species in the field implies the analysis of the complete diet of the predator, not simply rates of predation on a target taxon. Here, we employed the Ion Torrent next‐generation sequencing technology to investigate the diet of a generalist arthropod predator. A complete dietary analysis requires the use of general primers, but these will also amplify the predator unless suppressed using a blocking probe. However, blocking probes can potentially block other species, particularly if they are phylogenetically close. Here, we aimed to demonstrate that enough prey sequence could be obtained without blocking probes. In communities with many predators, this approach obviates the need to design and test numerous blocking primers, thus making analysis of complex community food webs a viable proposition. We applied this approach to the analysis of predation by the linyphiid spider Oedothorax fuscus in an arable field. We obtained over two million raw reads. After discarding the low‐quality and predator reads, the libraries still contained over 61 000 prey reads (3% of the raw reads; 6% of reads passing quality control). The libraries were rich in Collembola, Lepidoptera, Diptera and Nematoda. They also contained sequences derived from several spider species and from horticultural pests (aphids). Oedothorax fuscus is common in UK cereal fields, and the results showed that it is exploiting a wide range of prey. Next‐generation sequencing using general primers but without blocking probes provided ample sequences for analysis of the prey range of this spider and proved to be a simple and inexpensive approach.

6.
Du Nan, Chen Jiao, Sun Yanni. BMC Genomics, 2019, 20(2): 49-62
Background

Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variations. It also holds the promise of deciphering the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data have a higher sequencing error rate and lower coverage than popular short-read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies.

Results

In this work, we designed and implemented an overlap detection program named GroupK for third-generation sequencing reads, based on grouped k-mer hits. While using k-mer hits to detect read overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hits were originally designed for homology search. We are the first to apply group hits to long-read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage.

Conclusions

GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
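The grouped k-mer hit idea from the Results can be sketched as follows. This is an illustration only and not GroupK's algorithm: the function names are hypothetical, and the fixed thresholds `max_gap` and `max_shift` are assumed stand-ins for the statistically derived distance constraints the abstract mentions.

```python
from collections import defaultdict


def kmer_hits(a, b, k):
    """Return sorted (i, j) pairs where a[i:i+k] == b[j:j+k]."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    return sorted((i, j) for i in range(len(a) - k + 1)
                  for j in index.get(a[i:i + k], []))


def has_group_hit(a, b, k=4, group_size=3, max_gap=20, max_shift=2):
    """True if a chain of `group_size` non-overlapping k-mer hits exists
    whose consecutive members are at most `max_gap` apart in read a and
    differ in diagonal (i - j) by at most `max_shift`."""
    hits = kmer_hits(a, b, k)
    best = [1] * len(hits)      # longest chain ending at each hit
    for n, (i, j) in enumerate(hits):
        for m in range(n):
            pi, pj = hits[m]
            if (k <= i - pi <= max_gap and j - pj >= k
                    and abs((i - j) - (pi - pj)) <= max_shift):
                best[n] = max(best[n], best[m] + 1)
        if best[n] >= group_size:
            return True
    return False
```

Requiring several short hits on nearly the same diagonal tolerates substitutions and small indels between hits, which is why a group of short k-mers can be more sensitive than one long exact seed on error-prone reads.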


7.
8.
Hybrid assembly strategies that combine long reads from Oxford Nanopore's MinION device with high-depth Illumina paired-end reads have enabled completion and circularization of both plasmids and chromosomes from multiple bacterial strains. Here we demonstrate the utility of supplementing Illumina paired-end reads from a previously published draft genome of P. syringae pv. pisi PP1 with long reads to generate a complete genome sequence for this strain. The phylogenetic placement and genomic repertoire of virulence factors within this strain provide a unique perspective on virulence evolution within P. syringae phylogroup 2, and highlight that strains can rapidly acquire virulence factors through horizontal gene transfer by acquisition of plasmids as well as through chromosomal recombination.

9.
We present the development of a genomic library using the RADseq (restriction site associated DNA sequencing) protocol for marker discovery that can be applied to evolutionary studies of the sugarcane borer Diatraea saccharalis, an important South American insect pest. A RADtag protocol combined with Illumina paired‐end sequencing allowed de novo discovery of 12 811 SNPs and a high‐quality assembly of 122.8M paired‐end reads from six individuals, representing 40 Gb of sequencing data. Approximately 1.7 Mb of the sugarcane borer genome distributed over 5289 minicontigs were obtained upon assembly of second reads from first reads RADtag loci where at least one SNP was discovered and genotyped. Minicontig lengths ranged from 200 to 611 bp and were used for functional annotation and microsatellite discovery. These markers will be used in future studies to understand gene flow and adaptation to host plants and control tactics.

10.
The emergence of third‐generation sequencing (3GS; long‐reads) is bringing closer the goal of chromosome‐size fragments in de novo genome assemblies. This allows the exploration of new and broader questions on genome evolution for a number of nonmodel organisms. However, long‐read technologies result in higher sequencing error rates and therefore impose an elevated cost of sufficient coverage to achieve high enough quality. In this context, hybrid assemblies, combining short‐reads and long‐reads, provide an alternative efficient and cost‐effective approach to generate de novo, chromosome‐level genome assemblies. The array of available software programs for hybrid genome assembly, sequence correction and manipulation is constantly being expanded and improved. This makes it difficult for nonexperts to find efficient, fast and tractable computational solutions for genome assembly, especially in the case of nonmodel organisms lacking a reference genome or one from a closely related species. In this study, we review and test the most recent pipelines for hybrid assemblies, comparing the model organism Drosophila melanogaster to a nonmodel cactophilic Drosophila, D. mojavensis. We show that it is possible to achieve excellent contiguity on this nonmodel organism using the dbg2olc pipeline.

11.

Background  

Next-generation sequencing technologies have led to the high-throughput production of sequence data (reads) at low cost. However, these reads are significantly shorter and more error-prone than conventional Sanger shotgun reads. This poses a challenge for the de novo assembly in terms of assembly quality and scalability for large-scale short read datasets.

12.
Background

A metagenome is a collection of genomes, usually in a micro-environment, and sequencing a metagenomic sample en masse is a powerful means for investigating the community of the constituent microorganisms. One of the challenges is in distinguishing between similar organisms due to rampant multiple possible assignments of sequencing reads, resulting in false positive identifications. We map the problem to a topological data analysis (TDA) framework that extracts information from the geometric structure of data. Here the structure is defined by multi-way relationships between the sequencing reads using a reference database.

Results

Based primarily on the patterns of co-mapping of the reads to multiple organisms in the reference database, we use two models: one a subcomplex of a Barycentric subdivision complex and the other a Čech complex. The Barycentric subcomplex allows a natural mapping of the reads along with their coverage of organisms while the Čech complex takes simply the number of reads into account to map the problem to homology computation. Using simulated genome mixtures we show not just enrichment of signal but also microbe identification with strain-level resolution.

Conclusions

In particular, in the most refractory of cases where alternative algorithms that exploit unique reads (i.e., mapped to unique organisms) fail, we show that the TDA approach continues to show consistent performance. The Čech model that uses less information is equally effective, suggesting that even partial information when augmented with the appropriate structure is quite powerful.


13.
Predicted global climate change threatens the distributional ranges of species worldwide. We identified genes expressed in the intertidal seagrass Zostera noltii during recovery from a simulated low tide heat-shock exposure. Five Expressed Sequence Tag (EST) libraries were compared, corresponding to four recovery times following sub-lethal temperature stress, and a non-stressed control. We sequenced and analyzed 7009 sequence reads from 30 min, 2 h, 4 h and 24 h after the beginning of the heat-shock (AHS), and 1585 from the control library, for a total of 8594 sequence reads. Among 51 Tentative UniGenes (TUGs) exhibiting significantly different expression between libraries, 19 (37.3%) were identified as ‘molecular chaperones’ and were over-expressed following heat-shock, while 12 (23.5%) were ‘photosynthesis TUGs’ generally under-expressed in heat-shocked plants. A time course analysis of expression showed a rapid increase in expression of the molecular chaperone class, most of which were heat-shock proteins, which increased from 2 sequence reads in the control library to almost 230 in the 30 min AHS library, followed by a slow decrease during further recovery. In contrast, ‘photosynthesis TUGs’ were under-expressed 30 min AHS compared with the control library, and declined progressively with recovery time in the stress libraries, with a total of 29 sequence reads 24 h AHS, compared with 125 in the control. A total of 4734 TUGs were screened for EST-Single Sequence Repeats (EST-SSRs) and 86 microsatellites were identified.

14.
15.

Background  

High throughput sequencing (HTS) platforms produce gigabases of short read (<100 bp) data per run. While these short reads are adequate for resequencing applications, de novo assembly of moderate size genomes from such reads remains a significant challenge. These limitations could be partially overcome by utilizing mate pair technology, which provides pairs of short reads separated by a known distance along the genome.

16.
Microeukaryotic plankton (0.2–200 μm) are critical components of aquatic ecosystems and key players in global ecological processes. High‐throughput sequencing is currently revolutionizing their study on an unprecedented scale. However, it is currently unclear whether we can accurately, effectively and quantitatively depict the microeukaryotic plankton communities using traditional size‐fractionated filtering combined with molecular methods. To address this, we analysed the eukaryotic plankton communities both with, and without, prefiltering with a 200 μm pore‐size sieve, using SSU rDNA‐based high‐throughput sequencing on 16 samples with three replicates in each sample from two subtropical reservoirs sampled from January to October in 2013. We found that ~25% of reads were classified as metazoan in both size groups. The species richness, alpha and beta diversity of plankton community and relative abundance of reads in 99.2% eukaryotic OTUs showed no significant changes after prefiltering with a 200 μm pore‐size sieve. We further found that both >0.2 μm and 0.2–200 μm eukaryotic plankton communities, especially the abundant plankton subcommunities, exhibited very similar, and synchronous, spatiotemporal patterns and processes associated with almost identical environmental drivers. The lack of an effect on community structure from prefiltering suggests that environmental DNA from larger metazoa is introduced into the smaller size class. Therefore, size‐fractionated filtering with 200 μm is insufficient to discriminate between the eukaryotic plankton size groups in metabarcoding approaches. Our results also highlight the importance of sequencing depth, and strict quality filtering of reads, when designing studies to characterize microeukaryotic plankton communities.

17.
The high‐throughput capacities of the Illumina sequencing platforms and the possibility to label samples individually have encouraged wide use of sample multiplexing. However, this practice results in read misassignment (usually <1%) across samples sequenced on the same lane. Alarmingly high rates of read misassignment of up to 10% were reported for Illumina sequencing machines with exclusion amplification chemistry. This may make use of these platforms prohibitive, particularly in studies that rely on low‐quantity and low‐quality samples, such as historical and archaeological specimens. Here, we use barcodes, short sequences that are ligated to both ends of the DNA insert, to directly quantify the rate of index hopping in 100‐year‐old museum‐preserved gorilla (Gorilla beringei) samples. Correcting for multiple sources of noise, we identify on average 0.470% of reads containing a hopped index. We show that sample‐specific quantity of misassigned reads depends on the number of reads that any given sample contributes to the total sequencing pool, so that samples with few sequenced reads receive the greatest proportion of misassigned reads. This particularly affects ancient DNA samples, as these frequently differ in their DNA quantity and endogenous content. Through simulations we show that even low rates of index hopping, as reported here, can lead to biases in ancient DNA studies when multiplexing samples with vastly different quantities of endogenous material.
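The dependence of misassignment on a sample's share of the pool can be made concrete with a toy calculation. This is a sketch under loud assumptions, not the authors' simulation: the function name `foreign_fraction` is hypothetical, every read is assumed to hop with the same probability (0.0047, the 0.470% average reported above), and hopped reads are assumed to be split evenly among the other samples.

```python
def foreign_fraction(read_counts, hop_rate=0.0047):
    """For each sample, estimate the share of its observed reads that
    originated in another sample, under a uniform-hopping toy model.
    Requires at least two samples."""
    n = len(read_counts)
    fractions = []
    for idx, own in enumerate(read_counts):
        kept = own * (1 - hop_rate)
        # Every other sample loses hop_rate of its reads, split evenly
        # among the remaining n - 1 samples.
        gained = sum(c * hop_rate / (n - 1)
                     for j, c in enumerate(read_counts) if j != idx)
        fractions.append(gained / (kept + gained))
    return fractions
```

With equal read counts every sample ends up with a foreign share equal to the hop rate, whereas a sample that contributes few reads next to a much deeper one receives a far larger foreign share, which matches the qualitative pattern the abstract describes.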

18.
19.
Sophora japonica is a traditional Chinese medicinal ingredient that is widely used in the medicine, food, and industrial dye industries. Since flavonoids are the main components of S. japonica, studying the flavonoid composition and content of this plant is important. This study aimed to identify molecules involved in the flavonoid biosynthetic pathways in S. japonica. Deep sequencing was performed, and 85,877,352 clean reads were filtered from 86,095,152 raw reads. The clean reads were spliced to obtain 111,382 unigenes, which were then annotated with NR, GO, KEGG, and eggNOG. Differential expression analysis and NR function prediction revealed 18 differentially expressed unigenes associated with 13 enzymes in flavonoid biosynthetic pathways. Our results reveal new insights into secondary metabolite biosynthesis‐related genes in S. japonica and enhance the potential applications of S. japonica in genetic engineering.

20.