Similar literature
Found 20 similar documents (search time: 15 ms)
4.
Epigenetics, 2013, 8(10):1329-1338
Current computational methods used to analyze changes in DNA methylation and chromatin modification rely on sequenced genomes. Here we describe a pipeline for detecting these changes from short-read sequence data that does not require a reference genome. Open-source software packages were used for sequence assembly, alignment, and measurement of differential enrichment. The method was evaluated by comparing its results with reference-based results, which showed a strong correlation between chromatin modification and gene expression. We then used our de novo sequence assembly to build the DNA methylation profile for the non-referenced Psammomys obesus genome. The pipeline described uses open-source software for fast annotation and visualization of unreferenced genomic regions from short-read data.

5.
Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal, and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarity to annotated genes to confidently predict gene function or homology. Recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also propose new quality metrics suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested on in silico spike-in, clinical, and environmental NGS datasets, yields significantly better contigs than current approaches.
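The de Bruijn graph assemblers named above share a common core: reads are chopped into k-mers, each k-mer contributing an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, and unambiguous paths through the graph become contigs. A deliberately minimal sketch of that idea (not the authors' ensemble pipeline; real assemblers also handle sequencing errors, reverse complements, and branch resolution):

```python
from collections import defaultdict

def de_bruijn_edges(reads, k):
    """Build de Bruijn graph edges: each k-mer links its (k-1)-mer prefix
    to its (k-1)-mer suffix. Duplicate k-mers collapse into one edge."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk_unambiguous(graph, start):
    """Greedily extend a contig from `start` while each node has
    exactly one successor (i.e., the path is unambiguous)."""
    contig, node = start, start
    while len(graph.get(node, ())) == 1:
        nxt = next(iter(graph[node]))
        contig += nxt[-1]
        node = nxt
        if len(contig) > 10_000:  # guard against cycles in repetitive sequence
            break
    return contig

# Three overlapping reads from the toy sequence ATGGCGTGCA reassemble it:
graph = de_bruijn_edges(["ATGGCG", "GGCGTG", "CGTGCA"], k=4)
contig = walk_unambiguous(graph, "ATG")  # "ATGGCGTGCA"
```

The partitioned sub-assembly and overlap-layout-consensus stages of the ensemble operate on different principles; this sketch only shows why overlapping short reads can recover a longer contig at all.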

6.

Background

Sampling genomes with Fosmid vectors and sequencing pooled Fosmid libraries on the Illumina massively parallel sequencing platform is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality.

Results

In order to sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole-genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package, GAM-NGS.
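As a rough sanity check on the sampling scheme above, the number of Fosmid inserts N needed for a target mean coverage c of a genome of size G with insert length L follows from c = N * L / G. A small sketch (the ~20 Gb figure for the spruce genome is an outside approximation used for illustration, not a number taken from this abstract):

```python
def fosmids_needed(genome_size_bp: int, insert_len_bp: int, target_coverage: float) -> int:
    """Solve c = N * L / G for N, the number of Fosmid inserts required."""
    return round(target_coverage * genome_size_bp / insert_len_bp)

# Two-fold coverage of a ~20 Gb conifer genome with ~40,000 bp inserts
# works out to about a million Fosmid clones:
n_inserts = fosmids_needed(20_000_000_000, 40_000, 2.0)  # 1,000,000
```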

Conclusions

By exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process. We have made the input data (FASTQ format) for the set of pools used in this study publicly available at ftp://congenie.org/congenie/Nystedt_2013/Assembly/ProcessedData/FosmidPools/ (alternatively accessible via http://congenie.org/downloads). The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-439) contains supplementary material, which is available to authorized users.

7.
Jiang Shuangying, Tang Yuanwei, Xiang Liang, Zhu Xinlu, Cai Zelin, Li Ling, Chen Yingxi, Chen Peishuang, Feng Yuge, Lin Xin, Li Guoqiang, Sharif Jafar, Dai Junbiao. Science China Life Sciences, 2022, 65(7):1445-1455
Science China Life Sciences - Synthetic genomics has provided new bottom-up platforms for the functional study of viral and microbial genomes. The construction of the large, gigabase (Gb)-sized...

8.
We offer a guide to de novo genome assembly using sequence data generated by the Illumina platform, aimed at biologists working with fungi or other organisms whose genomes are smaller than 100 Mb. The guide requires no familiarity with sequencing assembly technology or associated computer programs. It defines commonly used terms in genome sequencing and assembly; provides examples of assembling short-read genome sequence data for four strains of the fungus Grosmannia clavigera using four assembly programs; gives examples of protocols and software; and presents a commented flowchart that extends from DNA preparation for submission to a sequencing center through to processing and assembly of the raw sequence reads using freely available operating systems and software.

10.
Vezzi F, Narzisi G, Mishra B. PLoS ONE, 2012, 7(2):e31002
The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: metrics like N50 and number of contigs focus only on size without proportionately emphasizing the correctness of the assembly, while comparisons performed on simulated datasets can be highly biased by the unrealistic assumptions of the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-off between contig quality and contig size. Nevertheless, the relationships among the different features and their relative importance remain unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques such as principal and independent component analysis, we were able to estimate the "excess dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describe assembler performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers.
Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to unrealistic results.
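For reference, the N50 metric criticized above is defined as the contig length L such that contigs of length >= L contain at least half of the total assembly length. A minimal implementation makes its size-only nature obvious:

```python
def n50(contig_lengths):
    """Return N50: the largest length L such that contigs of length >= L
    account for at least half of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

print(n50([100, 80, 50, 30, 20]))  # 80
```

Note that N50 is computed from contig lengths alone: an assembly riddled with chimeric joins can score a higher N50 than an accurate but fragmented one, which is precisely the weakness the FRC-based analysis exposes.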

14.
Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1-50 kb) including insertions, deletions, inversions and their precise breakpoints, and in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1 to 23 kb. Validation using computational and experimental methods suggests that we achieve overall <6% false-positive rate and <10% false-negative rate in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.
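Conceptually, assembly-based SV calling aligns assembled contigs back to a reference and reads insertions, deletions, and their breakpoints off the alignment differences. The toy sketch below illustrates the idea with Python's difflib; it is not the authors' method, and a real pipeline would use a whole-genome aligner rather than generic string diffing:

```python
from difflib import SequenceMatcher

def call_svs(reference: str, contig: str, min_size: int = 1):
    """Report (type, reference breakpoint, sequence) for insertions and
    deletions between a reference and an assembled contig."""
    svs = []
    sm = SequenceMatcher(None, reference, contig, autojunk=False)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "delete" and i2 - i1 >= min_size:
            svs.append(("DEL", i1, reference[i1:i2]))
        elif tag == "insert" and j2 - j1 >= min_size:
            svs.append(("INS", i1, contig[j1:j2]))
    return svs

# A 4 bp deletion relative to the reference, with its breakpoint at position 8:
print(call_svs("AAAACCCCGGGGTTTT", "AAAACCCCTTTT"))
```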

17.
Pitzer E, Masselot A, Colinge J. Proteomics, 2007, 7(17):3051-3054
De novo peptide sequencing algorithms are often tested on relatively small data sets made of excellent spectra. As ever more tandem mass spectra become available, we have assembled six large, reliable, and diverse data sets (covering three mass spectrometer types) intended for such tests, and we make them accessible via a web server. To exemplify their use, we investigate the performance of Lutefisk, PepNovo, and PepNovoTag, three well-established de novo peptide sequencing programs.

18.
Pignatelli M, Moya A. PLoS ONE, 2011, 6(5):e19984
A frequent step in metagenomic data analysis is the assembly of the sequenced reads. Many assembly tools targeting next-generation sequencing (NGS) data have been published in recent years, but these assemblers have not been designed for, or tested in, the multi-genome scenarios that characterize metagenomic studies. Here we provide a critical assessment of current de novo short-read assembly tools in multi-genome scenarios using complex simulated metagenomic data. With this approach we tested the fidelity of different assemblers in metagenomic studies, demonstrating that even under the simplest compositions the number of chimeric contigs involving different species is noticeable. We further show that the assembly process reduces the accuracy of the functional classification of the metagenomic data, and that these errors can be overcome by raising the coverage of the studied metagenome. The results presented here highlight the particular difficulties that de novo genome assemblers face in multi-genome scenarios, demonstrating that these difficulties, which often compromise the functional classification of the analyzed data, can be overcome with a high sequencing effort.

19.

Background  

Next-generation sequencing technologies allow genomes to be sequenced more quickly and less expensively than ever before. However, as sequencing technology has improved, the difficulty of de novo genome assembly has increased, due in large part to the shorter reads generated by the new technologies. The use of mated sequences (referred to as mate-pairs) is a standard means of disambiguating assemblies to obtain a more complete picture of the genome without resorting to manual finishing. Here, we examine the effectiveness of mate-pair information in resolving repeated sequences in the DNA, a paramount obstacle in assembly. While it has been empirically accepted that mate-pairs improve assemblies, and a variety of assemblers use mate-pairs in the context of repeat resolution, their effectiveness in this context has not been systematically evaluated in previous literature.
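The intuition behind mate-pair repeat resolution can be sketched in a few lines: when one read of a pair maps uniquely and its mate maps to several repeat copies, only the copy whose distance from the unique read matches the library insert size is a consistent placement. The positions and tolerances below are hypothetical, chosen only to illustrate the constraint:

```python
def consistent_placements(unique_pos: int, candidate_positions, insert_size: int,
                          tolerance: int):
    """Keep repeat-copy positions whose distance from the uniquely mapped
    mate is within `tolerance` of the library insert size."""
    return [p for p in candidate_positions
            if abs(abs(p - unique_pos) - insert_size) <= tolerance]

# One read maps uniquely at 1000; its mate hits three repeat copies.
# With a 3 kb insert library, only the copy at 4000 is consistent:
hits = consistent_placements(1000, [1500, 4000, 9000], insert_size=3000, tolerance=200)
```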

20.
Assembling individual genomes from complex community metagenomic data remains a challenging issue for environmental studies. We evaluated the quality of genome assemblies from community short-read data (Illumina 100 bp paired-end sequences) using datasets recovered from freshwater and soil microbial communities as well as in silico simulations. Our analyses revealed that the genome of a single genotype (or species) can be accurately assembled from a complex metagenome when it has at least about 20× coverage. At lower coverage, however, the derived assemblies contained a substantial fraction of non-target sequences (chimeras), which explains, at least in part, the higher number of hypothetical genes recovered in metagenomic relative to genomic projects. We also provide examples of how to detect intrapopulation structure in metagenomic datasets and how to estimate the type and frequency of errors in assembled genes and contigs from datasets of varied species complexity.
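The ~20× threshold reported above can be put in context with the classic Lander-Waterman model: for N reads of length L drawn from a genome of size G, the mean depth is c = NL/G, and the expected fraction of the genome covered at least once is approximately 1 - e^(-c). A sketch with illustrative numbers (not figures from the study):

```python
import math

def mean_depth(n_reads: int, read_len_bp: int, genome_size_bp: int) -> float:
    """Mean sequencing depth c = N * L / G."""
    return n_reads * read_len_bp / genome_size_bp

def fraction_covered(depth: float) -> float:
    """Lander-Waterman estimate of the fraction of bases covered by >= 1 read."""
    return 1.0 - math.exp(-depth)

# Illustrative: 800,000 reads of 100 bp over a 4 Mb genome give 20x depth,
# at which essentially every base of that genotype is covered at least once.
c = mean_depth(800_000, 100, 4_000_000)
```

The model assumes reads land uniformly at random, so it is optimistic for real metagenomes, where uneven species abundance concentrates depth on dominant genotypes and leaves rare ones below the assembly threshold.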
