Similar literature
 20 similar articles found (search time: 46 ms)
1.
With the advent of short-read-based genome sequencing approaches, a large number of organisms are being sequenced all over the world. Most of these assemblies are produced with de novo short-read assemblers and related approaches. However, contigs produced this way are prone to mis-assembly. So far, there is a conspicuous dearth of reliable tools to identify mis-assembled contigs. Mis-assemblies can result from incorrectly deleted or wrongly arranged genomic sequences. In the present work, various factors related to sequence, sequencing and assembly have been assessed for their role in causing mis-assembly, using different genome sequencing data. Some mis-assembly detection tools have also been evaluated for their ability to detect wrongly assembled primary contigs, and the results suggest considerable scope for improvement in this area. The present work also proposes a simple, novel unsupervised-learning-based approach to identify mis-assemblies in contigs, which performed reasonably well compared with existing tools for reporting mis-assembled contigs. The proposed methodology may work as a complementary system to the existing tools to enhance their accuracy.

2.
Genomics, 2022, 114(6): 110514
Omphalotus guepiniiformis, a bioluminescent mushroom species, is a source of a potentially valuable anticancer compound. To provide genome information, we assembled a high-quality O. guepiniiformis genome de novo using two next-generation sequencing techniques, PacBio and Illumina sequencing. Our draft O. guepiniiformis genome comprises 42.5 Mbp of sequence in only 80 contigs, with an N50 sequence length of over 1 Mbp. There were 15,554 predicted coding genes, and 7693 genes were functionally annotated with Gene Ontology terms. We performed a genomic study focusing on mushroom bioluminescent pathway cluster genes by comparing 17 luminescent and 23 non-luminescent Agaricales species belonging to 23 genera. Synteny analysis of genomic regions near the luminescent pathway cluster genes inferred that the Omphalotus lineage was genus-specific. In summary, our de novo assembled O. guepiniiformis genome provides significant biological insights into this organism, including the evolution of the luciferase gene block, and forms the basis for future analyses.
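
The N50 statistic quoted in this abstract can be illustrated with a short sketch. It uses the usual definition (the length of the contig at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size). The function name and the example lengths are hypothetical and are not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp).

    Assumption: N50 is the length L such that contigs of length >= L
    together contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    # Walk the contigs from longest to shortest.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy assembly, not data from the abstract.
toy_lengths = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_lengths))
```

An assembly with few, long contigs (such as the 80 contigs reported here) yields a high N50, which is why the statistic is commonly read as a measure of contiguity.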

3.
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to the lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, no report is currently available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rate, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly of different-sized genomes using graph-based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes with all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM, depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
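
To make the read-depth figures above concrete, the following sketch applies the standard relation depth = (number of reads × read length) / genome size. The 100 bp read length is an assumption chosen for illustration (the abstract does not state the read length), and both function names are hypothetical.

```python
def mean_depth(num_reads, read_length_bp, genome_size_bp):
    """Average sequencing depth (coverage) for single-end reads."""
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp, target_depth, read_length_bp):
    """Smallest number of reads that reaches the target depth."""
    bases_required = genome_size_bp * target_depth
    # Ceiling division with integers only, to avoid float rounding.
    return -(-bases_required // read_length_bp)


# E. coli genome size from the abstract (4.6 Mb); the 100 bp read
# length is an assumed value, not a figure reported by the study.
print(reads_needed(4_600_000, 50, 100))
print(mean_depth(2_300_000, 100, 4_600_000))
```

Under these assumptions, the same calculation can be repeated for the other genome sizes and for the 100X depth reported for Meraculous when planning how many samples to multiplex on a run.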

4.
Current computational methods used to analyze changes in DNA methylation and chromatin modification rely on sequenced genomes. Here we describe a pipeline for the detection of these changes from short-read sequence data that does not require a reference genome. Open source software packages were used for sequence assembly, alignment, and measurement of differential enrichment. The method was evaluated by comparing results with reference-based results, showing a strong correlation between chromatin modification and gene expression. We then used our de novo sequence assembly to build the DNA methylation profile for the non-referenced Psammomys obesus genome. The pipeline described uses open source software for fast annotation and visualization of unreferenced genomic regions from short-read data.

5.
Genome synthesis gives scientists the ability to create, de novo, genomes that are absent in nature by thoroughly redesigning DNA sequences and introducing numerous custom features. However, genome synthesis is labor- and time-consuming, and it is a challenge to verify and quantify a synthetic genome rapidly and precisely. For this reason, specific DNA sequences that differ from native genomic sequences, known as genomic markers, are designed into synthetic genomes during synthesis. Genomic markers can be easily detected by PCR, whole-genome sequencing (WGS) and a variety of other methods to distinguish the synthetic genome from the native one. Here, we review the types and applications of genomic markers used in synthetic genomes, with the hope of providing guidance for future work.

6.
7.
8.
Long terminal repeat (LTR) retrotransposons and endogenous retroviruses (ERVs) are transposable elements in eukaryotic genomes that are well suited for computational identification. De novo identification tools determine the position of potential LTR retrotransposon or ERV insertions in genomic sequences. For further analysis, it is desirable to obtain an annotation of the internal structure of such candidates. This article presents LTRdigest, a novel software tool for automated annotation of internal features of putative LTR retrotransposons. It uses local alignment and hidden Markov model-based algorithms to detect retrotransposon-associated protein domains as well as primer binding sites and polypurine tracts. As an example, we used LTRdigest results to identify 88 (near) full-length ERVs in the chromosome 4 sequence of Mus musculus, separating them from truncated insertions and other repeats. Furthermore, we propose a workflow for the use of LTRdigest in de novo LTR retrotransposon classification and perform an exemplary de novo analysis on the Drosophila melanogaster genome as a proof of concept. Using a new method based solely on the annotations generated by LTRdigest, 518 potential LTR retrotransposons were automatically assigned to 62 candidate groups. Representative sequences from 41 of these 62 groups were matched to reference sequences with >80% global sequence similarity.

9.
10.
Novel sequences are DNA sequences present in an individual's genome but absent in the human reference assembly. They are predicted to be biologically important, both individual and population specific, and consistent with the known human migration paths. Recent works have shown that an average person harbors 2–5 Mb of such sequences and estimated that the human pan-genome contains as much as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used, but no computational method has been specifically proposed for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), a software tool that can accurately and efficiently identify novel sequences in an individual's de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequences per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparisons of novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in the NA12878 de novo assembly. NSIT requires 2 GB of RAM and 1.5–2 hrs on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as high as 50 Mb. It works on both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task.
To our knowledge, NSIT is the first software designed specifically for novel sequence identification in a de novo human genome assembly.

11.
We have developed a method for the de novo discovery of genetic variations, including single nucleotide polymorphisms and mutations, on microelectronic chip devices. The method combines the features of electronically controlled DNA hybridisation on open-format microarrays with mutation detection by a fluorescence-labelled mismatch-binding protein. Electronic addressing of DNA strands to distinct test sites of the chip allows parallel analysis of several individuals, as demonstrated for mutations in different exons of the p53 gene. This microelectronic chip-based mutation discovery assay may substitute for time-consuming sequencing studies and will complement existing technologies in genomic research.

12.
13.
A major challenge in designing proteins de novo to bind user-defined ligands with high affinity is finding backbone structures into which a new binding site geometry can be engineered with high precision. Recent advances in methods to generate protein fold families de novo have expanded the space of accessible protein structures, but it is not clear to what extent de novo proteins with diverse geometries also expand the space of designable ligand binding functions. We constructed a library of 25,806 high-quality ligand binding sites and developed a fast protocol to place (“match”) these binding sites into both naturally occurring and de novo protein families with two fold topologies: Rossmann and NTF2. Each matching step involves engineering new binding site residues into each protein “scaffold”, which is distinct from the problem of comparing already existing binding pockets. 5,896 and 7,475 binding sites could be matched to the Rossmann and NTF2 fold families, respectively. De novo designed Rossmann and NTF2 protein families can support 1,791 and 678 binding sites, respectively, that cannot be matched to naturally existing structures with the same topologies. While the number of protein residues in a ligand binding site is the major determinant of matching success, ligand size and primary sequence separation of binding site residues also play important roles. The number of matched binding sites is a power-law function of the number of members in a fold family. Our results suggest that de novo sampling of geometric variations on diverse fold topologies can significantly expand the space of designable ligand binding sites for a wealth of possible new protein functions.

14.

Background

The different regions of a genome do not evolve at the same rate. For example, comparative genomic studies have suggested that the sex chromosomes and the regions harbouring the immune defence genes in the Major Histocompatibility Complex (MHC) may evolve faster than other genomic regions. The advent of next-generation sequencing technologies has made it possible to study which genomic regions are evolutionarily liable to change and which are static, as well as enabling an increasing number of genome studies of non-model species. However, de novo sequencing of the whole genome of an organism remains non-trivial. In this study, we present the draft genome of the black grouse, which was developed using a reference-guided assembly strategy.

Results

We generated 133 Gbp of sequence data from one black grouse individual on the SOLiD platform and used a combination of de novo assembly and chicken reference genome mapping to assemble the reads into 4572 scaffolds with a total length of 1022 Mb. The draft genome covers well the main chicken chromosomes 1–28 and Z, which have a total length of 1001 Mb. The draft genome is fragmented but has good coverage of the homologous chicken genes. In particular, 33.0% of the coding regions of the homologous genes have more than 90% of their sequence covered. In addition, we identified ~1 M SNPs from the genome and 106 genomic regions with high nucleotide divergence between black grouse and chicken or between black grouse and turkey.

Conclusions

Our results support the hypothesis that chromosome X (Z) evolves faster than the autosomes, and our data are consistent with the MHC regions being more liable to change than the genome average. Our study demonstrates how a moderate sequencing effort can be combined with existing genome references to generate a draft genome for a non-model species.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-180) contains supplementary material, which is available to authorized users.

15.
Protein interaction maps can reveal novel pathways and functional complexes, allowing ‘guilt by association’ annotation of uncharacterized proteins. To address the need for large-scale protein interaction analyses, a bacterial two-hybrid system was coupled with a whole genome shotgun sequencing approach for microbial genome analysis. We report the first large-scale proteomics study using this system, integrating de novo genome sequencing with functional interaction mapping and annotation in a high-throughput format. We apply the approach by shotgun sequencing and annotating the genome of Rickettsia sibirica strain 246, an obligate intracellular human pathogen among the Spotted Fever Group rickettsiae. The bacteria invade endothelial cells and cause lysis after large amounts of progeny have accumulated. Little is known about specific Rickettsial virulence factors and their mode of pathogenicity. Analysis of the combined genomic sequence and protein–protein interaction data for a set of virulence-related Type IV secretion system (T4SS) proteins revealed over 250 interactions and will provide insight into the mechanism of Rickettsial pathogenicity.

16.
Rapid advances in second- and even third-generation sequencing technologies have made whole-genome sequencing a routine procedure. However, the methods for assembling the obtained sequences, and their results, require special consideration. Modern assemblers are based on heuristic algorithms, which lead to fragmented genome assemblies composed of scaffolds and contigs of different lengths, whose order along a chromosome and assignment to a particular chromosome often remain unknown. In this regard, the resulting genome sequence can only be considered a draft assembly. A principal improvement in the quality and reliability of a draft assembly can be achieved by targeted sequencing of genome elements of different sizes, e.g., chromosomes, chromosomal regions, and DNA fragments cloned in different vectors, as well as by using a reference genome, optical mapping, and Hi-C technology. This approach, in addition to simplifying the assembly of the genome draft, will more accurately identify numerical and structural chromosomal variations and abnormalities in the genomes of the studied species. In this review, we discuss the key technologies for genome sequencing and de novo assembly, as well as different approaches to improving the quality of existing draft genome sequences.

17.
18.

Background

Third generation sequencing methods, like SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read lengths than Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo sequencing or re-sequencing projects. Sequences generated for these purposes will contain not only reads originating from the nuclear genome but also a significant number of reads originating from the organelles of the target organism. These reads are usually discarded, but they can also be used for an assembly of organellar replicons. The long read length supports resolution of repetitive regions and repeats within the organellar genome, which might be problematic when using short read data alone. Additionally, SMRT sequencing is less influenced by GC-rich areas and by long stretches of the same base.

Results

We describe a workflow for a de novo assembly of the sugar beet (Beta vulgaris ssp. vulgaris) chloroplast genome sequence based only on data originating from a SMRT sequencing dataset targeted at its nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high quality assembly with a higher reliability than assemblies derived from e.g. Illumina reads only. The chloroplast genome is especially challenging for de novo assembly as it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even though long reads are used for the assembly.

Conclusions

SMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high quality de novo assembly of the chloroplast of the sequenced organism. Even with a relatively small overall coverage for the nuclear genome, it is possible to collect more than enough reads to generate a high quality assembly that outperforms short-read-based assemblies. However, even with long reads it is not always possible to reliably clarify the order of elements of a chloroplast genome sequence, as we demonstrated with Fosmid End Sequences (FES) generated with Sanger technology. Nevertheless, this limitation also applies to short read sequencing data but is reached in this case at a much earlier stage during finishing.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0726-6) contains supplementary material, which is available to authorized users.

19.
The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (e.g., var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads with an average length of 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.
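
The (A + T) percentages quoted in this abstract are base-composition fractions. As a rough sketch of how such a figure could be computed for a whole sequence or for fixed-size windows (for example, to flag extremely AT-rich regions such as the centromeres mentioned above), consider the following. The function names, the window scheme and the toy sequences are assumptions for illustration and do not reproduce the authors' analysis.

```python
def at_fraction(sequence):
    """Fraction of A and T among the unambiguous bases (A, C, G, T)."""
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if at + gc == 0:
        raise ValueError("sequence contains no unambiguous bases")
    # Ambiguous symbols such as N are ignored in the denominator.
    return at / (at + gc)


def windowed_at(sequence, window):
    """AT fraction for consecutive, non-overlapping windows.

    A trailing partial window is skipped so that every value is
    computed over the same number of positions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        at_fraction(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, window)
    ]


# Toy sequences, not P. falciparum data.
print(at_fraction("ATATGC"))
print(windowed_at("AAAAGGGG", 4))
```

Windows whose AT fraction approaches 1.0 correspond to the kind of low-complexity sequence that the abstract describes as difficult for short-read, amplification-based approaches.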

20.
The proliferation of genomic sequencing approaches has significantly impacted the field of phylogenetics. Target capture approaches provide a cost-effective, fast and easily applied strategy for phylogenetic inference of non-model organisms. However, several existing target capture processing pipelines are incapable of incorporating whole genome sequencing (WGS). Here, we develop a new pipeline for capture and de novo assembly of the targeted regions using whole genome re-sequencing reads. This new pipeline captured targeted loci accurately, and given its unbiased nature, can be used with any target capture probe set. Moreover, due to its low computational demand, this new pipeline may be ideal for users with limited resources and when high-coverage sequencing outputs are required. We demonstrate the utility of our approach by incorporating WGS data into the first comprehensive phylogenomic reconstruction of the freshwater mussel family Margaritiferidae. We also provide a catalogue of well-curated functional annotations of these previously uncharacterized freshwater mussel-specific target regions, representing a complementary tool for scrutinizing phylogenetic inferences while expanding future applications of the probe set.
