共查询到20条相似文献,搜索用时 15 毫秒
1.
Annotated genomes can provide new perspectives on the biology of species. We present the first de novo whole genome sequencing for the pink-footed goose. In order to obtain a high-quality de novo assembly the strategy used was to combine one short insert paired-end library with two mate-pair libraries. The pink-footed goose genome was assembled de novo using three different assemblers and an assembly evaluation was subsequently performed in order to choose the best assembler. For our data, ALLPATHS-LG performed the best, since the assembly produced covers most of the genome, while introducing the fewest errors. A total of 26,134 genes were annotated, with bird species accounting for virtually all BLAST hits. We also estimated the substitution rate in the pink-footed goose, which can be of use in future demographic studies, by using a comparative approach with the genome of the chicken, the mallard and the swan goose. A substitution rate of 1.38 × 10 ? 7 per nucleotide per generation was obtained when comparing the genomes of the two closely-related goose species (the pink-footed and the swan goose). Altogether, we provide a valuable tool for future genomic studies aiming at particular genes and regions of the pink-footed goose genome as well as other bird species. 相似文献
2.
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community. 相似文献
3.
The human genome reference (HGR) completion marked the genomics era beginning, yet despite its utility universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads failing to map within the HGR. Sequences failing to map generally represent 2–5 % of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40 % showing high sequence complexity. Genomic coordinates were generated for 99.9 %, with 52.5 % exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10–20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine. 相似文献
4.
We describe a new assembly algorithm, where a genome assembly with low sequence coverage, either throughout the genome or
locally, due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the
information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially
improve the quality of the resulting assembly. 相似文献
5.
We offer a guide to de novo genome assembly1 using sequence data generated by the Illumina platform for biologists working with fungi or other organisms whose genomes are less than 100 Mb in size. The guide requires no familiarity with sequencing assembly technology or associated computer programs. It defines commonly used terms in genome sequencing and assembly; provides examples of assembling short- read genome sequence data for four strains of the fungus Grosmannia clavigera using four assembly programs; gives examples of protocols and software; and presents a commented flowchart that extends from DNA preparation for submission to a sequencing center, through to processing and assembly of the raw sequence reads using freely available operating systems and software. 相似文献
6.
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ~280 bp or ~3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed. 相似文献
7.
Whole genome amplification by the multiple displacement amplification (MDA) method allows sequencing of DNA from single cells of bacteria that cannot be cultured. Assembling a genome is challenging, however, because MDA generates highly nonuniform coverage of the genome. Here we describe an algorithm tailored for short-read data from single cells that improves assembly through the use of a progressively increasing coverage cutoff. Assembly of reads from single Escherichia coli and Staphylococcus aureus cells captures >91% of genes within contigs, approaching the 95% captured from an assembly based on many E. coli cells. We apply this method to assemble a genome from a single cell of an uncultivated SAR324 clade of Deltaproteobacteria, a cosmopolitan bacterial lineage in the global ocean. Metabolic reconstruction suggests that SAR324 is aerobic, motile and chemotaxic. Our approach enables acquisition of genome assemblies for individual uncultivated bacteria using only short reads, providing cell-specific genetic information absent from metagenomic studies. 相似文献
8.
BackgroundSampling genomes with Fosmid vectors and sequencing of pooled Fosmid libraries on the Illumina platform for massive parallel sequencing is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality. ResultsIn order to sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package GAM-NGS. ConclusionsBy exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process.We have made public the input data (FASTQ format) for the set of pools used in this study: ftp://congenie.org/congenie/Nystedt_2013/Assembly/ProcessedData/FosmidPools/.(alternatively accessible via http://congenie.org/downloads).The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/. Electronic supplementary materialThe online version of this article (doi:10.1186/1471-2164-15-439) contains supplementary material, which is available to authorized users. 相似文献
9.
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby, inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality against their sizes. For this purpose, most of the publicly available major sequence assemblers--both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies--are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium. 相似文献
10.
MOTIVATION: DNA repeats are a common feature of most genomic sequences. Their de novo identification is still difficult despite being a crucial step in genomic analysis and oligonucleotides design. Several efficient algorithms based on word counting are available, but too short words decrease specificity while long words decrease sensitivity, particularly in degenerated repeats. RESULTS: The Repeat Analysis Program (RAP) is based on a new word-counting algorithm optimized for high resolution repeat identification using gapped words. Many different overlapping gapped words can be counted at the same genomic position, thus producing a better signal than the single ungapped word. This results in better specificity both in terms of low-frequency detection, being able to identify sequences repeated only once, and highly divergent detection, producing a generally high score in most intron sequences. AVAILABILITY: The program is freely available for non-profit organizations, upon request to the authors. CONTACT: giorgio.valle@unipd.it SUPPLEMENTARY INFORMATION: The program has been tested on the Caenorhabditis elegans genome using word lengths of 12, 14 and 16 bases. The full analysis has been implemented in the UCSC Genome Browser and is accessible at http://genome.cribi.unipd.it. 相似文献
11.
The advent of next-generation sequencing technologies is accompanied with the development of many whole-genome sequence assembly methods and software, especially for de novo fragment assembly. Due to the poor knowledge about the applicability and performance of these software tools, choosing a befitting assembler becomes a tough task. Here, we provide the information of adaptivity for each program, then above all, compare the performance of eight distinct tools against eight groups of simulated datasets from Solexa sequencing platform. Considering the computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicate that string-based assemblers, overlap-layout-consensus (OLC) assemblers are well-suited for very short reads and longer reads of small genomes respectively. For large datasets of more than hundred millions of short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, of which SOAPdenovo is complex for the creation of configuration file. Our comparison study will assist researchers in selecting a well-suited assembler and offer essential information for the improvement of existing assemblers or the developing of novel assemblers. 相似文献
12.
Phytochrome-like genes in the wild plant species Rhazya stricta Decne were characterized using a de novo genome assembly of next generation sequence data. Rhazya stricta contains more than 100 alkaloids with multiple pharmacological properties, and leaf extracts have been used to cure chronic rheumatism, to treat tumors, and in the treatment of several other diseases. Phytochromes are known to be involved in the light-regulated biosynthesis of some alkaloids. Phytochromes are soluble chromoproteins that function in the absorption of red and far-red light and the transduction of intracellular signals during light-regulated plant development. De novo assembly of the nuclear genome of R. stricta recovered 45,641 contigs greater than 1000 bp long, which were used in constructing a local database. Five sequences belonging to Arabidopsis thaliana phytochrome gene family (i.e., AtphyABCDE) were used to identify R. stricta contigs with phytochrome-like sequences using BLAST. This led to the identification of three contigs with phytochrome-like sequences covering AtphyA-, AtphyC- and AtphyE-like full-length genes. Annotation of the three sequences showed that each contig consists of one phytochrome-like gene with three exons and two introns. BLASTn and BLASTp results indicated that RsphyA mRNA and protein sequences had homologues in Wrightia coccinea and and Solanum tuberosum, respectively. RsphyC-like mRNA and protein sequence were homologous to Vitis vinifera and Vitis riparia. RsphyE-like mRNA coding and protein sequences were homologous to Ipomoea nil. Multiple-sequence alignment of phytochrome proteins indicated a homology with 30 sequences from 23 different species of flowering plants. Phylogenetic analysis confirmed that each R. stricta phytochrome gene is related to the same phytochrome gene of other flowering plants. It is proposed that the absence of phyB gene in R. stricta is due to RsphyA gene taking over the role of phyB. 相似文献
13.
The objective of this study was to obtain a genetically stable haploid in vitro-derived line from Siberian larch ( Larix sibirica Ledeb.) using megagametophyte explants, which then could be used for different molecular genetic studies, including whole genome de novo sequencing. However, cytogenetic analysis and genotyping of 11 microsatellite loci showed high levels of genomic instability and a high frequency of mutation in the obtained megagametophyte-derived callus cultures. All cultures contained new mutations in one or more microsatellite loci. 相似文献
14.
Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarities to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free ( de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also proposed new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy tested using in silico spike-in, clinical and environmental NGS datasets achieved significantly better contigs than current approaches. 相似文献
15.
Novosphingobium sp. strain PP1Y is a marine bacterium specifically adapted to use fuels as an energy source. We sequenced and assembled its entire genome using the Roche 454 genome sequencer system, which led to the identification of two plasmids and one megaplasmid, besides a 3.9-Mb circular chromosome. 相似文献
16.
BackgroundThird generation sequencing methods, like SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read length in comparison to Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo- or re-sequencing projects. Sequences generated for these purposes will not only contain reads originating from the nuclear genome, but also a significant amount of reads originating from the organelles of the target organism. These reads are usually discarded but they can also be used for an assembly of organellar replicons. The long read length supports resolution of repetitive regions and repeats within the organelles genome which might be problematic when just using short read data. Additionally, SMRT sequencing is less influenced by GC rich areas and by long stretches of the same base. ResultsWe describe a workflow for a de novo assembly of the sugar beet ( Beta vulgaris ssp. vulgaris) chloroplast genome sequence only based on data originating from a SMRT sequencing dataset targeted on its nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high quality assembly with a higher reliability than assemblies derived from e.g. Illumina reads only. The chloroplast genome is especially challenging for de novo assembling as it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even though long reads are used for the assembly. ConclusionsSMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high quality de novo assembly of the chloroplast of the sequenced organism. Even with a relatively small overall coverage for the nuclear genome it is possible to collect more than enough reads to generate a high quality assembly that outperforms short read based assemblies. However, even with long reads it is not always possible to clarify the order of elements of a chloroplast genome sequence reliantly which we could demonstrate with Fosmid End Sequences (FES) generated with Sanger technology. Nevertheless, this limitation also applies to short read sequencing data but is reached in this case at a much earlier stage during finishing. Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-015-0726-6) contains supplementary material, which is available to authorized users. 相似文献
17.
We have designed a series of 15 short, helical de novo peptides consisting of lysine, isoleucine, and alanine. We have termed this the KIA series. These peptides differ only in their hydrophobic interface, and thus their self-association is largely a consequence of hydrophobic interactions. One of these peptides, KIA13, forms insoluble helical fibers at specific NaCl concentrations. We have used CD spectroscopy, turbidity assays, and in situ tapping mode atomic force microscopy to characterize the reversible assembly pathway for this peptide. It is unfolded at low NaCl concentration, and forms helical, soluble fibers resembling a coiled-coil conformation at intermediate NaCl concentrations, and rope-like insoluble fibers at high NaCl concentrations. Reducing the NaCl concentration completely reverses this process. Another peptide from the KIA series specifically inhibits the formation of the insoluble KIA13 fibers, and reverses the process to some extent. This work sheds light onto protein fibrillogenesis and offers intriguing possibilities for the use of these types of peptides in drug delivery and biomaterials applications. 相似文献
18.
SUMMARY: Many biological papers describe short, functional DNA sites without specifying their exact positions in the genome. We have developed a Web server that automates the tedious task of locating such sites in eukaryotic genomes, thus giving access to the context of rich annotations that are increasingly available for genome sequences. AVAILABILITY: http://zlab.bu.edu/site2genome/ 相似文献
19.
BackgroundDNA methylation is a crucial epigenomic mechanism in various biological processes. Using whole-genome bisulfite sequencing (WGBS) technology, methylated cytosine sites can be revealed at the single nucleotide level. However, the WGBS data analysis process is usually complicated and challenging. ResultsTo alleviate the associated difficulties, we integrated the WGBS data processing steps and downstream analysis into a two-phase approach. First, we set up the required tools in Galaxy and developed workflows to calculate the methylation level from raw WGBS data and generate a methylation status summary, the mtable. This computation environment is wrapped into the Docker container image DocMethyl, which allows users to rapidly deploy an executable environment without tedious software installation and library dependency problems. Next, the mtable files were uploaded to the web server EpiMOLAS_web to link with the gene annotation databases that enable rapid data retrieval and analyses. ConclusionTo our knowledge, the EpiMOLAS framework, consisting of DocMethyl and EpiMOLAS_web, is the first approach to include containerization technology and a web-based system for WGBS data analysis from raw data processing to downstream analysis. EpiMOLAS will help users cope with their WGBS data and also conduct reproducible analyses of publicly available data, thereby gaining insights into the mechanisms underlying complex biological phenomenon. The Galaxy Docker image DocMethyl is available at https://hub.docker.com/r/lsbnb/docmethyl/. EpiMOLAS_web is publicly accessible at http://symbiosis.iis.sinica.edu.tw/epimolas/. 相似文献
20.
Although heritable factors are an important determinant of risk of early-onset cancer, the majority of these malignancies appear to occur sporadically without identifiable risk factors. Germline de novo copy-number variations (CNVs) have been observed in sporadic neurocognitive and cardiovascular disorders. We explored this mechanism in 382 genomes of 116 early-onset cancer case-parent trios and unaffected siblings. Unique de novo germline CNVs were not observed in 107 breast or colon cancer trios or controls but were indeed found in 7% of 43 testicular germ cell tumor trios; this percentage exceeds background CNV rates and suggests a rare de novo genetic paradigm for susceptibility to some human malignancies. 相似文献
|