Annotated genomes can provide new perspectives on the biology of species. We present the first de novo whole genome sequencing for the pink-footed goose. In order to obtain a high-quality de novo assembly the strategy used was to combine one short insert paired-end library with two mate-pair libraries. The pink-footed goose genome was assembled de novo using three different assemblers and an assembly evaluation was subsequently performed in order to choose the best assembler. For our data, ALLPATHS-LG performed the best, since the assembly produced covers most of the genome, while introducing the fewest errors. A total of 26,134 genes were annotated, with bird species accounting for virtually all BLAST hits. We also estimated the substitution rate in the pink-footed goose, which can be of use in future demographic studies, by using a comparative approach with the genome of the chicken, the mallard and the swan goose. A substitution rate of 1.38 × 10⁻⁷ per nucleotide per generation was obtained when comparing the genomes of the two closely-related goose species (the pink-footed and the swan goose). Altogether, we provide a valuable tool for future genomic studies aiming at particular genes and regions of the pink-footed goose genome as well as other bird species.

Current challenges in de novo plant genome sequencing and assembly   Cited by: 1 (self-citations: 0, citations by others: 1)
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.

The completion of the human genome reference (HGR) marked the beginning of the genomics era, yet despite its utility, its universal application is limited by the small number of individuals used in its development. This is highlighted by the presence of high-quality sequence reads that fail to map within the HGR. Sequences failing to map generally represent 2–5 % of total reads, which may harbor regions that would enhance our understanding of population variation, evolution, and disease. Alternatively, complete de novo assemblies can be created, but these effectively ignore the groundwork of the HGR. In an effort to find a middle ground, we developed a bioinformatic pipeline that maps paired-end reads to the HGR as separate single reads, exports unmappable reads, de novo assembles these reads per individual and then combines assemblies into a secondary reference assembly used for comparative analysis. Using 45 diverse 1000 Genomes Project individuals, we identified 351,361 contigs covering 195.5 Mb of sequence unincorporated in GRCh38. 30,879 contigs are represented in multiple individuals with ~40 % showing high sequence complexity. Genomic coordinates were generated for 99.9 %, with 52.5 % exhibiting high-quality mapping scores. Comparative genomic analyses with archaic humans and primates revealed significant sequence alignments, and comparisons with model organism RefSeq gene datasets identified novel human genes. If incorporated, these sequences will expand the HGR, but more importantly our data highlight that with this method low coverage (~10–20×) next-generation sequencing can still be used to identify novel unmapped sequences to explore biological functions contributing to human phenotypic variation, disease and functionality for personal genomic medicine.



Genomics studies are being revolutionized by the next generation sequencing technologies, which have made whole genome sequencing much more accessible to the average researcher. Whole genome sequencing with the new technologies is a developing art that, despite the large volumes of data that can be produced, may still fail to provide a clear and thorough map of a genome. The Plantagora project was conceived to address specifically the gap between having the technical tools for genome sequencing and knowing precisely the best way to use them.

Methodology/Principal Findings

For Plantagora, a platform was created for generating simulated reads from several different plant genomes of different sizes. The resulting read files mimicked either 454 or Illumina reads, with varying paired end spacing. Thousands of datasets of reads were created, most derived from our primary model genome, rice chromosome one. All reads were assembled with different software assemblers, including Newbler, Abyss, and SOAPdenovo, and the resulting assemblies were evaluated by an extensive battery of metrics chosen for these studies. The metrics included both statistics of the assembly sequences and fidelity-related measures derived by alignment of the assemblies to the original genome source for the reads. The results were presented in a website, which includes a data graphing tool, all created to help the user compare rapidly the feasibility and effectiveness of different sequencing and assembly strategies prior to testing an approach in the lab. Some of our own conclusions regarding the different strategies were also recorded on the website.


Plantagora provides a substantial body of information for comparing different approaches to sequencing a plant genome, and some conclusions regarding some of the specific approaches. Plantagora also provides a platform of metrics and tools for studying the process of sequencing and assembly further.

We describe a new assembly algorithm, where a genome assembly with low sequence coverage, either throughout the genome or locally, due to cloning bias, is considerably improved through an assisting process via a related genome. We show that the information provided by aligning the whole-genome shotgun reads of the target against a reference genome can be used to substantially improve the quality of the resulting assembly.

Genome assembly has benefited from long-read sequencing technologies with higher accuracy and higher continuity. However, most human genome assemblies require large amounts of DNA from homogeneous cell lines, without preserving cell heterogeneity, since cell heterogeneity could profoundly affect haplotype assembly results. Herein, using single-cell genome long-read sequencing technology (SMOOTH-seq), we have sequenced K562 and HG002 cells on PacBio HiFi and Oxford Nanopore Technologies (ONT) platforms and conducted de novo genome assembly. For the first time, we have completed the human genome assembly with high continuity (with NG50 of ∼2 Mb using 95 individual K562 cells) at the single-cell level, and explored the impact of different assemblers and sequencing strategies on genome assembly. With sequencing data from 30 diploid individual HG002 cells of relatively high genome coverage (average coverage ∼41.7%) on the ONT platform, the NG50 can reach over 1.3 Mb. Furthermore, with the assembled genome from the K562 single-cell dataset, a more complete and accurate set of insertion events and complex structural variations could be identified. This study opened a new chapter on the practice of single-cell genome de novo assembly.

We offer a guide to de novo genome assembly using sequence data generated by the Illumina platform for biologists working with fungi or other organisms whose genomes are less than 100 Mb in size. The guide requires no familiarity with sequencing assembly technology or associated computer programs. It defines commonly used terms in genome sequencing and assembly; provides examples of assembling short-read genome sequence data for four strains of the fungus Grosmannia clavigera using four assembly programs; gives examples of protocols and software; and presents a commented flowchart that extends from DNA preparation for submission to a sequencing center, through to processing and assembly of the raw sequence reads using freely available operating systems and software.

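Structural variation in the human genome   Cited by: 2 (self-citations: 0, citations by others: 2)
He Yongshu  Zhang Wen  Yang Zhaoqing 《Hereditas (Beijing)》2009,31(8):771-778
Genomic structural variation generally refers to deletions, insertions, duplications, inversions and translocations of DNA segments larger than 1 kb within the genome, as well as DNA copy-number variations (CNVs). Structural variation in the human genome involves thousands of discontinuous genomic regions spanning millions of DNA base pairs. These regions may contain several genes and regulatory sequences, so that multiple gene functions are lost or altered, leading to phenotypic changes, altered disease susceptibility, or disease. Studying genomic structural variation helps to analyze genetic variation across the genome comprehensively from a dynamic perspective so as to obtain an integrated genotype, and to understand the potential medical role of structural variation and the complexity of the organism's overall function. This article reviews recent progress in research on human genome structural variation, covering its types, research methods, and effects on individual phenotype, disease, and biological evolution.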

Structural variation in the human genome   Cited by: 11 (self-citations: 0, citations by others: 11)
The first wave of information from the analysis of the human genome revealed SNPs to be the main source of genetic and phenotypic human variation. However, the advent of genome-scanning technologies has now uncovered an unexpectedly large extent of what we term 'structural variation' in the human genome. This comprises microscopic and, more commonly, submicroscopic variants, which include deletions, duplications and large-scale copy-number variants - collectively termed copy-number variants or copy-number polymorphisms - as well as insertions, inversions and translocations. Rapidly accumulating evidence indicates that structural variants can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.

We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (deBruijn) graph of oligonucleotides with unique high quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ~280 bp or ~3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
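The idea of walking only through k-mers that have a unique, well-supported extension can be illustrated with a small sketch. This is not the meraculous implementation: it ignores base qualities, reverse complements and backward uniqueness, and the names `unique_extensions`, `extend` and `min_count` are hypothetical choices made for illustration only.

```python
from collections import Counter, defaultdict


def unique_extensions(reads, k, min_count=2):
    """Map each k-mer to its next base when exactly one base
    follows it at least `min_count` times (assumed support threshold)."""
    follow = defaultdict(Counter)
    for read in reads:
        for i in range(len(read) - k):
            follow[read[i:i + k]][read[i + k]] += 1
    unique = {}
    for kmer, bases in follow.items():
        supported = [b for b, c in bases.items() if c >= min_count]
        if len(supported) == 1:
            unique[kmer] = supported[0]
    return unique


def extend(seed, unique, k):
    """Walk forward from a seed k-mer while the extension is unique."""
    contig = seed
    seen = {seed}
    while True:
        kmer = contig[-k:]
        base = unique.get(kmer)
        if base is None:
            break  # no extension, or an ambiguous one
        nxt = kmer[1:] + base
        if nxt in seen:
            break  # stop rather than loop around a cycle
        seen.add(nxt)
        contig += base
    return contig


reads = ["ACGTTG"] * 2 + ["GTTGCA"] * 2 + ["ACGA"]  # last read is a low-support variant
print(extend("ACG", unique_extensions(reads, k=3), k=3))  # ACGTTGCA
```

Because the single `ACGA` read falls below the support threshold, it is ignored without any separate error-correction pass. Two copies of it would make the extension of `ACG` ambiguous and the walk would stop there.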

Whole genome amplification by the multiple displacement amplification (MDA) method allows sequencing of DNA from single cells of bacteria that cannot be cultured. Assembling a genome is challenging, however, because MDA generates highly nonuniform coverage of the genome. Here we describe an algorithm tailored for short-read data from single cells that improves assembly through the use of a progressively increasing coverage cutoff. Assembly of reads from single Escherichia coli and Staphylococcus aureus cells captures >91% of genes within contigs, approaching the 95% captured from an assembly based on many E. coli cells. We apply this method to assemble a genome from a single cell of an uncultivated SAR324 clade of Deltaproteobacteria, a cosmopolitan bacterial lineage in the global ocean. Metabolic reconstruction suggests that SAR324 is aerobic, motile and chemotaxic. Our approach enables acquisition of genome assemblies for individual uncultivated bacteria using only short reads, providing cell-specific genetic information absent from metagenomic studies.

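Wang Hao  Zhang Rui  Zhang Jiao  Shen Hui  Dai Xiling  Yan Yuehong 《Biodiversity Science》2019,27(11):1221-29
Whole-genome duplication (WGD) is widespread in animals and plants and is regarded as one of the major forces driving species evolution. As the sole species of a monotypic fern family, Didymochlaena truncatula is a basal lineage of eupolypods I and occupies a unique evolutionary position among ferns. Based on high-throughput sequencing, this study used synonymous substitution rate (Ks) analysis and relative dating analysis to reveal the occurrence of WGD in D. truncatula. The Ks analysis indicated that D. truncatula underwent at least two WGD events, one 59-62 million years ago (Mya) and the other 90-94 Mya. These coincide with the Cretaceous-Tertiary (C-T) mass extinction and with the divergence time of D. truncatula, respectively. Functional annotation and enrichment analysis of the genes retained from the two WGD events showed that genes related to transcription and metabolic regulation were preferentially retained. The WGD events in D. truncatula may have promoted the divergence of this species and its adaptation to extreme environments.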



Sampling genomes with Fosmid vectors and sequencing of pooled Fosmid libraries on the Illumina platform for massive parallel sequencing is a novel and promising approach to optimizing the trade-off between sequencing costs and assembly quality.


In order to sequence the genome of Norway spruce, which is of great size and complexity, we developed and applied a new technology based on the massive production, sequencing, and assembly of Fosmid pools (FP). The spruce chromosomes were sampled with ~40,000 bp Fosmid inserts to obtain around two-fold genome coverage, in parallel with traditional whole genome shotgun sequencing (WGS) of haploid and diploid genomes. Compared to the WGS results, the contiguity and quality of the FP assemblies were high, and they allowed us to fill WGS gaps resulting from repeats, low coverage, and allelic differences. The FP contig sets were further merged with WGS data using a novel software package GAM-NGS.


By exploiting FP technology, the first published assembly of a conifer genome was sequenced entirely with massively parallel sequencing. Here we provide a comprehensive report on the different features of the approach and the optimization of the process. We have made public the input data (FASTQ format) for the set of pools used in this study: ftp://congenie.org/congenie/Nystedt_2013/Assembly/ProcessedData/FosmidPools/ (alternatively accessible via http://congenie.org/downloads). The software used for running the assembly process is available at http://research.scilifelab.se/andrej_alexeyenko/downloads/fpools/.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-439) contains supplementary material, which is available to authorized users.

Revealing genomic variation of representative and diverse germplasm is the cornerstone of deploying genomics information into genetic improvement programs of species of agricultural importance. Here we report the re-sequencing of 239 japonica rice elites representing the genetic diversity of japonica germplasm in China, Japan and Korea. A total of 4.8 million SNPs and PAV of 35,634 genes were identified. The elites from Japan and Korea are closely related and relatively less diverse than those from China. A japonica rice pan-genome was constructed, and 35 Mb non-redundant novel sequences were identified, from which 1131 novel genes were predicted. Strong selection signals of genomic regions were detected on most of the chromosomes. The heading date genes Hd1 and Hd3a have been artificially selected during the breeding process. The results from this study lay the foundation for future whole genome sequences-enabled breeding in rice and provide a paradigm for other species.

Narzisi G  Mishra B 《PloS one》2011,6(4):e19175
Recent advances in DNA sequencing technology and their focal role in Genome Wide Association Studies (GWAS) have rekindled a growing interest in the whole-genome sequence assembly (WGSA) problem, thereby, inundating the field with a plethora of new formalizations, algorithms, heuristics and implementations. And yet, scant attention has been paid to comparative assessments of these assemblers' quality and accuracy. No commonly accepted and standardized method for comparison exists yet. Even worse, widely used metrics to compare the assembled sequences emphasize only size, poorly capturing the contig quality and accuracy. This paper addresses these concerns: it highlights common anomalies in assembly accuracy through a rigorous study of several assemblers, compared under both standard metrics (N50, coverage, contig sizes, etc.) as well as a more comprehensive metric (Feature-Response Curves, FRC) that is introduced here; FRC transparently captures the trade-offs between contigs' quality against their sizes. For this purpose, most of the publicly available major sequence assemblers--both for low-coverage long (Sanger) and high-coverage short (Illumina) reads technologies--are compared. These assemblers are applied to microbial (Escherichia coli, Brucella, Wolbachia, Staphylococcus, Helicobacter) and partial human genome sequences (Chr. Y), using sequence reads of various read-lengths, coverages, accuracies, and with and without mate-pairs. It is hoped that, based on these evaluations, computational biologists will identify innovative sequence assembly paradigms, bioinformaticists will determine promising approaches for developing "next-generation" assemblers, and biotechnologists will formulate more meaningful design desiderata for sequencing technology platforms. A new software tool for computing the FRC metric has been developed and is available through the AMOS open-source consortium.
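The size-only nature of N50 is easy to see once the metric is written out. The following is a minimal sketch of the usual definition: the length of the contig at which the running total of lengths, taken longest first, reaches half the assembly size. Passing a genome size instead gives NG50. The function name `n50` and the toy lengths are illustrative assumptions, not taken from any of the tools named above.

```python
def n50(contig_lengths, genome_size=None):
    """Return N50, or NG50 when an (estimated) genome_size is supplied."""
    total = genome_size if genome_size is not None else sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # contigs cover less than half of genome_size


lengths = [100, 80, 50, 30, 20, 10]
print(n50(lengths))                   # 80
print(n50(lengths, genome_size=400))  # 50
```

Nothing in this calculation looks at whether a contig is correct, so a misjoined contig raises the score just as a correct long contig does.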

MOTIVATION: DNA repeats are a common feature of most genomic sequences. Their de novo identification is still difficult despite being a crucial step in genomic analysis and oligonucleotides design. Several efficient algorithms based on word counting are available, but too short words decrease specificity while long words decrease sensitivity, particularly in degenerated repeats. RESULTS: The Repeat Analysis Program (RAP) is based on a new word-counting algorithm optimized for high resolution repeat identification using gapped words. Many different overlapping gapped words can be counted at the same genomic position, thus producing a better signal than the single ungapped word. This results in better specificity both in terms of low-frequency detection, being able to identify sequences repeated only once, and highly divergent detection, producing a generally high score in most intron sequences. AVAILABILITY: The program is freely available for non-profit organizations, upon request to the authors. CONTACT: giorgio.valle@unipd.it SUPPLEMENTARY INFORMATION: The program has been tested on the Caenorhabditis elegans genome using word lengths of 12, 14 and 16 bases. The full analysis has been implemented in the UCSC Genome Browser and is accessible at http://genome.cribi.unipd.it.
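A toy version of gapped-word counting may help show why gaps tolerate divergence. The sketch below is not RAP itself: the mask notation (`1` for a counted position, `0` for a gap), the scoring rule and the function names are all assumptions made for illustration.

```python
from collections import Counter


def gapped_word(seq, start, mask):
    """Word starting at `start`, keeping only positions where mask is '1'."""
    window = seq[start:start + len(mask)]
    return "".join(base for base, m in zip(window, mask) if m == "1")


def repeat_scores(seq, masks):
    """Per-position score: for each mask, the number of other
    positions in `seq` that share the same gapped word."""
    scores = [0] * len(seq)
    for mask in masks:
        n_windows = len(seq) - len(mask) + 1
        words = [gapped_word(seq, i, mask) for i in range(n_windows)]
        counts = Counter(words)
        for i, word in enumerate(words):
            scores[i] += counts[word] - 1
    return scores


seq = "ACGTTTACCT"  # ACGT and ACCT differ at one base
print(repeat_scores(seq, ["1111"]))  # ungapped word: no repeat found
print(repeat_scores(seq, ["1101"]))  # gap over the mismatch: positions 0 and 6 score
```

Summing over several masks at the same position, as `repeat_scores` does when given more than one, is the sense in which multiple overlapping gapped words can reinforce one another.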

Zhang W  Chen J  Yang Y  Tang Y  Shang J  Shen B 《PloS one》2011,6(3):e17915
The advent of next-generation sequencing technologies has been accompanied by the development of many whole-genome sequence assembly methods and software, especially for de novo fragment assembly. Because little is known about the applicability and performance of these software tools, choosing a suitable assembler is a difficult task. Here, we describe the adaptivity of each program and, above all, compare the performance of eight distinct tools on eight groups of simulated datasets from the Solexa sequencing platform. Considering the computational time, maximum random access memory (RAM) occupancy, assembly accuracy and integrity, our study indicates that string-based assemblers and overlap-layout-consensus (OLC) assemblers are well-suited for very short reads and for longer reads of small genomes, respectively. For large datasets of more than a hundred million short reads, De Bruijn graph-based assemblers would be more appropriate. In terms of software implementation, string-based assemblers are superior to graph-based ones, of which SOAPdenovo is complex because of the need to create a configuration file. Our comparison study will assist researchers in selecting a well-suited assembler and offer essential information for the improvement of existing assemblers or the development of novel assemblers.

Complete hydatidiform moles (CHMs) are diploid tumors that result from fertilization of an empty ovum by a haploid 23,X sperm. In most cases, the resulting duplication of the genome gives rise to a 46,XX genotype and is thought to be androgenetic in origin. If this hypothesis is correct, then the genotypes of all polymorphic markers in CHMs should be homozygous. We used a dense set of single-nucleotide polymorphism (SNP) markers, evenly spaced throughout the genome, to definitively test this hypothesis. We genotyped genomic DNA samples from five CHMs and their corresponding maternal samples with 1494 SNP markers using high-density microarrays (HuSNP). As predicted, the maternal samples were heterozygous at >25% of the markers, which is consistent with the expected average heterozygosity of this panel of SNPs. In contrast, the five CHM samples were heterozygous at <0.75% of the SNP markers, which shows that these diploid tumors consist of a duplicated set of chromosomes. Because the CHM genotypes represent the haplotypes of their genomes, our results show that long-range haplotypes can be obtained easily with this resource and that a collection of such samples is a simple way to obtain reference haplotypes for association studies in various populations.

Phytochrome-like genes in the wild plant species Rhazya stricta Decne were characterized using a de novo genome assembly of next generation sequence data. Rhazya stricta contains more than 100 alkaloids with multiple pharmacological properties, and leaf extracts have been used to cure chronic rheumatism, to treat tumors, and in the treatment of several other diseases. Phytochromes are known to be involved in the light-regulated biosynthesis of some alkaloids. Phytochromes are soluble chromoproteins that function in the absorption of red and far-red light and the transduction of intracellular signals during light-regulated plant development. De novo assembly of the nuclear genome of R. stricta recovered 45,641 contigs greater than 1000 bp long, which were used in constructing a local database. Five sequences belonging to the Arabidopsis thaliana phytochrome gene family (i.e., AtphyABCDE) were used to identify R. stricta contigs with phytochrome-like sequences using BLAST. This led to the identification of three contigs with phytochrome-like sequences covering AtphyA-, AtphyC- and AtphyE-like full-length genes. Annotation of the three sequences showed that each contig consists of one phytochrome-like gene with three exons and two introns. BLASTn and BLASTp results indicated that RsphyA mRNA and protein sequences had homologues in Wrightia coccinea and Solanum tuberosum, respectively. RsphyC-like mRNA and protein sequences were homologous to Vitis vinifera and Vitis riparia. RsphyE-like mRNA coding and protein sequences were homologous to Ipomoea nil. Multiple-sequence alignment of phytochrome proteins indicated a homology with 30 sequences from 23 different species of flowering plants. Phylogenetic analysis confirmed that each R. stricta phytochrome gene is related to the same phytochrome gene of other flowering plants. It is proposed that the absence of the phyB gene in R. stricta is due to the RsphyA gene taking over the role of phyB.
