首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
We performed whole-genome resequencing of 12 field isolates and eight commonly studied laboratory strains of the model organism Chlamydomonas reinhardtii to characterize genomic diversity and provide a resource for studies of natural variation. Our data support previous observations that Chlamydomonas is among the most diverse eukaryotic species. Nucleotide diversity is ∼3% and is geographically structured in North America with some evidence of admixture among sampling locales. Examination of predicted loss-of-function mutations in field isolates indicates conservation of genes associated with core cellular functions, while genes in large gene families and poorly characterized genes show a greater incidence of major effect mutations. De novo assembly of unmapped reads recovered genes in the field isolates that are absent from the CC-503 assembly. The laboratory reference strains show a genomic pattern of polymorphism consistent with their origin as the recombinant progeny of a diploid zygospore. Large duplications or amplifications are a prominent feature of laboratory strains and appear to have originated under laboratory culture. Extensive natural variation offers a new source of genetic diversity for studies of Chlamydomonas, including naturally occurring alleles that may prove useful in studies of gene function and the dissection of quantitative genetic traits.  相似文献   

2.
Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome''s annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.  相似文献   

3.
Despite its role as a reference organism in the plant sciences, the green alga Chlamydomonas reinhardtii entirely lacks genomic resources from closely related species. We present highly contiguous and well-annotated genome assemblies for three unicellular C. reinhardtii relatives: Chlamydomonas incerta, Chlamydomonas schloesseri, and the more distantly related Edaphochlamys debaryana. The three Chlamydomonas genomes are highly syntenous with similar gene contents, although the 129.2 Mb C. incerta and 130.2 Mb C. schloesseri assemblies are more repeat-rich than the 111.1 Mb C. reinhardtii genome. We identify the major centromeric repeat in C. reinhardtii as a LINE transposable element homologous to Zepp (the centromeric repeat in Coccomyxa subellipsoidea) and infer that centromere locations and structure are likely conserved in C. incerta and C. schloesseri. We report extensive rearrangements, but limited gene turnover, between the minus mating type loci of these Chlamydomonas species. We produce an eight-species core-Reinhardtinia whole-genome alignment, which we use to identify several hundred false positive and missing genes in the C. reinhardtii annotation and >260,000 evolutionarily conserved elements in the C. reinhardtii genome. In summary, these resources will enable comparative genomics analyses for C. reinhardtii, significantly extending the analytical toolkit for this emerging model system.

High-quality genome assemblies and annotations for three of the closest relatives of Chlamydomonas reinhardtii enable comparative genomics analyses.  相似文献   

4.
Accurate and complete genome sequences are essential in biotechnology to facilitate genome‐based cell engineering efforts. The current genome assemblies for Cricetulus griseus, the Chinese hamster, are fragmented and replete with gap sequences and misassemblies, consistent with most short‐read‐based assemblies. Here, we completely resequenced C. griseus using single molecule real time sequencing and merged this with Illumina‐based assemblies. This generated a more contiguous and complete genome assembly than either technology alone, reducing the number of scaffolds by >28‐fold, with 90% of the sequence in the 122 longest scaffolds. Most genes are now found in single scaffolds, including up‐ and downstream regulatory elements, enabling improved study of noncoding regions. With >95% of the gap sequence filled, important Chinese hamster ovary cell mutations have been detected in draft assembly gaps. This new assembly will be an invaluable resource for continued basic and pharmaceutical research.  相似文献   

5.
6.

Background

Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly. We set out to improve the draft genome assembly of the African cichlid fish, Metriaclima zebra, using a set of Pacific Biosciences SMRT sequencing reads corresponding to 16.5× coverage of the genome. Here we characterize the improvements that these long reads allowed us to make to the state-of-the-art draft genome previously assembled from short read data.

Results

Our new assembly closed 68 % of the existing gaps and added 90.6Mbp of new non-gap sequence to the existing draft assembly of M. zebra. Comparison of the new assembly to the sequence of several bacterial artificial chromosome clones confirmed the accuracy of the new assembly. The closure of sequence gaps revealed thousands of new exons, allowing significant improvement in gene models. We corrected one known misassembly, and identified and fixed other likely misassemblies. 63.5 Mbp (70 %) of the new sequence was classified as repetitive and the new sequence allowed for the assembly of many more transposable elements.

Conclusions

Our improvements to the M. zebra draft genome suggest that a reasonable investment in long reads could greatly improve many comparable vertebrate draft genome assemblies.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1930-5) contains supplementary material, which is available to authorized users.  相似文献   

7.
Lee RW  Lemieux C 《Genetics》1986,113(3):589-600
The first two non-Mendelian gene mutations to be identified in Chlamydomonas moewusii are described. These putative chloroplast gene mutations include one for resistance to streptomycin (sr-nM1) and one for resistance to erythromycin (er-nM1). In one- and two-factor reciprocal crosses, usually over 90% of the germinating zygospores transmitted these mutations and their wild-type alternatives from both parents (biparental zygospores); the remaining zygospores transmitted exclusively the non-Mendelian markers of the mating-type "plus" parent. Among the biparental zygospores, a strong bias in the transmission of non-Mendelian alleles from the mating-type "plus" parent was indicated by an excess of meiotic and postmeiotic mitotic progeny that were homoplasmic for non-Mendelian alleles from this parent compared to those that were homoplasmic for the non-Mendelian alleles from the mating-type "minus" parent. At best, weak linkage was detected between the sr-nM1 and er-nM1 loci. Non-Mendelian, chloroplast gene markers in Chlamydomonas eugametos and Chlamydomonas reinhardtii showed a predominantly uniparental mode of transmission from the mating-type "plus" parent in crosses performed under the same conditions used for the C. moewusii crosses.  相似文献   

8.
As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (∼3.5–5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark''s Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.  相似文献   

9.
10.
11.
12.
13.
High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5–18.5 Kbp with an extremely low error rate (0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long-reads, offer a powerful approach to improve de novo assemblies of whole genomes.  相似文献   

14.
Gap closure is a challenging phase in microbial random shotgun genome sequencing projects, particularly since genome assemblies are often complicated by the presence of repeat elements, insertion sequences and other similar factors that contribute to sequence misassemblies. While it is well recognized that the conservation of genetic information between microbial genomes, combined with the exponential increase in available microbial sequences, can be exploited to increase the efficiency of gap closure, we lack the computational tools to aid in this process. We describe here a new tool, MGView, which was developed to create a graphical depiction of the alignment of a set of microbial contigs against a completed microbial genome. The results of our assembly of the Staphylococcus aureus RF122 genome show that MGView enables a considerable reduction in time and economic cost associated with closure. Together, the results also show that the application of MGView not only enables a reduction in fold-coverage requirements of the random shotgun sequence phase, but also provides interesting insights into differences in gene content and organization between finished and unfinished microbial genomes.  相似文献   

15.
The sequence of Ciona savignyi was determined using a whole-genome shotgun strategy, but a high degree of polymorphism resulted in a fractured assembly wherein allelic sequences from the same genomic region assembled separately. We designed a multistep strategy to generate a nonredundant reference sequence from the original assembly by reconstructing and aligning the two 'haplomes' (haploid genomes). In the resultant 174 megabase reference sequence, each locus is represented once, misassemblies are corrected, and contiguity and continuity are dramatically improved.  相似文献   

16.
Sequence assembly of large and repeat-rich plant genomes has been challenging, requiring substantial computational resources and often several complementary sequence assembly and genome mapping approaches. The recent development of fast and accurate long-read sequencing by circular consensus sequencing (CCS) on the PacBio platform may greatly increase the scope of plant pan-genome projects. Here, we compare current long-read sequencing platforms regarding their ability to rapidly generate contiguous sequence assemblies in pan-genome studies of barley (Hordeum vulgare). Most long-read assemblies are clearly superior to the current barley reference sequence based on short-reads. Assemblies derived from accurate long reads excel in most metrics, but the CCS approach was the most cost-effective strategy for assembling tens of barley genomes. A downsampling analysis indicated that 20-fold CCS coverage can yield very good sequence assemblies, while even five-fold CCS data may capture the complete sequence of most genes. We present an updated reference genome assembly for barley with near-complete representation of the repeat-rich intergenic space. Long-read assembly can underpin the construction of accurate and complete sequences of multiple genomes of a species to build pan-genome infrastructures in Triticeae crops and their wild relatives.

A greatly improved reference genome sequence of barley was assembled from accurate long reads.  相似文献   

17.
The unicellular green alga Chlamydomonas reinhardtii is a model organism for various studies in biology. CC-124 is a laboratory strain widely used as a wild type. However, this strain is known to carry agg1 mutation, which causes cells to swim away from the light source (negative phototaxis), in contrast to the cells of other wild-type strains, which swim toward the light source (positive phototaxis). Here we identified the causative gene of agg1 (AGG1) using AFLP-based gene mapping and whole genome next-generation sequencing. This gene encodes a 36-kDa protein containing a Fibronectin type III domain and a CHORD-Sgt1 (CS) domain. The gene product is localized to the cell body and not to flagella or basal body.  相似文献   

18.
We report reference‐quality genome assemblies and annotations for two accessions of soybean (Glycine max) and for one accession of Glycine soja, the closest wild relative of G. max. The G. max assemblies provided are for widely used US cultivars: the northern line Williams 82 (Wm82) and the southern line Lee. The Wm82 assembly improves the prior published assembly, and the Lee and G. soja assemblies are new for these accessions. Comparisons among the three accessions show generally high structural conservation, but nucleotide difference of 1.7 single‐nucleotide polymorphisms (snps) per kb between Wm82 and Lee, and 4.7 snps per kb between these lines and G. soja. snp distributions and comparisons with genotypes of the Lee and Wm82 parents highlight patterns of introgression and haplotype structure. Comparisons against the US germplasm collection show placement of the sequenced accessions relative to global soybean diversity. Analysis of a pan‐gene collection shows generally high conservation, with variation occurring primarily in genomically clustered gene families. We found approximately 40–42 inversions per chromosome between either Lee or Wm82v4 and G. soja, and approximately 32 inversions per chromosome between Wm82 and Lee. We also investigated five domestication loci. For each locus, we found two different alleles with functional differences between G. soja and the two domesticated accessions. The genome assemblies for multiple cultivated accessions and for the closest wild ancestor of soybean provides a valuable set of resources for identifying causal variants that underlie traits for the domestication and improvement of soybean, serving as a basis for future research and crop improvement efforts for this important crop species.  相似文献   

19.
Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS , CD‐HIT , Stacks , Stacks2 , Velvet and VSEARCH ). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.  相似文献   

20.
Novel sequences are DNA sequences present in an individual''s genome but absent in the human reference assembly. They are predicted to be biologically important, both individual and population specific, and consistent with the known human migration paths. Recent works have shown that an average person harbors 2–5 Mb of such sequences and estimated that the human pan-genome contains as high as 19–40 Mb of novel sequences. To identify them in a de novo genome assembly, some existing sequence aligners have been used but no computational method has been specifically proposed for this task. In this work, we developed NSIT (Novel Sequence Identification Tool), a software that can accurately and efficiently identify novel sequences in an individual''s de novo whole genome assembly. We identified and characterized 1.1 Mb, 1.2 Mb, and 1.0 Mb of novel sequences in NA18507 (African), YH (Asian), and NA12878 (European) de novo genome assemblies, respectively. Our results show very high concordance with the previous work using the respective reference assembly. In addition, our results using the latest human reference assembly suggest that the amount of novel sequences per individual may not be as high as previously reported. We additionally developed a graphical viewer for comparisons of novel sequence contents. The viewer also helped in identifying sequence contamination; we found 130 kb of Epstein-Barr virus sequence in the previously published NA18507 novel sequences as well as 287 kb of zebrafish repeats in NA12878 de novo assembly. NSIT requires 2GB of RAM and 1.5–2 hrs on a commodity desktop. The program is applicable to input assemblies with varying contig/scaffold sizes, ranging from 100 bp to as high as 50 Mb. It works in both 32-bit and 64-bit systems and outperforms, by large margins, other fast sequence aligners previously applied to this task. To our knowledge, NSIT is the first software designed specifically for novel sequence identification in a de novo human genome assembly.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号