Similar articles
 20 similar articles found (search time: 771 ms)
1.

Background

Characterizing large genomic variants is essential to expanding the research and clinical applications of genome sequencing. While multiple data types and methods are available to detect these structural variants (SVs), they remain less characterized than smaller variants because of SV diversity, complexity, and size. These challenges are exacerbated by the experimental and computational demands of SV analysis. Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods.

Results

We demonstrate Parliament’s efficacy via integrated analyses of whole-genome array comparative genomic hybridization, short-read next-generation sequencing, long-read (Pacific Biosciences RSII), long-insert (Illumina Nextera), and whole-genome architecture (BioNano Irys) data from the personal genome of a single subject (HS1011). From this genome, Parliament identified 31,007 genomic loci between 100 bp and 1 Mbp that are inconsistent with the hg19 reference assembly. Of these loci, 9,777 are supported as putative SVs by hybrid local assembly, long-read PacBio data, or multi-source heuristics. These SVs span 59 Mbp of the reference genome (1.8%) and include 3,801 events identified only with long-read data. The HS1011 data and complete Parliament infrastructure, including a BAM-to-SV workflow, are available on the cloud-based service DNAnexus.

Conclusions

HS1011 SV analysis reveals the limits and advantages of multiple sequencing technologies, specifically the impact of long-read SV discovery. With the full Parliament infrastructure, the HS1011 data constitute a public resource for novel SV discovery, software calibration, and personal genome structural variation analysis.
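The consensus idea described in this abstract can be illustrated with a minimal sketch. It is not Parliament's actual algorithm: the `SVCall` record, the source labels, and the 50% reciprocal-overlap threshold are assumptions made here for illustration. Only the 100 bp to 1 Mbp size window comes from the abstract.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SVCall:
    chrom: str
    start: int    # 0-based, inclusive
    end: int      # exclusive
    svtype: str   # e.g. "DEL" or "INS" (hypothetical labels)
    source: str   # hypothetical name of the caller or data type

MIN_SIZE = 100          # lower size bound quoted in the abstract (100 bp)
MAX_SIZE = 1_000_000    # upper size bound quoted in the abstract (1 Mbp)

def reciprocal_overlap(a: SVCall, b: SVCall) -> float:
    """Smaller of the two fractions of each call covered by the other."""
    if a.chrom != b.chrom:
        return 0.0
    shared = min(a.end, b.end) - max(a.start, b.start)
    if shared <= 0:
        return 0.0
    return min(shared / (a.end - a.start), shared / (b.end - b.start))

def consensus_loci(calls, min_overlap=0.5):
    """Greedily cluster size-filtered calls of the same type.

    A call joins the current cluster when it reciprocally overlaps the
    cluster's first member by at least min_overlap (an assumed threshold).
    Returns (chrom, start, end, svtype, supporting_sources) tuples.
    """
    kept = [c for c in calls if MIN_SIZE <= c.end - c.start <= MAX_SIZE]
    kept.sort(key=lambda c: (c.chrom, c.svtype, c.start))
    clusters = []
    for call in kept:
        last = clusters[-1] if clusters else None
        if (last is not None
                and last[0].svtype == call.svtype
                and reciprocal_overlap(last[0], call) >= min_overlap):
            last.append(call)
        else:
            clusters.append([call])
    return [
        (cl[0].chrom, cl[0].start, cl[0].end, cl[0].svtype,
         sorted({c.source for c in cl}))
        for cl in clusters
    ]
```

Each returned locus carries the list of sources that support it, so a downstream step could keep only loci with long-read support or with support from more than one source.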

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1479-3) contains supplementary material, which is available to authorized users.

2.
Camelids are characterized by their unique adaptive immune system that exhibits the generation of homodimeric heavy‐chain immunoglobulins, somatic hypermutation of T‐cell receptors, and low genetic diversity of major histocompatibility complex (MHC) genes. However, short‐read assemblies are typically highly fragmented in these gene loci owing to their repetitive and polymorphic nature. Here, we constructed a chromosome‐level assembly of the wild Bactrian camel genome based on high‐coverage long‐read sequencing and chromatin interaction mapping. The assembly, with a contig N50 of 5.37 Mb and a scaffold N50 of 76.03 Mb, represents the most contiguous camelid genome to date. The genomic organization of the immunoglobulin heavy‐chain locus was similar between the wild Bactrian camel and alpaca, and genes encoding conventional and heavy‐chain antibodies were intermixed. The organizations of two immunoglobulin light‐chain loci and four T cell receptor loci were also fully deciphered using the new assembly. Additionally, the complete classical MHC region was resolved into a single contig. The high‐quality assembly presented here provides an essential reference for future investigations examining the camelid immune system.
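For readers unfamiliar with the contiguity metric quoted above: N50 is the length L such that contigs (or scaffolds) of length L or longer together contain at least half of the assembled bases. The function below is a minimal sketch of that definition; the function name and the toy lengths in the usage note are illustrative assumptions, not data from the study.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths."""
    if not lengths:
        raise ValueError("need at least one length")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half the bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

With toy lengths `[2, 3, 4, 5, 6]` the total is 20, and the two longest sequences (6 + 5 = 11) already cover half of it, so the N50 is 5.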

3.
Pathogens are recognized as major drivers of local adaptation in wildlife systems. By determining which gene variants are favored in local interactions among populations with and without disease, spatially explicit adaptive responses to pathogens can be elucidated. Much of our current understanding of host responses to disease comes from a small number of genes associated with an immune response. High‐throughput sequencing (HTS) technologies, such as genotype‐by‐sequencing (GBS), facilitate expanded explorations of genomic variation among populations. Hybridization‐based GBS techniques can be leveraged in systems not well characterized for specific variants associated with disease outcome to “capture” specific genes and regulatory regions known to influence expression and disease outcome. We developed a multiplexed, sequence capture assay for red foxes to simultaneously assess ~300‐kbp of genomic sequence from 116 adaptive, intrinsic, and innate immunity genes of predicted adaptive significance and their putative upstream regulatory regions along with 23 neutral microsatellite regions to control for demographic effects. The assay was applied to 45 fox DNA samples from Alaska, where three arctic rabies strains are geographically restricted and endemic to coastal tundra regions, yet absent from the boreal interior. The assay provided 61.5% on‐target enrichment with relatively even sequence coverage across all targeted loci and samples (mean = 50×), which allowed us to elucidate genetic variation across introns, exons, and potential regulatory regions (4,819 SNPs). Challenges remained in accurately describing microsatellite variation using this technique; however, longer‐read HTS technologies should overcome these issues. We used these data to conduct preliminary analyses and detected genetic structure in a subset of red fox immune‐related genes between regions with and without endemic arctic rabies. This assay provides a template to assess immunogenetic variation in wildlife disease systems.

4.
The application of next-generation sequencing to estimate genetic diversity of Plasmodium falciparum, the most lethal malaria parasite, has proved challenging due to the skewed AT-richness [∼80.6% (A + T)] of its genome and the lack of technology to assemble highly polymorphic subtelomeric regions that contain clonally variant, multigene virulence families (e.g., var and rifin). To address this, we performed amplification-free, single molecule, real-time sequencing of P. falciparum genomic DNA and generated reads of average length 12 kb, with 50% of the reads between 15.5 and 50 kb in length. Next, using the Hierarchical Genome Assembly Process, we assembled the P. falciparum genome de novo and successfully compiled all 14 nuclear chromosomes telomere-to-telomere. We also accurately resolved centromeres [∼90–99% (A + T)] and subtelomeric regions and identified large insertions and duplications that add extra var and rifin genes to the genome, along with smaller structural variants such as homopolymer tract expansions. Overall, we show that amplification-free, long-read sequencing combined with de novo assembly overcomes major challenges inherent to studying the P. falciparum genome. Indeed, this technology may not only identify the polymorphic and repetitive subtelomeric sequences of parasite populations from endemic areas but may also evaluate structural variation linked to virulence, drug resistance and disease transmission.

5.
Despite remarkable progress in DNA sequencing technologies, there remains a trade-off between short-read platforms, which have limited ability to sequence homopolymers, repeated motifs or long-range structural variation, and long-read platforms, which tend to have lower accuracy and/or throughput. Moreover, current methods do not allow direct readout of epigenetic modifications from a single read. With the aim of addressing these limitations, we have developed an optical electrowetting sequencing platform that uses step-wise nucleotide triphosphate (dNTP) release, capture and detection in microdroplets from single DNA molecules. Each microdroplet serves as a reaction vessel that identifies an individual dNTP based on a robust fluorescence signal, with the detection chemistry extended to enable detection of 5-methylcytosine. Our platform uses small reagent volumes and inexpensive equipment, paving the way to cost-effective single-molecule DNA sequencing, capable of handling widely varying GC-bias, and demonstrating direct detection of epigenetic modifications.

6.

Background

Several genomes have now been sequenced, with millions of genetic variants annotated. While significant progress has been made in mapping single nucleotide polymorphisms (SNPs) and small (<10 bp) insertion/deletions (indels), the annotation of larger structural variants has been less comprehensive. It is still unclear to what extent a typical genome differs from the reference assembly, and analyses of the genomes sequenced to date have shown varying results for copy number variation (CNV) and inversions.

Results

We have combined computational re-analysis of existing whole genome sequence data with novel microarray-based analysis, and detect 12,178 structural variants covering 40.6 Mb that were not reported in the initial sequencing of the first published personal genome. We estimate a total non-SNP variation content of 48.8 Mb in a single genome. Our results indicate that this genome differs from the consensus reference sequence by approximately 1.2% when considering indels/CNVs, 0.1% by SNPs and approximately 0.3% by inversions. The structural variants impact 4,867 genes, and >24% of structural variants would not be imputed by SNP-association.

Conclusions

Our results indicate that a large number of structural variants have been unreported in the individual genomes published to date. The significant extent and complexity of structural variants, as well as the growing recognition of their medical relevance, necessitate that they be actively studied in health-related analyses of personal genomes. The new catalogue of structural variants generated for this genome provides a crucial resource for future comparison studies.

7.
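Structural variation in the human genome   Total citations: 2 (self-citations: 0; citations by others: 2)
He Yongshu (何永蜀), Zhang Wen (张闻), Yang Zhaoqing (杨照青). 《遗传》 (Hereditas), 2009, 31(8): 771-778
Genomic structural variation generally refers to deletions, insertions, duplications, inversions, and translocations of DNA segments larger than 1 kb within the genome, as well as DNA copy number variations (CNVs). Structural variation in the human genome involves thousands of discontinuous genomic regions comprising millions of DNA base pairs. These regions can contain several genes and regulatory sequences, so the functions of many genes may be lost or altered, leading to phenotypic changes, altered disease susceptibility, or disease. Studying genomic structural variation helps to analyze genomic variation comprehensively from a dynamic perspective, to obtain an integrated genotype, and to understand the potential medical role of structural variation and the complexity of whole-organism function. This article reviews recent research progress on structural variation in the human genome, covering its types, the methods used to study it, and its effects on individual phenotype, disease, and evolution.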

8.
The genetic basis of general plant vigor is of major interest to food producers, yet the trait is recalcitrant to genetic mapping because of the number of loci involved, their small effects, and linkage. Observations of heterosis in many crops suggest that recessive, malfunctioning versions of genes are a major cause of poor performance, yet we have little information on the mutational spectrum underlying these disruptions. To address this question, we generated a long-read assembly of a tropical japonica rice (Oryza sativa) variety, Carolina Gold, which allowed us to identify structural mutations (>50 bp) and orient them with respect to their ancestral state using the outgroup, Oryza glaberrima. Supporting prior work, we find substantial genome expansion in the sativa branch. While transposable elements (TEs) account for the largest share of size variation, the majority of events are not directly TE-mediated. Tandem duplications are the most common source of insertions and are highly enriched among 50-200 bp mutations. To explore the relative impact of various mutational classes on crop fitness, we then track these structural events over the last century of US rice improvement using 101 resequenced varieties. Within this material, a pattern of temporary hybridization between medium and long-grain varieties was followed by recent divergence. During this long-term selection, structural mutations that impact gene exons have been removed at a greater rate than intronic indels and single-nucleotide mutations. These results support the use of ab initio estimates of mutational burden, based on structural data, as an orthogonal predictor in genomic selection.

9.
10.
Sequence assembly of large and repeat-rich plant genomes has been challenging, requiring substantial computational resources and often several complementary sequence assembly and genome mapping approaches. The recent development of fast and accurate long-read sequencing by circular consensus sequencing (CCS) on the PacBio platform may greatly increase the scope of plant pan-genome projects. Here, we compare current long-read sequencing platforms regarding their ability to rapidly generate contiguous sequence assemblies in pan-genome studies of barley (Hordeum vulgare). Most long-read assemblies are clearly superior to the current barley reference sequence based on short-reads. Assemblies derived from accurate long reads excel in most metrics, but the CCS approach was the most cost-effective strategy for assembling tens of barley genomes. A downsampling analysis indicated that 20-fold CCS coverage can yield very good sequence assemblies, while even five-fold CCS data may capture the complete sequence of most genes. We present an updated reference genome assembly for barley with near-complete representation of the repeat-rich intergenic space. Long-read assembly can underpin the construction of accurate and complete sequences of multiple genomes of a species to build pan-genome infrastructures in Triticeae crops and their wild relatives.

A greatly improved reference genome sequence of barley was assembled from accurate long reads.
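The downsampling analysis mentioned in this abstract can be pictured with a small sketch: reads are drawn at random until their combined length reaches a target fold-coverage of the genome. This is only an illustration of the general idea, not the authors' procedure; the function name, the seed, and the genome size passed in are assumptions.

```python
import random

def downsample_reads(read_lengths, genome_size, target_coverage, seed=0):
    """Randomly pick read indices until target_coverage x genome_size bases are reached.

    Returns the chosen indices and the fold-coverage actually achieved.
    If the reads cannot reach the target, all reads are returned.
    """
    target_bases = target_coverage * genome_size
    order = list(range(len(read_lengths)))
    random.Random(seed).shuffle(order)  # fixed seed keeps the subsample reproducible
    chosen, bases = [], 0
    for i in order:
        if bases >= target_bases:
            break
        chosen.append(i)
        bases += read_lengths[i]
    return chosen, bases / genome_size
```

For example, 100 reads of 10 bases over a hypothetical 50-base genome give 20-fold coverage; asking for five-fold coverage selects 25 of them.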

11.
Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1-50 kb) including insertions, deletions, inversions and their precise breakpoints, and in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1-23 kb. Validation using computational and experimental methods suggests that we achieve overall <6% false-positive rate and <10% false-negative rate in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.

12.
Population-scale genome sequencing allows the characterization of functional effects of a broad spectrum of genetic variants underlying human phenotypic variation. Here, we investigate the influence of rare and common genetic variants on gene expression patterns, using variants identified from sequencing data from the 1000 Genomes Project in an African and European population sample and gene expression data from lymphoblastoid cell lines. We detect comparable numbers of expression quantitative trait loci (eQTLs) when compared to genotypes obtained from HapMap 3, but as many as 80% of the top expression quantitative trait variants (eQTVs) discovered from 1000 Genomes data are novel. The properties of the newly discovered variants suggest that mapping common causal regulatory variants is challenging even with full resequencing data; however, we observe significant enrichment of regulatory effects in splice-site and nonsense variants. Using RNA sequencing data, we show that 46.2% of nonsynonymous variants are differentially expressed in at least one individual in our sample, creating widespread potential for interactions between functional protein-coding and regulatory variants. We also use allele-specific expression to identify putative rare causal regulatory variants. Furthermore, we demonstrate that outlier expression values can be due to rare variant effects, and we approximate the number of such effects harboured in an individual by effect size. Our results demonstrate that integration of genomic and RNA sequencing analyses allows for the joint assessment of genome sequence and genome function.

13.
While individual non-B DNA structures have been shown to impact gene expression, their broad regulatory role remains elusive. We utilized genomic variants and expression quantitative trait loci (eQTL) data to analyze genome-wide variation propensities of potential non-B DNA regions and their relation to gene expression. Independent of genomic location, these regions were enriched in nucleotide variants. Our results are consistent with previously observed mutagenic properties of these regions and counter a previous study concluding that G-quadruplex regions have a reduced frequency of variants. While such mutagenicity might undermine functionality of these elements, we identified in potential non-B DNA regions a signature of negative selection. Yet, we found a depletion of eQTL-associated variants in potential non-B DNA regions, opposite to what might be expected from their proposed regulatory role. However, we also observed that genes downstream of potential non-B DNA regions showed higher expression variation between individuals. This coupling between mutagenicity and tolerance for expression variability of downstream genes may be a result of evolutionary adaptation, which allows reconciling mutagenicity of non-B DNA structures with their location in functionally important regions and their potential regulatory role.

14.
Genome-wide association studies (GWAS) have identified >500 common variants associated with quantitative metabolic traits, but in aggregate such variants explain at most 20–30% of the heritable component of population variation in these traits. To further investigate the impact of genotypic variation on metabolic traits, we conducted re-sequencing studies in >6,000 members of a Finnish population cohort (The Northern Finland Birth Cohort of 1966 [NFBC]) and a type 2 diabetes case-control sample (The Finland-United States Investigation of NIDDM Genetics [FUSION] study). By sequencing the coding sequence and 5′ and 3′ untranslated regions of 78 genes at 17 GWAS loci associated with one or more of six metabolic traits (serum levels of fasting HDL-C, LDL-C, total cholesterol, triglycerides, plasma glucose, and insulin), and conducting both single-variant and gene-level association tests, we obtained a more complete understanding of phenotype-genotype associations at eight of these loci. At all eight of these loci, the identification of new associations provides significant evidence for multiple genetic signals to one or more phenotypes, and at two loci, in the genes ABCA1 and CETP, we found significant gene-level evidence of association to non-synonymous variants with MAF<1%. Additionally, two potentially deleterious variants that demonstrated significant associations (rs138726309, a missense variant in G6PC2, and rs28933094, a missense variant in LIPC) were considerably more common in these Finnish samples than in European reference populations, supporting our prior hypothesis that deleterious variants could attain high frequencies in this isolated population, likely due to the effects of population bottlenecks. Our results highlight the value of large, well-phenotyped samples for rare-variant association analysis, and the challenge of evaluating the phenotypic impact of such variants.

15.
Acquisition of genetic material from viruses by their hosts can generate inter-host structural genome variation. We developed computational tools enabling us to study virus-derived structural variants (SVs) in population-scale whole genome sequencing (WGS) datasets and applied them to 3,332 humans. Although SVs had already been cataloged in these subjects, we found previously overlooked virus-derived SVs. We detected non-germline SVs derived from squirrel monkey retrovirus (SMRV), human immunodeficiency virus 1 (HIV-1), and human T lymphotropic virus (HTLV-1); these variants are attributable to infection of the sequenced lymphoblastoid cell lines (LCLs) or their progenitor cells and may impact gene expression results and the biosafety of experiments using these cells. In addition, we detected new heritable SVs derived from human herpesvirus 6 (HHV-6) and human endogenous retrovirus-K (HERV-K). We report the first solo-direct repeat (DR) HHV-6 likely to reflect DR rearrangement of a known full-length endogenous HHV-6. We used linkage disequilibrium between single nucleotide variants (SNVs) and variants in reads that align to HERV-K, which often cannot be mapped uniquely using conventional short-read sequencing analysis methods, to locate previously unknown polymorphic HERV-K loci. Some of these loci are tightly linked to trait-associated SNVs, some are in complex genome regions inaccessible by prior methods, and some contain novel HERV-K haplotypes likely derived from gene conversion from an unknown source or introgression. These tools and results broaden our perspective on the coevolution between viruses and humans, including ongoing virus-to-human gene transfer contributing to genetic variation between humans.

16.
Genome structural variation (SV) is a major source of genetic diversity in mammals and a hallmark of cancer. Although SV is typically defined by its canonical forms (duplication, deletion, insertion, inversion and translocation), recent breakpoint mapping studies have revealed a surprising number of 'complex' variants that evade simple classification. Complex SVs are defined by clustered breakpoints that arose through a single mutation but cannot be explained by one simple end-joining or recombination event. Some complex variants exhibit profoundly complicated rearrangements between distinct loci from multiple chromosomes, whereas others involve more subtle alterations at a single locus. These diverse and unpredictable features present a challenge for SV mapping experiments. Here, we review current knowledge of complex SV in mammals, and outline techniques for identifying and characterizing complex variants using next-generation DNA sequencing.

17.
18.
The autoimmune thyroid diseases (AITD) include Graves' disease (GD) and Hashimoto's thyroiditis (HT), which are characterised by a breakdown in immune tolerance to thyroid antigens. Unravelling the genetic architecture of AITD is vital to a better understanding of AITD pathogenesis, which is required to advance therapeutic options in both disease management and prevention. The early whole-genome linkage and candidate gene association studies provided the first evidence that the HLA region and CTLA-4 represented AITD risk loci. Recent improvements in high throughput genotyping technologies, collection of larger disease cohorts and cataloguing of genome-scale variation have facilitated genome-wide association studies and more thorough screening of candidate gene regions. This has allowed identification of many novel AITD risk genes and more detailed association mapping. The growing number of confirmed AITD susceptibility loci implicates a number of putative disease mechanisms, most of which are tightly linked with aspects of immune system function. The unprecedented advances in genetic study will allow future studies to identify further novel disease risk genes and to identify aetiological variants within specific gene regions, which will undoubtedly lead to a better understanding of AITD patho-physiology.

19.
The domestic dog serves as an excellent model to investigate the genetic basis of disease. More than 400 heritable traits analogous to human diseases have been described in dogs. To further canine medical genetics research, we established the Dog Biomedical Variant Database Consortium (DBVDC) and present a comprehensive list of functionally annotated genome variants that were identified with whole genome sequencing of 582 dogs from 126 breeds and eight wolves. The genomes used in the study have a minimum coverage of 10× and an average coverage of ~24×. In total, we identified 23 133 692 single‐nucleotide variants (SNVs) and 10 048 038 short indels, including 93% undescribed variants. On average, each individual dog genome carried ~4.1 million single‐nucleotide and ~1.4 million short‐indel variants with respect to the reference genome assembly. About 2% of the variants were located in coding regions of annotated genes and loci. Variant effect classification showed 247 141 SNVs and 99 562 short indels having moderate or high impact on 11 267 protein‐coding genes. On average, each genome contained heterozygous loss‐of‐function variants in 30 potentially embryonic lethal genes and 97 genes associated with developmental disorders. More than 50 inherited disorders and traits have been unravelled using the DBVDC variant catalogue, enabling genetic testing for breeding and diagnostics. This resource of annotated variants and their corresponding genotype frequencies constitutes a highly useful tool for the identification of potential variants causative for rare inherited disorders in dogs.

20.
Most natural populations display substantial genetic variation in behaviour, morphology, physiology, life history and the susceptibility to disease. A major challenge is to determine the contributions of individual loci to variation in complex traits. Quantitative trait locus (QTL) mapping has identified genomic regions affecting ecologically significant traits of many species. In nearly all cases, however, the importance of these QTLs to population variation remains unclear. In this paper, we apply a novel experimental method to parse the genetic variance of floral traits of the annual plant Mimulus guttatus into contributions of individual QTLs. We first use QTL-mapping to identify nine loci and then conduct a population-based breeding experiment to estimate V(Q), the genetic variance attributable to each QTL. We find that three QTLs with moderate effects explain up to one-third of the genetic variance in the natural population. Variation at these loci is probably maintained by some form of balancing selection. Notably, the largest effect QTLs were relatively minor in their contribution to heritability.
