Similar literature
20 similar articles found (search time: 15 ms)
1.

Background

Many tools exist to predict structural variants (SVs), utilizing a variety of algorithms. However, they have largely been developed and tested on human germline or somatic (e.g. cancer) variation. It seems appropriate to exploit this wealth of technology available for humans also for other species. Objectives of this work included:
  1. Creating an automated, standardized pipeline for SV prediction.
  2. Identifying the best tool(s) for SV prediction through benchmarking.
  3. Providing a statistically sound method for merging SV calls.

Results

The SV-AUTOPILOT meta-tool platform is an automated pipeline for standardization of SV prediction and SV tool development in paired-end next-generation sequencing (NGS) analysis. SV-AUTOPILOT comes in the form of a virtual machine, which includes all datasets, tools and algorithms presented here. The virtual machine easily allows one to add, replace and update genomes, SV callers and post-processing routines and therefore provides an easy, out-of-the-box environment for complex SV discovery tasks. SV-AUTOPILOT was used to make a direct comparison between 7 popular SV tools on the Arabidopsis thaliana genome using the Landsberg (Ler) ecotype as a standardized dataset. Recall and precision measurements suggest that Pindel and Clever were the most adaptable to this dataset across all size ranges while Delly performed well for SVs larger than 250 nucleotides. A novel, statistically-sound merging process, which can control the false discovery rate, reduced the false positive rate on the Arabidopsis benchmark dataset used here by >60%.

Conclusion

SV-AUTOPILOT provides a meta-tool platform for future SV tool development and the benchmarking of tools on other genomes using a standardized pipeline. It optimizes detection of SVs in non-human genomes using statistically robust merging. The benchmarking in this study has demonstrated the power of 7 different SV tools for analyzing different size classes and types of structural variants. The optional merge feature enriches the call set and reduces false positives providing added benefit to researchers planning to validate SVs. SV-AUTOPILOT is a powerful, new meta-tool for biologists as well as SV tool developers.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1376-9) contains supplementary material, which is available to authorized users.
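
The recall and precision benchmarking described in the Results of entry 1 can be illustrated with a small sketch. This is not the SV-AUTOPILOT implementation: the matching rule (at least 50% reciprocal overlap between a call and a truth interval) and all function names are assumptions chosen for illustration only.

```python
from typing import List, Tuple

# (chromosome, start, end); half-open coordinates with end > start (assumed convention)
Interval = Tuple[str, int, int]


def reciprocal_overlap(a: Interval, b: Interval) -> float:
    """Return the smaller of the two overlap fractions, or 0.0 if the intervals are disjoint."""
    if a[0] != b[0]:
        return 0.0
    overlap = min(a[2], b[2]) - max(a[1], b[1])
    if overlap <= 0:
        return 0.0
    return min(overlap / (a[2] - a[1]), overlap / (b[2] - b[1]))


def precision_recall(calls: List[Interval], truth: List[Interval],
                     min_overlap: float = 0.5) -> Tuple[float, float]:
    """Precision = matched calls / all calls; recall = matched truth / all truth."""
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls
    )
    matched_truth = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth
    )
    precision = matched_calls / len(calls) if calls else 0.0
    recall = matched_truth / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the benchmark in the abstract
truth_set = [("chr1", 100, 200), ("chr1", 500, 800)]
call_set = [("chr1", 110, 210), ("chr2", 100, 200)]
print(precision_recall(call_set, truth_set))
```

A real benchmark would typically also match on SV type and stratify by size class, as the abstract does when it reports results across size ranges.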

2.

Background

Genomic deletions, inversions, and other rearrangements known collectively as structural variations (SVs) are implicated in many human disorders. Technologies for sequencing DNA provide a potentially rich source of information in which to detect breakpoints of structural variations at base-pair resolution. However, accurate prediction of SVs remains challenging, and existing informatics tools predict rearrangements with significant rates of false positives or negatives.

Results

To address this challenge, we developed ‘Structural Variation detection by STAck and Tail’ (SV-STAT) which implements a novel scoring metric. The software uses this statistic to quantify evidence for structural variation in genomic regions suspected of harboring rearrangements. To demonstrate SV-STAT, we used targeted and genome-wide approaches. First, we applied a custom capture array followed by Roche/454 and SV-STAT to three pediatric B-lineage acute lymphoblastic leukemias, identifying five structural variations joining known and novel breakpoint regions. Next, we detected SVs genome-wide in paired-end Illumina data collected from additional tumor samples. SV-STAT showed predictive accuracy as high as or higher than leading alternatives. The software is freely available under the terms of the GNU General Public License version 3 at https://gitorious.org/svstat/svstat.

Conclusions

SV-STAT works across multiple sequencing chemistries, paired and single-end technologies, targeted or whole-genome strategies, and it complements existing SV-detection software. The method is a significant advance towards accurate detection and genotyping of genomic rearrangements from DNA sequencing data.

3.

Background

Accurate catalogs of structural variants (SVs) in mammalian genomes are necessary to elucidate the potential mechanisms that drive SV formation and to assess their functional impact. Next generation sequencing methods for SV detection are an advance on array-based methods, but are almost exclusively limited to four basic types: deletions, insertions, inversions and copy number gains.

Results

By visual inspection of 100 Mbp of genome to which next generation sequence data from 17 inbred mouse strains had been aligned, we identify and interpret 21 paired-end mapping patterns, which we validate by PCR. These paired-end mapping patterns reveal a greater diversity and complexity in SVs than previously recognized. In addition, Sanger-based sequence analysis of 4,176 breakpoints at 261 SV sites reveals additional complexity at approximately a quarter of the structural variants analyzed. We find micro-deletions and micro-insertions at SV breakpoints, ranging from 1 to 107 bp, and SNPs that extend breakpoint micro-homology and may catalyze SV formation.

Conclusions

An integrative approach using experimental analyses to train computational SV calling is essential for the accurate resolution of the architecture of SVs. We find considerable complexity in SV formation; about a quarter of SVs in the mouse are composed of a complex mixture of deletion, insertion, inversion and copy number gain. Computational methods can be adapted to identify most paired-end mapping patterns.

4.
Genome-wide association studies (GWAS) have evolved over the last ten years into a powerful tool for investigating the genetic architecture of human disease. In this work, we review the key concepts underlying GWAS, including the architecture of common diseases, the structure of common human genetic variation, technologies for capturing genetic information, study designs, and the statistical methods used for data analysis. We also look forward to the future beyond GWAS.

What to Learn in This Chapter

  • Basic genetic concepts that drive genome-wide association studies
  • Genotyping technologies and common study designs
  • Statistical concepts for GWAS analysis
  • Replication, interpretation, and follow-up of association results
This article is part of the “Translational Bioinformatics” collection for PLOS Computational Biology.

5.
6.
7.
Highlights

  • Individual DNA molecules hundreds of kbp long may be stretched and visualized by optical microscopy.
  • An optical barcode is generated by fluorescent labeling of short sequence motifs along the stretched DNA.
  • Optical maps complement DNA sequencing for gap closing, finishing, validation and de novo assembly of genomes.
  • Genome structural variations not accessible to sequencing or DNA arrays may be directly visualized.
  • Epigenetic marks such as DNA methylation and DNA binding proteins may also be mapped on single genomic fragments.

8.

Background

Characterizing large genomic variants is essential to expanding the research and clinical applications of genome sequencing. While multiple data types and methods are available to detect these structural variants (SVs), they remain less characterized than smaller variants because of SV diversity, complexity, and size. These challenges are exacerbated by the experimental and computational demands of SV analysis. Here, we characterize the SV content of a personal genome with Parliament, a publicly available consensus SV-calling infrastructure that merges multiple data types and SV detection methods.

Results

We demonstrate Parliament’s efficacy via integrated analyses of data from whole-genome array comparative genomic hybridization, short-read next-generation sequencing, long-read (Pacific BioSciences RSII), long-insert (Illumina Nextera), and whole-genome architecture (BioNano Irys) data from the personal genome of a single subject (HS1011). From this genome, Parliament identified 31,007 genomic loci between 100 bp and 1 Mbp that are inconsistent with the hg19 reference assembly. Of these loci, 9,777 are supported as putative SVs by hybrid local assembly, long-read PacBio data, or multi-source heuristics. These SVs span 59 Mbp of the reference genome (1.8%) and include 3,801 events identified only with long-read data. The HS1011 data and complete Parliament infrastructure, including a BAM-to-SV workflow, are available on the cloud-based service DNAnexus.

Conclusions

HS1011 SV analysis reveals the limits and advantages of multiple sequencing technologies, specifically the impact of long-read SV discovery. With the full Parliament infrastructure, the HS1011 data constitute a public resource for novel SV discovery, software calibration, and personal genome structural variation analysis.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1479-3) contains supplementary material, which is available to authorized users.

9.

Background

Generation of long (>5 Kb) DNA sequencing reads provides an approach for interrogation of complex regions in the human genome. Currently, large-insert whole genome sequencing (WGS) technologies from Pacific Biosciences (PacBio) enable analysis of chromosomal structural variations (SVs), but the cost to achieve the required sequence coverage across the entire human genome is high.

Results

We developed a method (termed PacBio-LITS) that combines oligonucleotide-based DNA target-capture enrichment technologies with PacBio large-insert library preparation to facilitate SV studies at specific chromosomal regions. PacBio-LITS provides deep sequence coverage at the specified sites at substantially reduced cost compared with PacBio WGS. The efficacy of PacBio-LITS is illustrated by delineating the breakpoint junctions of low copy repeat (LCR)-associated complex structural rearrangements on chr17p11.2 in patients diagnosed with Potocki–Lupski syndrome (PTLS; MIM#610883). We successfully identified previously determined breakpoint junctions in three PTLS cases, and also were able to discover novel junctions in repetitive sequences, including LCR-mediated breakpoints. The new information has enabled us to propose mechanisms for formation of these structural variants.

Conclusions

The new method leverages the cost efficiency of targeted capture-sequencing as well as the mappability and scaffolding capabilities of long sequencing reads generated by the PacBio platform. It is therefore suitable for studying complex SVs, especially those involving LCRs, inversions, and the generation of chimeric Alu elements at the breakpoints. Other genomic research applications, such as haplotype phasing and small insertion and deletion validation could also benefit from this technology.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1370-2) contains supplementary material, which is available to authorized users.

10.

Background

To gain biological insights into lung metastases from hepatocellular carcinoma (HCC), we compared the whole-genome sequencing profiles of primary HCC and paired lung metastases.

Methods

We used whole-genome sequencing at 33X-43X coverage to profile somatic mutations in primary HCC (HBV+) and metachronous lung metastases (> 2 years interval).

Results

In total, 5,027-13,961 and 5,275-12,624 somatic single-nucleotide variants (SNVs) were detected in primary HCC and lung metastases, respectively. Generally, 38.88-78.49% of SNVs detected in metastases were present in primary tumors. We identified 65–221 structural variations (SVs) in primary tumors and 60–232 SVs in metastases. Comparison of these SVs shows very similar and largely overlapping mutated segments between primary and metastatic tumors. Copy number alterations between primary and metastatic pairs were also found to be closely related. Together, this preservation of genomic profiles from primary liver tumors to metachronous lung metastases indicates that the genomic features present during tumorigenesis may be retained during metastasis.

Conclusions

We found very similar genomic alterations between primary and metastatic tumors, with a few mutations found specifically in lung metastases, which may explain the clinical observation that both primary and metastatic tumors are usually sensitive or resistant to the same systemic treatments.

11.
Highlights

  • Theoretical model describing mutational processes operative in cancer genomes
  • Computational framework for deciphering signatures of mutational processes
  • Extensive evaluation of the computational framework with simulated data
  • Application to mutational catalogs of breast cancer genomes and exomes

12.
Copy number variation (CNV) contributes in phenotypically relevant ways to the genetic variability of many organisms. Cost-effective genomewide methods for identifying copy number variation are necessary to elucidate the contribution that these structural variants make to the genomes of model organisms. We have developed a novel approach for the identification of copy number variation by next generation sequencing. As a proof of concept our method has been applied to map the deletions of three Drosophila deficiency strains. We demonstrate that low sequence coverage is sufficient for identifying and mapping large deletions at kilobase resolution, suggesting that data generated from high-throughput sequencing experiments are sufficient for simultaneously analyzing many strains. Genomic DNA from two Drosophila deficiency stocks was barcoded and sequenced in multiplex, and the breakpoints associated with each deletion were successfully identified. The approach we describe is immediately applicable to the systematic exploration of copy number variation in model organisms and humans.

Structural variation is known to contribute extensively to the genetic variability of humans, mammals, and many model organisms. One class of structural variant, termed copy number variation (CNV), includes deletions, duplications, insertions, and genomic rearrangements which affect the number of occurrences of a specific DNA sequence present in the genome (Redon et al. 2006). CNV is known to occur extensively in the Drosophila genome with functionally significant consequences (Bridges 1936; Dopman and Hartl 2007; Tibshirani and Wang 2008; Zhou et al. 2008). In one study of 15 Drosophila strains, as many as 10% of genes were observed to harbor CNVs (Emerson et al. 2008). Cryptic CNVs that affect the phenotype observed in a model organism have the potential to confound research on multiple levels. For example, a recent report indicates that terminal deletions on chromosome (chr) 2L are frequent among deficiency kit stocks with mutations on the second chromosome and that the associated deletion of lgl has distorted the results of several previous studies (Roegiers et al. 2009). Despite widespread existence of CNV, the biological consequences of this phenomenon remain largely unexplored due to the lack of efficient tools for detection and characterization.

Until recently, comparative genomic hybridization with whole-genome tiling arrays (array-CGH) was the primary method for characterizing CNVs (Carter 2007); however, several limitations for this platform reduce its efficacy and efficiency. First, cross-hybridization and reliance on intensity scores lead to data that are difficult to interpret. Second, custom array design and optimization is labor intensive and costly. Third, array-CGH methods can only detect CNV, not other complex rearrangements such as balanced translocations and inversions. Finally, the overall cost of array-CGH methods is relatively high, particularly when high-resolution, whole-genome tiling arrays are employed.

Direct sequencing using next-generation technology has several advantages that make it a potentially powerful alternative to array-CGH for identifying genomic structural variations, including deletions, duplications, and rearrangements (Campbell et al. 2008; Chiang et al. 2009). First, high-throughput sequencing methods overcome the inherent limitations of cross-hybridization and provide a digital count of sequence representation. Second, no prior knowledge or design work is necessary. Third, using paired-end sequencing it is possible to identify complex structural variations. Finally, the current cost of CNV discovery by sequencing is comparable or lower than that of array-CGH and is continuing to decline.

In this report, we describe a sequencing-based strategy for high-throughput, cost-effective, genomewide characterization of structural variation at fine resolution by employing the Illumina sequencing platform. Deletions in three deficiency fly stocks were successfully characterized and the associated breakpoints were accurately determined. As we demonstrate, high-throughput sequencing provides an ideal and cost-effective platform for CNV characterization.
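
The idea in entry 12 that sequencing gives "a digital count of sequence representation" can be sketched as a simple read-depth scan. This is an illustrative toy, not the authors' method: the fixed window size, the median-based threshold, and the `max_ratio` value are all assumptions.

```python
from typing import List, Tuple


def window_counts(read_starts: List[int], genome_length: int, window: int) -> List[int]:
    """Count reads whose start position falls in each fixed-size window."""
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in read_starts:
        if 0 <= pos < genome_length:
            counts[pos // window] += 1
    return counts


def candidate_deletions(counts: List[int], window: int,
                        max_ratio: float = 0.6) -> List[Tuple[int, int]]:
    """Merge consecutive windows whose depth is below max_ratio * median depth.

    Returns half-open (start, end) coordinates. The 0.6 default is an arbitrary
    illustrative cutoff, not a value from the abstract.
    """
    ordered = sorted(counts)
    median = ordered[len(ordered) // 2]
    threshold = max_ratio * median
    regions = []
    start = None
    for i, depth in enumerate(counts):
        if depth < threshold:
            if start is None:
                start = i  # open a new low-coverage run
        elif start is not None:
            regions.append((start * window, i * window))
            start = None
    if start is not None:  # run extends to the end of the sequence
        regions.append((start * window, len(counts) * window))
    return regions


# Hypothetical per-window depths with a dip in windows 3 and 4
depths = [10, 11, 9, 5, 4, 10, 12]
print(candidate_deletions(depths, window=1000))
```

Because the resolution of such a scan is bounded by the window size, a coarse window is consistent with the kilobase-scale mapping the abstract reports for low-coverage data; breakpoint-level precision would need additional evidence such as paired-end or split reads.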

13.
14.
15.

Background

Double minute chromosomes are circular fragments of DNA whose presence is associated with the onset of certain cancers. Double minutes are lethal, as they are highly amplified and typically contain oncogenes. Locating double minutes can supplement the process of cancer diagnosis, and it can help to identify therapeutic targets. However, there is currently a dearth of computational methods available to identify double minutes. We propose a computational framework for the identification of double minute chromosomes using next-generation sequencing data. Our framework integrates predictions from algorithms that detect DNA copy number variants, and it also integrates predictions from algorithms that locate genomic structural variants. This information is used by a graph-based algorithm to predict the presence of double minute chromosomes.

Results

Using a previously published copy number variant algorithm and two structural variation prediction algorithms, we implemented our framework and tested it on a dataset consisting of simulated double minute chromosomes. Our approach uncovered double minutes with high accuracy, demonstrating its plausibility.

Conclusions

Although we only tested the framework with three programs (RDXplorer, BreakDancer, Delly), it can be extended to incorporate results from programs that (1) detect amplified copy number or (2) detect genomic structural variants such as deletions, translocations, inversions, and tandem repeats. The software that implements the framework can be accessed here: https://github.com/mhayes20/DMFinder
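
The graph-based idea in entry 15 (amplified segments joined by structural-variant junctions into a circle) can be sketched as cycle detection. This is a simplified illustration, not the DMFinder algorithm: the data layout, the `tolerance` parameter, and the choice to ignore strand orientation are all assumptions.

```python
from typing import Dict, List, Optional, Tuple

Segment = Tuple[str, int, int]         # amplified segment: (chromosome, start, end)
Junction = Tuple[str, int, str, int]   # SV junction: (chrom_a, pos_a, chrom_b, pos_b)


def build_graph(segments: List[Segment], junctions: List[Junction],
                tolerance: int = 500) -> Dict[int, List[int]]:
    """Add edge i -> j when a junction links the end of segment i to the start of segment j."""
    graph: Dict[int, List[int]] = {i: [] for i in range(len(segments))}
    for chrom_a, pos_a, chrom_b, pos_b in junctions:
        for i, (chrom_i, _, end_i) in enumerate(segments):
            if chrom_i != chrom_a or abs(end_i - pos_a) > tolerance:
                continue
            for j, (chrom_j, start_j, _) in enumerate(segments):
                if chrom_j == chrom_b and abs(start_j - pos_b) <= tolerance:
                    graph[i].append(j)
    return graph


def find_cycle(graph: Dict[int, List[int]]) -> Optional[List[int]]:
    """Return one directed cycle as a list of segment indices, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {node: WHITE for node in graph}
    path: List[int] = []

    def visit(node: int) -> Optional[List[int]]:
        colour[node] = GREY
        path.append(node)
        for nxt in graph[node]:
            if colour[nxt] == GREY:  # back edge closes a cycle
                return path[path.index(nxt):]
            if colour[nxt] == WHITE:
                found = visit(nxt)
                if found is not None:
                    return found
        path.pop()
        colour[node] = BLACK
        return None

    for node in graph:
        if colour[node] == WHITE:
            cycle = visit(node)
            if cycle is not None:
                return cycle
    return None


# Hypothetical toy input: two amplified segments joined head-to-tail into a circle
amplified = [("chr8", 1000, 5000), ("chr8", 20000, 26000), ("chr12", 300, 900)]
links = [("chr8", 5000, "chr8", 20000), ("chr8", 26010, "chr8", 990)]
print(find_cycle(build_graph(amplified, links)))
```

A cycle found this way is only a candidate: a realistic predictor would also check breakpoint orientation and require that every segment on the cycle is highly amplified.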

16.
Disease-causing aberrations in the normal function of a gene define that gene as a disease gene. Proving a causal link between a gene and a disease experimentally is expensive and time-consuming. Comprehensive prioritization of candidate genes prior to experimental testing drastically reduces the associated costs. Computational gene prioritization is based on various pieces of correlative evidence that associate each gene with the given disease and suggest possible causal links. A fair amount of this evidence comes from high-throughput experimentation. Thus, well-developed methods are necessary to reliably deal with the quantity of information at hand. Existing gene prioritization techniques already significantly improve the outcomes of targeted experimental studies. Faster and more reliable techniques that account for novel data types are necessary for the development of new diagnostics, treatments, and cures for many diseases.
This article is part of the “Translational Bioinformatics” collection for PLOS Computational Biology.

What to Learn in This Chapter

  • Identification of specific disease genes is complicated by gene pleiotropy, polygenic nature of many diseases, varied influence of environmental factors, and overlying genome variation.
  • Gene prioritization is the process of assigning likelihood of gene involvement in generating a disease phenotype. This approach narrows down, and arranges in the order of likelihood in disease involvement, the set of genes to be tested experimentally.
  • The gene “priority” in disease is assigned by considering a set of relevant features such as gene expression and function, pathway involvement, and mutation effects.
  • In general, disease genes tend to 1) interact with other disease genes, 2) harbor functionally deleterious mutations, 3) code for proteins localizing to the affected biological compartment (pathway, cellular space, or tissue), 4) have distinct sequence properties such as longer length and a higher number of exons, 5) have more orthologues and fewer paralogues.
  • Data sources (directly experimental, extracted from knowledge-bases, or text-mining based) and mathematical/computational models used for gene prioritization vary widely.

17.
Peach belongs to the genus Prunus, which includes Prunus persica and its relative species P. mira, P. davidiana, P. kansuensis, and P. ferganensis. Of these, P. ferganensis has been variously classified as a species, subspecies, or geographical population of P. persica. To explore the genetic difference between P. ferganensis and P. persica, high-throughput sequencing was applied to peach accessions belonging to different species. First, low-depth sequencing data of peach accessions belonging to four categories revealed that the similarity between P. ferganensis and P. persica was comparable to that between P. persica accessions from different geographical populations. Then, to further detect genomic variation in P. ferganensis, the P. ferganensis accession “Xinjiang Pan Tao 1” and the P. persica accession “Xia Miao 1” were sequenced at high depth, and the sequence reads were assembled. The results showed that the collinearity of “Xinjiang Pan Tao 1” with the reference genome “Lovell” was higher than that of “Xia Miao 1” with “Lovell”. Additionally, the number of genetic variants, including single nucleotide polymorphisms (SNPs), structural variations (SVs), and specific genes annotated from unmapped sequence, was higher in “Xia Miao 1” than in “Xinjiang Pan Tao 1”. The data showed that “Xinjiang Pan Tao 1” (P. ferganensis) was genetically closer to the reference genome, which belongs to P. persica, than “Xia Miao 1” (P. persica) was. These results, together with phylogenetic tree and structure analyses, confirmed that P. ferganensis should be considered a geographic population of P. persica rather than a subspecies or a distinct species. Furthermore, gene ontology analysis was performed on genes carrying large-effect variation to understand the phenotypic differences between the two accessions. The results revealed that the gene function pathways affected by SVs, but not those affected by SNPs and insertion-deletions, differed markedly between the two peach accessions.

18.
Structural genomic variations play an important role in human disease and phenotypic diversity. With the rise of high-throughput sequencing tools, mate-pair/paired-end/single-read sequencing has become an important technique for the detection and exploration of structural variation. Several analysis tools exist to handle different parts and aspects of such sequencing-based structural variation analysis pipelines. A comprehensive analysis platform to handle all steps, from processing the sequencing data to the discovery and visualization of structural variants, is missing. The ViVar platform is built to handle the discovery of structural variants, from depth-of-coverage analysis and aberrant read pair clustering to split read analysis. ViVar provides you with powerful visualization options, enables easy reporting of results, and offers better usability and data management. The platform facilitates the processing, analysis and visualization of structural variation based on massively parallel sequencing data, enabling the rapid identification of disease loci or genes. ViVar allows you to scale your analysis with your workload over multiple (cloud) servers, has user access control to keep your data safe, and is easily expandable as analysis techniques advance. URL: https://www.cmgg.be/vivar/

19.
Modern experimental strategies often generate genome-scale measurements of human tissues or cell lines in various physiological states. Investigators often use these datasets individually to help elucidate molecular mechanisms of human diseases. Here we discuss approaches that effectively weight and integrate hundreds of heterogeneous datasets into gene-gene networks that focus on a specific process or disease. Diverse and systematic genome-scale measurements provide such approaches with both a great deal of power and a number of challenges. We discuss some of these challenges as well as methods to address them. We also raise important considerations for the assessment and evaluation of such approaches. When carefully applied, these integrative data-driven methods can make novel high-quality predictions that can transform our understanding of the molecular basis of human disease.

What to Learn in This Chapter

  • What a functional relationship network represents.
  • The fundamentals of Bayesian inference for genomic data integration.
  • How to build a network of functional relationships between genes using examples of functionally related genes and diverse experimental data.
  • How computational scientists study disease using data driven approaches, such as integrated networks of protein-protein functional relationships.
  • Strategies to assess predictions from a functional relationship network
This article is part of the “Translational Bioinformatics” collection for PLOS Computational Biology.
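
The Bayesian data integration mentioned in the learning goals of entry 19 can be illustrated with a naive Bayes sketch: each dataset contributes a likelihood ratio, and the ratios are multiplied into the prior odds that two genes are functionally related. This is a generic textbook formulation under an independence assumption, not the chapter's specific method; the function names, the pseudocount, and the example numbers are hypothetical.

```python
from typing import List


def likelihood_ratio(pos_with: int, pos_total: int,
                     neg_with: int, neg_total: int,
                     pseudocount: float = 1.0) -> float:
    """Estimate P(evidence | related) / P(evidence | unrelated) from gold-standard pairs.

    pos_* count known related gene pairs, neg_* count known unrelated pairs.
    The pseudocount (an assumption) avoids division by zero for sparse data.
    """
    p_pos = (pos_with + pseudocount) / (pos_total + 2 * pseudocount)
    p_neg = (neg_with + pseudocount) / (neg_total + 2 * pseudocount)
    return p_pos / p_neg


def posterior_probability(prior: float, likelihood_ratios: List[float]) -> float:
    """Combine a prior (strictly between 0 and 1) with independent likelihood ratios."""
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio  # naive Bayes: datasets treated as conditionally independent
    return odds / (1.0 + odds)


# Hypothetical example: a prior of 0.5 and one dataset three times more likely
# to show this evidence for related pairs than for unrelated ones
print(posterior_probability(0.5, [3.0]))
```

Heterogeneous datasets are rarely independent in practice, so treating them as such can overstate confidence; that is one reason weighting datasets matters when many of them are integrated.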

20.
The etiologic paradigm of complex human disorders such as autism is that genetic and environmental risk factors are independent and additive, but the interactive effects at the epigenetic interface are largely ignored. Genomic technologies have radically changed perspective on the human genome and how the epigenetic interface may impact complex human disorders. Here, I review recent genomic, environmental and epigenetic findings that suggest a new paradigm of “integrative genomics” in which genetic variation in genomic size may be impacted by dietary and environmental factors that influence the genomic saturation of DNA methylation. Human genomes are highly repetitive, but the interface of large-scale genomic differences with environmental factors that alter the DNA methylome such as dietary folate is under-explored. In addition to obvious direct effects of some environmental toxins on the genome by causing chromosomal breaks, non-mutagenic toxin exposures correlate with DNA hypomethylation that can lead to rearrangements between repeats or increased retrotransposition. Since human neurodevelopment appears to be particularly sensitive to alterations in epigenetic pathways, a further focus will be on how developing neurons may be particularly impacted by even subtle alterations to DNA methylation and proposing new directions towards understanding the quixotic etiology of autism by integrative genomic approaches.

Key words: DNA methylation, copy number variation, autism, neurodevelopment, genomics, epigenomics, epigenetics, folate, folic acid, environmental exposures, Alu, MeCP2, LINE-1
