1.

ConPADE: Genome Assembly Ploidy Estimation from Next-Generation Sequencing Data

Gabriel R. A. Margarido David Heckerman 《PLoS computational biology》2015,11(4)

As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. With well-established methodologies, larger and more complex genomes are being tackled, including polyploid plant genomes. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. Unfortunately, such collapse is often not ideal, as keeping contigs separate can lead both to improved assembly and also insights about how haplotypes influence phenotype. Here, we describe a first step in avoiding inappropriate collapse during assembly. In particular, we describe ConPADE (Contig Ploidy and Allele Dosage Estimation), a probabilistic method that estimates the ploidy of any given contig/scaffold based on its allele proportions. In the process, we report findings regarding errors in sequencing. The method can be used for whole genome shotgun (WGS) sequencing data. We also show applicability of the method for variant calling and allele dosage estimation. Results for simulated and real datasets are discussed and provide evidence that ConPADE performs well as long as enough sequencing coverage is available, or the true contig ploidy is low. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.
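
The idea of estimating ploidy from allele proportions can be illustrated with a toy model. The sketch below is not ConPADE's model: it assumes error-free reads, biallelic sites, a binomial read-count model and a uniform prior over allele dosages, and every function name is hypothetical. For each candidate ploidy it averages the binomial likelihood over the possible dosages and picks the ploidy with the highest total log-likelihood.

```python
import math


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def site_log_likelihood(alt: int, depth: int, ploidy: int) -> float:
    """Log-likelihood of one heterozygous site, averaged over dosages.

    Assumption: the alternative allele has dosage d in 1..ploidy-1 with
    equal prior probability, so its expected read fraction is d / ploidy.
    """
    if ploidy < 2:
        raise ValueError("ploidy must be at least 2 for a heterozygous site")
    terms = [log_binom_pmf(alt, depth, d / ploidy) for d in range(1, ploidy)]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms)) - math.log(len(terms))


def estimate_ploidy(sites, candidates=(2, 3, 4, 5, 6)):
    """Return (best_ploidy, scores) for a list of (alt_reads, depth) pairs."""
    scores = {p: sum(site_log_likelihood(a, n, p) for a, n in sites)
              for p in candidates}
    return max(scores, key=scores.get), scores


# Sites with roughly half of the reads carrying the alternative allele
best, _ = estimate_ploidy([(50, 100), (48, 100), (52, 100)])
print(best)
```

Averaging over dosages, rather than taking the best-fitting dosage, is a deliberate choice in this sketch: a tetraploid model can also explain a 1:1 allele ratio, and the averaging penalizes it for the dosages that do not fit.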

2.

Identification of Optimum Sequencing Depth Especially for De Novo Genome Assembly of Small Genomes Using Next Generation Sequencing Data

Aarti Desai Veer Singh Marwah Akshay Yadav Vineet Jha Kishor Dhaygude Ujwala Bangar Vivek Kulkarni Abhay Jere 《PloS one》2013,8(4)

Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads and de novo genome assembly using these short reads is computationally very intensive. Due to lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing which will enable optimum utilization of sequencing as well as analysis resources.
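
To make the depth figures concrete, here is a minimal sketch of the usual depth arithmetic (depth = number of reads × read length / genome size). The 100 bp read length is an assumption for illustration and is not stated in the abstract; the function names are hypothetical.

```python
import math


def sequencing_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(genome_size_bp: int, read_length_bp: int, target_depth: float) -> int:
    """Smallest number of reads that reaches the target average depth."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


# A 4.6-megabase genome at 50X with (assumed) 100 bp reads
n = reads_needed(4_600_000, 100, 50)
print(n, sequencing_depth(n, 100, 4_600_000))
```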

3.

BAC-Pool Sequencing and Analysis of Large Segments of A12 and D12 Homoeologous Chromosomes in Upland Cotton

Ramesh Buyyarapu Ramesh V. Kantety John Z. Yu Zhanyou Xu Russell J. Kohel Richard G. Percy Simone Macmil Graham B. Wiley Bruce A. Roe Govind C. Sharma 《PloS one》2013,8(10)

Although new and emerging next-generation sequencing (NGS) technologies have reduced sequencing costs significantly, much work remains to implement them for de novo sequencing of complex and highly repetitive genomes such as the tetraploid genome of Upland cotton (Gossypium hirsutum L.). Herein we report the results from implementing a novel, hybrid Sanger/454-based BAC-pool sequencing strategy using minimum tiling path (MTP) BACs from Ctg-3301 and Ctg-465, two large genomic segments in A12 and D12 homoeologous chromosomes (Ctg). To enable generation of longer contig sequences in assembly, we implemented a hybrid assembly method to process ~35x data from 454 technology and 2.8-3x data from Sanger method. Hybrid assemblies offered higher sequence coverage and better sequence assemblies. Homology studies revealed the presence of retrotransposon regions like Copia and Gypsy elements in these contigs and also helped in identifying new genomic SSRs. Unigenes were anchored to the sequences in Ctg-3301 and Ctg-465 to support the physical map. Gene density, gene structure and protein sequence information derived from protein prediction programs were used to obtain the functional annotation of these genes. Comparative analysis of both contigs with Arabidopsis genome exhibited synteny and microcollinearity with a conserved gene order in both genomes. This study provides insight about use of MTP-based BAC-pool sequencing approach for sequencing complex polyploid genomes with limited constraints in generating better sequence assemblies to build reference scaffold sequences. Combining the utilities of MTP-based BAC-pool sequencing with current longer and short read NGS technologies in multiplexed format would provide a new direction to cost-effectively and precisely sequence complex plant genomes.

4.

Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

Jiang Du Robert D. Bjornson Zhengdong D. Zhang Yong Kong Michael Snyder Mark B. Gerstein 《PLoS computational biology》2009,5(7)

The goal of human genome re-sequencing is obtaining an accurate assembly of an individual's genome. Recently, there has been great excitement in the development of many technologies for this (e.g. medium and short read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider optimally integrating technologies. Here, we build a simulation toolbox that will help us optimally combine different technologies for genome re-sequencing, especially in reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing. (It is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome.) To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how we can combine different technologies to optimally solve the assembly at low cost. With mapability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using one length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding in arrays is more efficient than just sequencing for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.

5.

SpecHap: a diploid phasing algorithm based on spectral graph theory

Yonghan YU Lingxi Chen Xinyao Miao Shuai Cheng Li 《Nucleic acids research》2021,49(19):e114

Haplotype phasing plays an important role in understanding the genetic data of diploid eukaryotic organisms. Different sequencing technologies (such as next-generation sequencing or third-generation sequencing) produce various genetic data that require haplotype assembly. Although multiple diploid haplotype phasing algorithms exist, only a few will work equally well across all sequencing technologies. In this work, we propose SpecHap, a novel haplotype assembly tool that leverages spectral graph theory. On both in silico and whole-genome sequencing datasets, SpecHap consumed less memory and required less CPU time, yet achieved comparable accuracy with state-of-the-art methods across all the test instances, which comprise sequencing data from next-generation sequencing, linked-reads, high-throughput chromosome conformation capture, PacBio single-molecule real-time, and Oxford Nanopore long-reads. Furthermore, SpecHap successfully phased an individual Ambystoma mexicanum, a species with gigantic diploid genomes, within 6 CPU hours and 945MB peak memory usage, while other tools failed to yield results either due to memory overflow (40GB) or time limit exceeded (5 days). Our results demonstrated that SpecHap is scalable, efficient, and accurate for diploid phasing across many sequencing platforms.

6.

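An overview of the application of high-throughput sequencing technology in whole-genome sequencing of crops

孙健冬 《生物技术进展》2012,2(1):11-15

High-throughput sequencing technology (next-generation sequencing, NGS), which has developed rapidly in recent years, has fully demonstrated its advantages of low cost, high throughput and broad applicability across all areas of life science research. In modern agricultural biotechnology, high-throughput sequencing allows scientists not only to carry out in-depth whole-genome sequencing and resequencing of crops, model plants and different cultivars more economically and efficiently, but also to perform efficient and accurate analyses of genetic variation, molecular markers, linkage maps, epigenetics and transcriptomes across hundreds or thousands of cultivars, thereby improving crop breeding techniques and accelerating the breeding of new varieties. Of these, obtaining the whole-genome sequence of a crop is the foundation for the other studies and analyses. By introducing some recently published work on crop whole-genome sequencing and assembly using high-throughput sequencing, this article illustrates the broad prospects of the technology in modern agricultural biotechnology and the research foundation it has established.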

7.

Assembly errors cause false tandem duplicate regions in the chicken (Gallus gallus) genome sequence

Qu Zhang Niclas Backström 《Chromosoma》2014,123(1-2):165-168

The complexity of eukaryote genomes makes assembly errors inevitable in the process of constructing reference genomes. Next-generation sequencing (NGS) could provide an efficient way to validate previously assembled genomes. Here, we exploited NGS data to interrogate the chicken reference genome and identified 35 pairs of nearly identical regions with >99.5% sequence similarity and a median size of 109 kb. Several lines of evidence, including read depth, the composition of junction sequences, and sequence similarity, suggest that these regions present genome assembly errors and should be excluded from forthcoming genomic studies.

8.

Reference-free SNP calling: improved accuracy by preventing incorrect calls from repetitive genomic regions

Dou J Zhao X Fu X Jiao W Wang N Zhang L Hu X Wang S Bao Z 《Biology direct》2012,7(1):17-9

ABSTRACT: BACKGROUND: Single nucleotide polymorphisms (SNPs) are the most abundant type of genetic variation in eukaryotic genomes and have recently become the marker of choice in a wide variety of ecological and evolutionary studies. The advent of next-generation sequencing (NGS) technologies has made it possible to efficiently genotype a large number of SNPs in the non-model organisms with no or limited genomic resources. Most NGS-based genotyping methods require a reference genome to perform accurate SNP calling. Little effort, however, has yet been devoted to developing or improving algorithms for accurate SNP calling in the absence of a reference genome. RESULTS: Here we describe an improved maximum likelihood (ML) algorithm called iML, which can achieve high genotyping accuracy for SNP calling in the non-model organisms without a reference genome. The iML algorithm incorporates the mixed Poisson/normal model to detect composite read clusters and can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions. Through analysis of simulation and real sequencing datasets, we demonstrate that in comparison with ML or a threshold approach, iML can remarkably improve the accuracy of de novo SNP genotyping and is especially powerful for the reference-free genotyping in diploid genomes with high repeat contents. CONCLUSIONS: The iML algorithm can efficiently prevent incorrect SNP calls resulting from repetitive genomic regions, and thus outperforms the original ML algorithm by achieving much higher genotyping accuracy. Our algorithm is therefore very useful for accurate de novo SNP genotyping in the non-model organisms without a reference genome.

9.

Full-length haplotype reconstruction to infer the structure of heterogeneous virus populations

Francesca Di Giallonardo Armin Töpfer Melanie Rey Sandhya Prabhakaran Yannick Duport Christine Leemann Stefan Schmutz Nottania K. Campbell Beda Joos Maria Rita Lecca Andrea Patrignani Martin Däumer Christian Beisel Peter Rusert Alexandra Trkola Huldrych F. Günthard Volker Roth Niko Beerenwinkel Karin J. Metzner 《Nucleic acids research》2014,42(14):e115

Next-generation sequencing (NGS) technologies enable new insights into the diversity of virus populations within their hosts. Diversity estimation is currently restricted to single-nucleotide variants or to local fragments of no more than a few hundred nucleotides defined by the length of sequence reads. To study complex heterogeneous virus populations comprehensively, novel methods are required that allow for complete reconstruction of the individual viral haplotypes. Here, we show that assembly of whole viral genomes of ∼8600 nucleotides length is feasible from mixtures of heterogeneous HIV-1 strains derived from defined combinations of cloned virus strains and from clinical samples of an HIV-1 superinfected individual. Haplotype reconstruction was achieved using optimized experimental protocols and computational methods for amplification, sequencing and assembly. We comparatively assessed the performance of the three NGS platforms 454 Life Sciences/Roche, Illumina and Pacific Biosciences for this task. Our results prove and delineate the feasibility of NGS-based full-length viral haplotype reconstruction and provide new tools for studying evolution and pathogenesis of viruses.

10.

Rapid Genome Mapping in Nanochannel Arrays for Highly Complete and Accurate De Novo Sequence Assembly of the Complex Aegilops tauschii Genome

Alex R. Hastie Lingli Dong Alexis Smith Jeff Finklestein Ernest T. Lam Naxin Huo Han Cao Pui-Yan Kwok Karin R. Deal Jan Dvorak Ming-Cheng Luo Yong Gu Ming Xiao 《PloS one》2013,8(2)

Next-generation sequencing (NGS) technologies have enabled high-throughput and low-cost generation of sequence data; however, de novo genome assembly remains a great challenge, particularly for large genomes. NGS short reads are often insufficient to create large contigs that span repeat sequences and to facilitate unambiguous assembly. Plant genomes are notorious for containing high quantities of repetitive elements, which, combined with huge genome sizes, have made accurate assembly of these large and complex genomes intractable thus far. Using two-color genome mapping of tiling bacterial artificial chromosome (BAC) clones on nanochannel arrays, we completed high-confidence assembly of a 2.1-Mb, highly repetitive region in the large and complex genome of Aegilops tauschii, the D-genome donor of hexaploid wheat (Triticum aestivum). Genome mapping is based on direct visualization of sequence motifs on single DNA molecules hundreds of kilobases in length. With the genome map as a scaffold, we anchored unplaced sequence contigs, validated the initial draft assembly, and resolved instances of misassembly, some involving contigs <2 kb long, to dramatically improve the assembly from 75% to 95% complete.

11.

Whole genome sequencing and its applications in medical genetics

Jiaxin Wu Mengmeng Wu Ting Chen Rui Jiang 《Quantitative Biology.》2016,4(2):115

Genome sequencing has improved fundamentally since next-generation sequencing (NGS) emerged in the 2000s. The newer technologies use massively parallel short-read DNA sequencing, genome alignment and assembly methods to digitally and rapidly search genomes on a revolutionary scale, which makes large-scale whole genome sequencing (WGS) accessible and practical for researchers. Nowadays, whole genome sequencing is increasingly prevalent in detecting the genetics of diseases, studying causative relations with cancers, making genome-level comparative analyses, reconstructing human population history, and providing clinical implications and instructions. In this review, we first give a typical pipeline of whole genome sequencing, including lab template preparation, sequencing, genome assembly and quality control, variant calling and annotation. We compare whole genome and whole exome sequencing (WES), and explore a wide range of applications of whole genome sequencing for both Mendelian diseases and complex diseases in medical genetics. We highlight the impact of whole genome sequencing in cancer studies, regulatory variant analysis, predictive medicine and precision medicine, and discuss the challenges of whole genome sequencing.

12.

An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis

Cristian Del Fabbro Simone Scalabrin Michele Morgante Federico M. Giorgi 《PloS one》2013,8(12)

Next Generation Sequencing is having an extremely strong impact in biological and medical research and diagnostics, with applications ranging from gene expression quantification to genotyping and genome reconstruction. Sequencing data is often provided as raw reads which are processed prior to analysis. One of the most used preprocessing procedures is read trimming, which aims at removing low quality portions while preserving the longest high quality part of a NGS read. In the current work, we evaluate nine different trimming algorithms in four datasets and three common NGS-based applications (RNA-Seq, SNP calling and genome assembly). Trimming is shown to increase the quality and reliability of the analysis, with concurrent gains in terms of execution time and computational resources needed.
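
As an illustration of what "preserving the longest high quality part" of a read can mean, the sketch below keeps the longest contiguous stretch of bases whose Phred quality is at or above a cutoff. It is a deliberately simple stand-in, not one of the nine algorithms evaluated in the paper; the function name and the default cutoff of 20 are assumptions.

```python
def trim_longest_high_quality(seq: str, quals: list[int], min_q: int = 20):
    """Return the longest contiguous run of bases with quality >= min_q.

    Ties are resolved in favour of the leftmost run. If no base passes
    the cutoff, an empty read is returned.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must have equal length")

    best_start, best_len = 0, 0
    run_start = None  # start index of the current high-quality run
    for i, q in enumerate(quals):
        if q >= min_q:
            if run_start is None:
                run_start = i
            run_len = i - run_start + 1
            if run_len > best_len:
                best_start, best_len = run_start, run_len
        else:
            run_start = None  # a low-quality base ends the current run

    end = best_start + best_len
    return seq[best_start:end], quals[best_start:end]


read = "ACGTACGTAC"
phred = [5, 30, 30, 10, 35, 35, 35, 35, 8, 2]
print(trim_longest_high_quality(read, phred))
```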

13.

Linked‐read sequencing enables haplotype‐resolved resequencing at population scale

Dave Lutgen Raphael Ritter Remi‐André Olsen Holger Schielzeth Joel Gruselius Philip Ewels Jesús T. García Hadoram Shirihai Manuel Schweizer Alexander Suh Reto Burri 《Molecular ecology resources》2020,20(5):1311-1322

The feasibility to sequence entire genomes of virtually any organism provides unprecedented insights into the evolutionary history of populations and species. Nevertheless, many population genomic inferences – including the quantification and dating of admixture, introgression and demographic events, and inference of selective sweeps – are still limited by the lack of high‐quality haplotype information. The newest generation of sequencing technology now promises significant progress. To establish the feasibility of haplotype‐resolved genome resequencing at population scale, we investigated properties of linked‐read sequencing data of songbirds of the genus Oenanthe across a range of sequencing depths. Our results based on the comparison of downsampled (25×, 20×, 15×, 10×, 7×, and 5×) with high‐coverage data (46–68×) of seven bird genomes mapped to a reference suggest that phasing contiguities and accuracies adequate for most population genomic analyses can be reached already with moderate sequencing effort. At 15× coverage, phased haplotypes span about 90% of the genome assembly, with 50% and 90% of phased sequences located in phase blocks longer than 1.25–4.6 Mb (N50) and 0.27–0.72 Mb (N90). Phasing accuracy reaches beyond 99% starting from 15× coverage. Higher coverages yielded higher contiguities (up to about 7 Mb/1 Mb [N50/N90] at 25× coverage), but only marginally improved phasing accuracy. Phase block contiguity improved with input DNA molecule length; thus, higher‐quality DNA may help keeping sequencing costs at bay. In conclusion, even for organisms with gigabase‐sized genomes like birds, linked‐read sequencing at moderate depth opens an affordable avenue towards haplotype‐resolved genome resequencing at population scale.
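
The N50 and N90 statistics quoted above follow the standard definition: the block length L such that blocks of length L or longer contain at least 50% (or 90%) of the total phased sequence. A minimal sketch of that calculation, with a hypothetical function name and made-up block lengths:

```python
def nx(lengths: list[int], x: float) -> int:
    """Return the Nx statistic (e.g. x=50 for N50) of a set of block lengths."""
    if not lengths:
        raise ValueError("at least one block length is required")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * x / 100
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length
    return ordered[-1]  # only reachable if x > 100


blocks = [10, 8, 6, 4, 2]  # illustrative phase-block lengths, arbitrary units
print(nx(blocks, 50), nx(blocks, 90))
```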

14.

Applications and challenges of next-generation sequencing in Brassica species

Lijuan Wei Meili Xiao Alice Hayward Donghui Fu 《Planta》2013,238(6):1005-1024

15.

Whole-genome sequencing and assembly with high-throughput, short-read technologies

Sundquist A Ronaghi M Tang H Pevzner P Batzoglou S 《PloS one》2007,2(5):e484

While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.

16.

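Principles and applications of single-molecule real-time sequencing technology

柳延虎 王璐 于黎 《遗传》2015,37(3):259-268

Single-molecule DNA sequencing is a new generation of sequencing technology developed over the past 10 years, also known as third-generation sequencing; it includes single-molecule real-time sequencing, true single-molecule sequencing, single-molecule nanopore sequencing and related technologies. This article introduces the basic principles, performance and applications of single-molecule real-time (SMRT) sequencing. Compared with Sanger sequencing and next-generation sequencing, SMRT sequencing offers very long reads, short run times, no need for template amplification and direct detection of epigenetic modification sites, giving researchers a new option. At the same time, the low accuracy of SMRT sequencing (about 85%) has been much debated; about 93% of the errors are insertions and deletions, so the data must be error-corrected before being used for genome assembly. SMRT sequencing has already been applied successfully to de novo sequencing and complete assembly of small genomes, and it has shown, or will show, its strengths in epigenetics, transcriptomics and large-genome assembly, advancing genomics research.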

17.

Navigating the tip of the genomic iceberg: Next-generation sequencing for plant systematics

Straub SC Parks M Weitemier K Fishbein M Cronn RC Liston A 《American journal of botany》2012,99(2):349-364

• Premise of the study: Just as Sanger sequencing did more than 20 years ago, next-generation sequencing (NGS) is poised to revolutionize plant systematics. By combining multiplexing approaches with NGS throughput, systematists may no longer need to choose between more taxa or more characters. Here we describe a genome skimming (shallow sequencing) approach for plant systematics. • Methods: Through simulations, we evaluated optimal sequencing depth and performance of single-end and paired-end short read sequences for assembly of nuclear ribosomal DNA (rDNA) and plastomes and addressed the effect of divergence on reference-guided plastome assembly. We also used simulations to identify potential phylogenetic markers from low-copy nuclear loci at different sequencing depths. We demonstrated the utility of genome skimming through phylogenetic analysis of the Sonoran Desert clade (SDC) of Asclepias (Apocynaceae). • Key results: Paired-end reads performed better than single-end reads. Minimum sequencing depths for high quality rDNA and plastome assemblies were 40× and 30×, respectively. Divergence from the reference significantly affected plastome assembly, but relatively similar references are available for most seed plants. Deeper rDNA sequencing is necessary to characterize intragenomic polymorphism. The low-copy fraction of the nuclear genome was readily surveyed, even at low sequencing depths. Nearly 160000 bp of sequence from three organelles provided evidence of phylogenetic incongruence in the SDC. • Conclusions: Adoption of NGS will facilitate progress in plant systematics, as whole plastome and rDNA cistrons, partial mitochondrial genomes, and low-copy nuclear markers can now be efficiently obtained for molecular phylogenetics studies.

18.

Identifying the causes and consequences of assembly gaps using a multiplatform genome assembly of a bird‐of‐paradise

Valentina Peona Mozes P. K. Blom Luohao Xu Reto Burri Shawn Sullivan Ignas Bunikis Ivan Liachko Tri Haryoko Knud A. Jønsson Qi Zhou Martin Irestedt Alexander Suh 《Molecular ecology resources》2021,21(1):263-286

Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies now enable assembling genomes at unprecedented quality and contiguity. However, the difficulty in assembling repeat‐rich and GC‐rich regions (genomic “dark matter”) limits insights into the evolution of genome structure and regulatory networks. Here, we compare the efficiency of currently available sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter. By adopting different de novo assembly strategies, we compare individual draft assemblies to a curated multiplatform reference assembly and identify the genomic features that cause gaps within each assembly. We show that a multiplatform assembly implementing long‐read, linked‐read and proximity sequencing technologies performs best at recovering transposable elements, multicopy MHC genes, GC‐rich microchromosomes and the repeat‐rich W chromosome. Telomere‐to‐telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is now possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects for optimized completeness of both the coding and noncoding parts of nonmodel genomes.

19.

Elucidation of bacterial genome complexity using next-generation sequencing

Jungkon Kim Sooin Lee HyeonSeok Shin Sun Chang Kim Byung-Kwan Cho 《Biotechnology and Bioprocess Engineering》2012,17(5):887-899

20.

SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming

Shreepriya Das Haris Vikalo 《BMC genomics》2015,16(1)

Background

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the length of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.

Results

We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.

Conclusion

Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.