Similar literature
 20 similar articles found; search took 15 ms
1.
With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes--sets of variants genetically linked because of their proximity on the genome--for large numbers of individuals for use in association studies. This paper presents some algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments.' Formalised models of the biological system under consideration are examined, given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results and algorithms for handling haplotype assembly given the different models are then sketched. The primary conclusion is that some important simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst case.

2.
The haplotype assembly problem seeks the haplotypes of an individual for whom a set of aligned SNP fragments is available. The problem is important because the haplotypes contain all the SNP information, which is essential to studies such as the analysis of the association between specific diseases and their potential genetic causes. With Minimum Error Correction as the objective function, the problem is NP-hard, which raises the demand for effective yet affordable solutions. In this paper, we propose a new method that solves the problem through a novel Max-2-SAT formulation. The proposed method is compared with several well-known algorithms from the literature on a recent extensive benchmark, and it outperforms them all by achieving solutions of higher average quality.
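
The Minimum Error Correction objective mentioned in this abstract can be made concrete with a small sketch. It is not the Max-2-SAT method of the paper; it only scores a candidate haplotype and, for tiny inputs, finds the optimum by brute force. The encoding is an assumption made for illustration: alleles are '0'/'1', '-' marks a site a fragment does not cover, and every site is taken to be heterozygous, so the second haplotype is the complement of the first. All function names are hypothetical.

```python
from itertools import product

GAP = "-"  # assumed marker for a site the fragment does not cover


def flip(haplotype):
    """Complement a 0/1 haplotype (assumes all sites are heterozygous)."""
    return "".join("1" if allele == "0" else "0" for allele in haplotype)


def mismatches(fragment, haplotype):
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments, h1):
    """Sum, over fragments, of the corrections needed to fit the closer haplotype."""
    h2 = flip(h1)
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


def min_mec(fragments, n_sites):
    """Exhaustive search over all 2**n_sites haplotypes; feasible only for toy inputs."""
    best = None
    for bits in product("01", repeat=n_sites):
        h1 = "".join(bits)
        score = mec_score(fragments, h1)
        if best is None or score < best[0]:
            best = (score, h1)
    return best


fragments = ["01-", "-10", "10-", "-00"]
print(mec_score(fragments, "010"))  # 1: only the last fragment needs one correction
print(min_mec(fragments, 3)[0])     # 1: no haplotype pair explains all four fragments
```

The exhaustive search grows as 2^n in the number of SNP sites, which is consistent with the abstract's point that the exact problem is NP-hard and needs better formulations at realistic sizes.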

3.

Background

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. The dimensions (and, therefore, the difficulty) of haplotype assembly problems keep increasing as sequencing technology advances and the lengths of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.

Results

We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.

Conclusion

Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.

4.
5.
Vezzi F, Narzisi G, Mishra B. PloS one 2012, 7(2): e31002
The whole-genome sequence assembly (WGSA) problem is one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the "excess-dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to not-so-realistic results.
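
The abstract's criticism of N50 is easier to follow with the metric written out. The sketch below uses the usual definition (the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length); the function name and the toy contig lengths are illustrative assumptions, not data from the paper.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50 is undefined for an empty list of contigs")


correct_assembly = [80, 70, 50, 40, 30, 20]
# Same bases, but the two longest contigs are wrongly joined into one.
misjoined_assembly = [150, 50, 40, 30, 20]

print(n50(correct_assembly))    # 70
print(n50(misjoined_assembly))  # 150
```

The misjoined assembly scores a higher N50 even though it is less correct, which is the size-only bias the abstract describes.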

6.
Haplotype phasing is one of the most important problems in population genetics, as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information, a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a Hidden Markov Model or assemble haplotypes by overlapping sequencing reads. A recent algorithm, Hap-seq, uses both sequencing data and reference haplotypes; it is a hybrid of a dynamic programming algorithm and a Hidden Markov Model (HMM) and has been shown to be optimal. However, the algorithm requires an extremely large amount of memory, which is not practical for whole-genome datasets. The current algorithm saves intermediate results to disk and reads them back when needed, which significantly affects its practicality. In this work, we propose an expedited version of the algorithm, Hap-seqX, which addresses the memory issue by using a posterior probability to select the records that should be kept in memory. We show that Hap-seqX can keep all the intermediate results in memory and dramatically improves the execution time of the algorithm. With this strategy, Hap-seqX is able to predict haplotypes from whole-genome sequencing data.

7.
Warren RL, Holt RA. PloS one 2011, 6(5): e19816
As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy to use tool for targeted assembly.
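
The read-selection step described in this abstract (keeping only reads that share an exact short word with a target sequence) can be sketched as follows. This is not TASR's implementation: the word length `k`, the function names, and the decision to look at the forward strand only are assumptions made to keep the example short, and the assembly step that follows selection is omitted.

```python
def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(reads, target, k=15):
    """Keep reads that share at least one exact k-mer with the target.

    Forward strand only; a real tool would also have to consider
    reverse complements.
    """
    target_words = kmers(target, k)
    return [
        read
        for read in reads
        if any(read[i:i + k] in target_words for i in range(len(read) - k + 1))
    ]


target = "ACGTACGTTAGC"
reads = ["TTACGTTAGG", "GGGGGGGGGG", "CGTAC"]
print(recruit_reads(reads, target, k=5))  # ['TTACGTTAGG', 'CGTAC']
```

Because only the recruited reads go on to assembly, the cost of the later steps depends on the target's neighbourhood rather than on the size of the whole data set, which matches the motivation given in the abstract.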

8.
Limitations of next-generation genome sequence assembly   Cited by: 1 (self-citations: 0, citations by others: 1)
High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.

9.
A systematic haplotype and sequencing analysis of the HLA-DR and -DQ region in patients with narcolepsy was performed. Five new (CA)n microsatellite markers were generated and positioned on the physical map across the HLA-DQB1-DQA1-DRB1 interval. Haplotypes for these new markers and the three HLA loci were established using somatic cell hybrids generated from patients. A four-marker haplotype surrounding the DQB1*0602 gene was found in all narcolepsy patients, and was identical to haplotypes observed on random chromosomes harboring the DQB1*0602 allele. Eighty-six kilobases of contiguous genomic sequence across the region did not reveal new genes, and analysis of this sequence for single nucleotide polymorphisms did not reveal sequence variation among DQB1*0602 chromosomes. These results are consistent with other studies, suggesting that the HLA-DQ genes themselves are among the predisposing factors in narcolepsy. Received: 18 March 1997 / Revised: 23 April 1997

10.
A new DNA sequence assembly program.   Cited by: 52 (self-citations: 3, citations by others: 49)
We describe the Genome Assembly Program (GAP), a new program for DNA sequence assembly. The program is suitable for large and small projects, a variety of strategies and can handle data from a range of sequencing instruments. It retains the useful components of our previous work, but includes many novel ideas and methods. Many of these methods have been made possible by the program's completely new, and highly interactive, graphical user interface. The program provides many visual clues to the current state of a sequencing project and allows users to interact in intuitive and graphical ways with their data. The program has tools to display and manipulate the various types of data that help to solve and check difficult assemblies, particularly those in repetitive genomes. We have introduced the following new displays: the Contig Selector, the Contig Comparator, the Template Display, the Restriction Enzyme Map and the Stop Codon Map. We have also made it possible to have any number of Contig Editors and Contig Joining Editors running simultaneously even on the same contig. The program also includes a new 'Directed Assembly' algorithm and routines for automatically detecting unfinished segments of sequence, to which it suggests experimental solutions.

11.
Liang S, Grishin NV. Proteins 2004, 54(2): 271-281
We have developed an effective scoring function for protein design. The atomic solvation parameters, together with the weights of energy terms, were optimized so that residues corresponding to the native sequence were predicted with low energy in the training set of 28 protein structures. The solvation energy of non-hydrogen-bonded hydrophilic atoms was considered separately and expressed in a nonlinear way. As a result, our scoring function predicted native residues as the most favorable in 59% of the total positions in 28 proteins. We then tested the scoring function by comparing the predicted stability changes for 103 T4 lysozyme mutants with the experimental values. The correlation coefficients were 0.77 for surface mutations and 0.71 for all mutations. Finally, the scoring function combined with Monte Carlo simulation was used to predict favorable sequences on a fixed backbone. The designed sequences were similar to the natural sequences of the family to which the template structure belonged. The profile of the designed sequences was helpful for identification of remote homologues of the native sequence.

12.
Cattle are divided into 2 groups referred to as taurine and indicine, both of which have been under strong artificial selection due to their importance for human nutrition. A side effect of this domestication is a loss of genetic diversity within each specialized breed. Recently, the first taurine genome was sequenced and assembled, allowing for a better understanding of this ruminant species. However, genetic information from indicine breeds has been limited. Here, we present the first genome sequence of an indicine breed (Nellore), generated with 52X coverage on the SOLiD sequencing platform. As expected, both genomes share high similarity at the nucleotide level for all autosomes and the X chromosome. For the Y chromosome, the homology was considerably lower, most likely due to the incomplete assembly of the taurine Y chromosome. We were also able to cover 97% of the annotated taurine protein-coding genes.

13.
14.
SUMMARY: To annotate newly sequenced organisms, cross-species sequence comparison algorithms can be applied to align gene sequences to the genome of a related species. To improve the accuracy of alignment, spaced seeds must be optimized for each comparison. As the number and diversity of genomes increase, an efficient alternative is to cluster pairwise comparisons into groups and identify seeds for groups instead of individual comparisons. Here we investigate a measure of comparison closeness and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. AVAILABILITY: Source code is freely available at http://dna.cs.gwu.edu and from Bioinformatics online.
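
For readers unfamiliar with spaced seeds, the sketch below shows what such a seed does during the hit-finding stage of an alignment. A seed is written as a string of '1' (positions that must match) and '0' (positions that are ignored). The seed pattern, sequences, and function name are illustrative assumptions; the sketch does not reproduce the seed optimization or clustering procedure of the paper.

```python
def spaced_seed_hits(query, subject, seed):
    """Return (query_pos, subject_pos) pairs whose windows agree at every '1' in seed."""
    care = [i for i, flag in enumerate(seed) if flag == "1"]
    span = len(seed)

    # Index every subject window by the letters at the "care" positions.
    index = {}
    for j in range(len(subject) - span + 1):
        key = "".join(subject[j + i] for i in care)
        index.setdefault(key, []).append(j)

    # Look up every query window in that index.
    hits = []
    for q in range(len(query) - span + 1):
        key = "".join(query[q + i] for i in care)
        for j in index.get(key, []):
            hits.append((q, j))
    return hits


query, subject = "ACGTA", "ACCTA"               # differ at the third base
print(spaced_seed_hits(query, subject, "1101"))  # [(0, 0)]
print(spaced_seed_hits(query, subject, "1111"))  # []
```

The spaced pattern tolerates the substitution that a contiguous seed of the same span misses, which is why the choice of seed affects sensitivity and why the abstract treats seed selection per comparison (or per group of comparisons) as something worth optimizing.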

15.
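With the development of next-generation sequencing technology, new assembly algorithms have emerged. This paper introduces the basic principles and specific steps of several new assembly algorithms that are currently widely recognized internationally, and analyzes the strengths, weaknesses, and scope of application of each. Illumina 1G sequencing data from Helicobacter acinonychis are used to test the performance of SSAKE, VCAKE, SHARCGS, and velvet, and an outlook on future research into assembly algorithms is given.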

16.
17.
18.
A multiproduct assembly system produces a family of similar products, where the assembly of each product entails an ordered set of tasks. An assembly system consists of a sequence of workstations. For each workstation, we assign a subset of the assembly tasks to be performed at the workstation and select the type of assembly equipment or resource to be used by the workstation. The assembly of each product requires a visit to each workstation in the fixed sequence. The problem of system design is to find the system that is capable of producing all the products in the desired volumes at minimum cost. The system cost includes the fixed capital costs for the assembly equipment and tools and the variable operating costs for the various workstations. We present and illustrate an optimization procedure that assigns tasks to workstations and selects assembly equipment for each workstation.

19.
Production of various structures by self-assembling single-stranded DNA molecules is a widely used technology in the field of DNA nanotechnology. The base sequences of the single strands determine the shape of the resulting nanostructure, so sequence design is crucial for successful structure fabrication. This paper presents a sequence design algorithm based on mismatch minimization that can be applied to any desired DNA structure. The algorithm can handle junctions, loops, single- as well as double-stranded regions, and very large structures of up to several thousand base pairs, and it is fast for most structures. The algorithm is implemented in Java; the implementation is called Seed and is publicly available. As an example of successful sequence generation, this paper also presents the fabrication of DNA chain molecules consisting of double-crossover (DX) tiles.

20.
We report the sequence of a cDNA encoding a rabbit immunoglobulin gamma heavy chain of d12 and e14 allotypes with high homology to partial cDNA sequences from rabbits of d11 and e15 allotypes. The encoded rabbit protein shows homologies with human (68-70%) and mouse (60-63%) gamma chains. The nucleotide sequence homologies of the CH domains range from 76-84% with human and 64-76% with mouse sequences. Comparison of the portion of VH encoding amino acid positions 34-112 with a previously determined VH sequence of the same allotype shows high conservation of sequences in the second and third framework segments but more marked differences both in length and encoded amino acids of the second and third complementarity-determining regions (CDRs). We also found a high degree of homology with a human genomic V-region, VH26 (77%) and a remarkable similarity between rabbit and human second CDR sequences and human genomic D minigenes. These results provide additional evidence that D minigene sequences share information with the CDR2 portion of VH regions.
