1.

Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem (total citations: 8; self-citations: 0; citations by others: 8)

Lippert R Schwartz R Lancia G Istrail S 《Briefings in bioinformatics》2002,3(1):23-31

With the consensus human genome sequenced and many other sequencing projects at varying stages of completion, greater attention is being paid to the genetic differences among individuals and the abilities of those differences to predict phenotypes. A significant obstacle to such work is the difficulty and expense of determining haplotypes--sets of variants genetically linked because of their proximity on the genome--for large numbers of individuals for use in association studies. This paper presents some algorithmic considerations in a new approach for haplotype determination: inferring haplotypes from localised polymorphism data gathered from short genome 'fragments.' Formalised models of the biological system under consideration are examined, given a variety of assumptions about the goal of the problem and the character of optimal solutions. Some theoretical results and algorithms for handling haplotype assembly given the different models are then sketched. The primary conclusion is that some important simplified variants of the problem yield tractable problems while more general variants tend to be intractable in the worst case.

2.

Effective haplotype assembly via maximum Boolean satisfiability

Mousavi SR Mirabolghasemi M Bargesteh N Talebi M 《Biochemical and biophysical research communications》2011,(2):593-598

The haplotype assembly problem seeks the haplotypes of an individual for whom a set of aligned SNP fragments is available. The problem is important because the haplotypes contain all the SNP information, which is essential to studies such as the analysis of the association between specific diseases and their potential genetic causes. With Minimum Error Correction as the objective function, the problem is NP-hard, which raises the demand for effective yet affordable solutions. In this paper, we propose a new method to solve the problem by providing a novel Max-2-SAT formulation for it. The proposed method is compared with several well-known algorithms from the literature on a recent extensive benchmark, outperforming them all by achieving solutions of higher average quality.
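The Minimum Error Correction objective mentioned in this abstract can be illustrated with a small sketch. It is not the Max-2-SAT method of the paper; it only scores one candidate pair of haplotypes, assuming SNP alleles are encoded as '0'/'1' and that '-' (a symbol chosen here for illustration) marks a site a fragment does not cover. The function names are hypothetical.

```python
from typing import Sequence

GAP = "-"  # assumed symbol for a SNP site the fragment does not cover


def mismatches(fragment: str, haplotype: str) -> int:
    """Count covered sites where the fragment disagrees with the haplotype."""
    return sum(1 for f, h in zip(fragment, haplotype) if f != GAP and f != h)


def mec_score(fragments: Sequence[str], hap_a: str, hap_b: str) -> int:
    """Minimum Error Correction cost of a candidate haplotype pair.

    Each fragment is charged the number of allele flips needed to make it
    consistent with the closer of the two haplotypes.
    """
    return sum(min(mismatches(f, hap_a), mismatches(f, hap_b)) for f in fragments)


# Five aligned fragments over three SNP sites (illustrative data only).
fragments = ["01-", "-10", "10-", "-01", "00-"]
score = mec_score(fragments, "010", "101")
print(score)  # 1: only the fragment "00-" needs a single correction
```

An assembler working under this objective searches for the haplotype pair that minimizes this score; the NP-hardness noted in the abstract concerns that search, not the scoring step shown here.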

3.

SDhaP: haplotype assembly for diploids and polyploids via semi-definite programming

Shreepriya Das Haris Vikalo 《BMC genomics》2015,16(1)

Background

The goal of haplotype assembly is to infer haplotypes of an individual from a mixture of sequenced chromosome fragments. Limited lengths of paired-end sequencing reads and inserts render haplotype assembly computationally challenging; in fact, most of the problem formulations are known to be NP-hard. Dimensions (and, therefore, difficulty) of the haplotype assembly problems keep increasing as the sequencing technology advances and the length of reads and inserts grow. The computational challenges are even more pronounced in the case of polyploid haplotypes, whose assembly is considerably more difficult than in the case of diploids. Fast, accurate, and scalable methods for haplotype assembly of diploid and polyploid organisms are needed.

Results

We develop a novel framework for diploid/polyploid haplotype assembly from high-throughput sequencing data. The method formulates the haplotype assembly problem as a semi-definite program and exploits its special structure – namely, the low rank of the underlying solution – to solve it rapidly and with high accuracy. The developed framework is applicable to both diploid and polyploid species. The code for SDhaP is freely available at https://sourceforge.net/projects/sdhap.

Conclusion

Extensive benchmarking tests on both real and simulated data show that the proposed algorithms outperform several well-known haplotype assembly methods in terms of either accuracy or speed or both. Useful recommendations for coverages needed to achieve near-optimal solutions are also provided.

4.

HapCUT: an efficient and accurate algorithm for the haplotype assembly problem

Bansal V Bafna V 《Bioinformatics (Oxford, England)》2008,24(16):i153-i159

5.

Feature-by-feature--evaluating de novo sequence assembly

Vezzi F Narzisi G Mishra B 《PloS one》2012,7(2):e31002

The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated datasets, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the "excess-dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art simulators, leads to not-so-realistic results.
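For readers unfamiliar with the N50 metric criticized in this abstract, a minimal sketch of its usual definition follows: the length of the contig at which the running total of contig lengths, taken from longest to shortest, first reaches half of the total assembly length. The function name and the example lengths are illustrative and not taken from the paper.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths.

    Contigs are visited from longest to shortest; N50 is the length of the
    contig at which the cumulative length first reaches half of the total.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Total length is 300, so half is 150; 100 + 80 = 180 reaches it at the 80 contig.
print(n50([100, 80, 50, 30, 20, 10, 10]))  # 80
```

The computation uses contig sizes only, which is the limitation the abstract points out: a misassembled contig raises N50 just as much as a correct one of the same length.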

6.

Hap-seqX: Expedite algorithm for haplotype phasing with imputation using sequence data

Dan He Eleazar Eskin 《Gene》2013

Haplotype phasing is one of the most important problems in population genetics, as haplotypes can be used to estimate the relatedness of individuals and to impute genotype information, which is a commonly performed analysis when searching for variants involved in disease. The problem of haplotype phasing has been well studied. Methodologies for haplotype inference from sequencing data either combine a set of reference haplotypes and collected genotypes using a Hidden Markov Model or assemble haplotypes by overlapping sequencing reads. A recent algorithm, Hap-seq, uses both sequencing data and reference haplotypes; it is a hybrid of a dynamic programming algorithm and a Hidden Markov Model (HMM), and is shown to be optimal. However, the algorithm requires an extremely large amount of memory, which is not practical for whole genome datasets. The current algorithm requires saving intermediate results to disk and reading them back when needed, which significantly affects its practicality. In this work, we propose an expedited version of the algorithm, Hap-seqX, which addresses the memory issue by using a posterior probability to select the records that should be saved in memory. We show that Hap-seqX can save all the intermediate results in memory and improves the execution time of the algorithm dramatically. Using this strategy, Hap-seqX is able to predict haplotypes from whole genome sequencing data.

7.

Targeted assembly of short sequence reads

Warren RL Holt RA 《PloS one》2011,6(5):e19816

As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy to use tool for targeted assembly.
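The read-recruitment step described in this abstract (keeping only reads that share an exact short word with a target sequence) can be sketched as below. This is not TASR's code: the function names, the word length `k`, and the example sequences are assumptions made for illustration, and the sketch ignores reverse complements and the subsequent stringent assembly.

```python
def kmers(seq, k):
    """Return the set of all length-k words (k-mers) in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(reads, target, k):
    """Keep reads that share at least one exact k-mer with the target.

    Only the recruited reads would be passed on to a local assembler,
    so the bulk of the data set is never assembled or aligned.
    """
    target_words = kmers(target, k)
    return [read for read in reads if kmers(read, k) & target_words]


target = "ACGTACGGA"
reads = ["TTACGTAC", "GGGGGGGG", "TACGGATT"]
print(recruit_reads(reads, target, k=5))  # ['TTACGTAC', 'TACGGATT']
```

Hashing the target's words once and testing each read against that set keeps the cost of the search roughly linear in the size of the read data, which fits the abstract's emphasis on very large data sets.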

8.

Limitations of next-generation genome sequence assembly (total citations: 1; self-citations: 0; citations by others: 1)

Alkan C Sajjadian S Eichler EE 《Nature methods》2011,8(1):61-65

High-throughput sequencing technologies promise to transform the fields of genetics and comparative biology by delivering tens of thousands of genomes in the near future. Although it is feasible to construct de novo genome assemblies in a few months, there has been relatively little attention to what is lost by sole application of short sequence reads. We compared the recent de novo assemblies using the short oligonucleotide analysis package (SOAP), generated from the genomes of a Han Chinese individual and a Yoruban individual, to experimentally validated genomic features. We found that de novo assemblies were 16.2% shorter than the reference genome and that 420.2 megabase pairs of common repeats and 99.1% of validated duplicated sequences were missing from the genome. Consequently, over 2,377 coding exons were completely missing. We conclude that high-quality sequencing approaches must be considered in conjunction with high-throughput sequencing for comparative genomics analyses and studies of genome evolution.

9.

HLA class II haplotype and sequence analysis support a role for DQ in narcolepsy

Michael C. Ellis A. H. Hetisimer David A. Ruddy Sherry L. Hansen Gregory S. Kronmal Erin McClelland Leah Quintana D. T. Drayna Michael S. Aldrich E. Mignot 《Immunogenetics》1997,46(5):410-417

A systematic haplotype and sequencing analysis of the HLA-DR and -DQ region in patients with narcolepsy was performed. Five new (CA)n microsatellite markers were generated and positioned on the physical map across the HLA-DQB1-DQA1-DRB1 interval. Haplotypes for these new markers and the three HLA loci were established using somatic cell hybrids generated from patients. A four-marker haplotype surrounding the DQB1*0602 gene was found in all narcolepsy patients, and was identical to haplotypes observed on random chromosomes harboring the DQB1*0602 allele. Eighty-six kilobases of contiguous genomic sequence across the region did not reveal new genes, and analysis of this sequence for single nucleotide polymorphisms did not reveal sequence variation among DQB1*0602 chromosomes. These results are consistent with other studies, suggesting that the HLA-DQ genes themselves are among the predisposing factors in narcolepsy. Received: 18 March 1997 / Revised: 23 April 1997

10.

A new DNA sequence assembly program. (total citations: 52; self-citations: 3; citations by others: 49)

J K Bonfield K F Smith R Staden 《Nucleic acids research》1995,23(24):4992-4999

We describe the Genome Assembly Program (GAP), a new program for DNA sequence assembly. The program is suitable for large and small projects, a variety of strategies and can handle data from a range of sequencing instruments. It retains the useful components of our previous work, but includes many novel ideas and methods. Many of these methods have been made possible by the program's completely new, and highly interactive, graphical user interface. The program provides many visual clues to the current state of a sequencing project and allows users to interact in intuitive and graphical ways with their data. The program has tools to display and manipulate the various types of data that help to solve and check difficult assemblies, particularly those in repetitive genomes. We have introduced the following new displays: the Contig Selector, the Contig Comparator, the Template Display, the Restriction Enzyme Map and the Stop Codon Map. We have also made it possible to have any number of Contig Editors and Contig Joining Editors running simultaneously even on the same contig. The program also includes a new 'Directed Assembly' algorithm and routines for automatically detecting unfinished segments of sequence, to which it suggests experimental solutions.

11.

Effective scoring function for protein sequence design

Liang S Grishin NV 《Proteins》2004,54(2):271-281

We have developed an effective scoring function for protein design. The atomic solvation parameters, together with the weights of energy terms, were optimized so that residues corresponding to the native sequence were predicted with low energy in the training set of 28 protein structures. The solvation energy of non-hydrogen-bonded hydrophilic atoms was considered separately and expressed in a nonlinear way. As a result, our scoring function predicted native residues as the most favorable in 59% of the total positions in 28 proteins. We then tested the scoring function by comparing the predicted stability changes for 103 T4 lysozyme mutants with the experimental values. The correlation coefficients were 0.77 for surface mutations and 0.71 for all mutations. Finally, the scoring function combined with Monte Carlo simulation was used to predict favorable sequences on a fixed backbone. The designed sequences were similar to the natural sequences of the family to which the template structure belonged. The profile of the designed sequences was helpful for identification of remote homologues of the native sequence.

12.

Genome sequence and assembly of Bos indicus

Canavez FC Luche DD Stothard P Leite KR Sousa-Canavez JM Plastow G Meidanis J Souza MA Feijao P Moore SS Camara-Lopes LH 《The Journal of heredity》2012,103(3):342-348

Cattle are divided into 2 groups referred to as taurine and indicine, both of which have been under strong artificial selection due to their importance for human nutrition. A side effect of this domestication is a loss of genetic diversity within each specialized breed. Recently, the first taurine genome was sequenced and assembled, allowing for a better understanding of this ruminant species. However, genetic information from indicine breeds has been limited. Here, we present the first genome sequence of an indicine breed (Nellore), generated with 52X coverage by the SOLiD sequencing platform. As expected, both genomes share high similarity at the nucleotide level for all autosomes and the X chromosome. Regarding the Y chromosome, the homology was considerably lower, most likely due to incomplete assembly of the taurine Y chromosome. We were also able to cover 97% of the annotated taurine protein-coding genes.

13.

A novel HLA-DR β I sequence from the DRw11 haplotype

Viktor Steimle Ari Hinkkanen Michael Schlesier Joerg T. Epplen 《Immunogenetics》1988,28(3):208-210

14.

Effective cluster-based seed design for cross-species sequence comparisons

Zhou L Mihai I Florea L 《Bioinformatics (Oxford, England)》2008,24(24):2926-2927

SUMMARY: To annotate newly sequenced organisms, cross-species sequence comparison algorithms can be applied to align gene sequences to the genome of a related species. To improve the accuracy of alignment, spaced seeds must be optimized for each comparison. As the number and diversity of genomes increase, an efficient alternative is to cluster pairwise comparisons into groups and identify seeds for groups instead of individual comparisons. Here we investigate a measure of comparison closeness and identify classes of comparisons that show similar seed behavior and therefore can employ the same seed. AVAILABILITY: Source code is freely available at http://dna.cs.gwu.edu and from Bioinformatics online.

15.

Sequence assembly algorithms for next-generation genome sequencing technologies

逯雯雯 卢志远 王亚旭 孙啸 《生物信息学》2010,8(3):248-253

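With the development of next-generation sequencing technologies, new assembly algorithms have emerged. This paper introduces the basic principles and concrete steps of several new assembly algorithms that are widely recognized internationally, and analyzes the strengths, weaknesses, and scope of application of each. The performance of SSAKE, VCAKE, SHARCGS, and velvet is tested on Illumina 1G sequencing data from Helicobacter acinonychis, and prospects for future research on assembly algorithms are discussed.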

16.

Coiled-coil assembly by peptides with non-heptad sequence motifs

《Folding and Design》1997,2(3):149-158

17.

Optimal reference sequence selection for genome assembly using minimum description length principle

Bilal Wajid Erchin Serpedin Mohamed Nounou Hazem Nounou 《EURASIP Journal on Bioinformatics and Systems Biology》2012,2012(1):18

18.

Equipment selection and task assignment for multiproduct assembly system design

Stephen C. Graves Carol Holmes Redfield 《Flexible Services and Manufacturing Journal》1988,1(1):31-50

A multiproduct assembly system produces a family of similar products, where the assembly of each product entails an ordered set of tasks. An assembly system consists of a sequence of workstations. For each workstation, we assign a subset of the assembly tasks to be performed at the workstation and select the type of assembly equipment or resource to be used by the workstation. The assembly of each product requires a visit to each workstation in the fixed sequence. The problem of system design is to find the system that is capable of producing all the products in the desired volumes at minimum cost. The system cost includes the fixed capital costs for the assembly equipment and tools and the variable operating costs for the various workstations. We present and illustrate an optimization procedure that assigns tasks to workstations and selects assembly equipment for each workstation.

19.

A full-automatic sequence design algorithm for branched DNA structures

Seiffert J Huhle A 《Journal of biomolecular structure & dynamics》2008,25(5):453-466

Production of various structures by self-assembling single-stranded DNA molecules is a widely used technology in the field of DNA nanotechnology. The base sequences of the single strands determine the shape of the resulting nanostructure. Therefore, sequence design is crucial for successful structure fabrication. This paper presents a sequence design algorithm based on mismatch minimization that can be applied to any desired DNA structure. With this algorithm, junctions, loops, single- as well as double-stranded regions, and very large structures of up to several thousand base pairs can be handled. The algorithm is fast for most structures. It is implemented in Java; the implementation is called Seed and is publicly available. As an example of successful sequence generation, this paper also presents the fabrication of DNA chain molecules consisting of double-crossover (DX) tiles.

20.

Nucleotide sequence of a rabbit IgG heavy chain from the recombinant F-I haplotype (total citations: 1; self-citations: 0; citations by others: 1)

Kenneth E. Bernstein Cornelius B. Alexander Rose G. Mage 《Immunogenetics》1983,18(4):387-397

We report the sequence of a cDNA encoding a rabbit immunoglobulin gamma heavy chain of d12 and e14 allotypes with high homology to partial cDNA sequences from rabbits of d11 and e15 allotypes. The encoded rabbit protein shows homologies with human (68-70%) and mouse (60-63%) gamma chains. The nucleotide sequence homologies of the CH domains range from 76-84% with human and 64-76% with mouse sequences. Comparison of the portion of VH encoding amino acid positions 34-112 with a previously determined VH sequence of the same allotype shows high conservation of sequences in the second and third framework segments but more marked differences both in length and encoded amino acids of the second and third complementarity-determining regions (CDRs). We also found a high degree of homology with a human genomic V-region, VH26 (77%) and a remarkable similarity between rabbit and human second CDR sequences and human genomic D minigenes. These results provide additional evidence that D minigene sequences share information with the CDR2 portion of VH regions.