首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.  相似文献   

2.
Extending assembly of short DNA sequences to handle error   总被引:2,自引:0,他引:2  
Inexpensive de novo genome sequencing, particularly in organisms with small genomes, is now possible using several new sequencing technologies. Some of these technologies such as that from Illumina's Solexa Sequencing, produce high genomic coverage by generating a very large number of small reads ( approximately 30 bp). While prior work shows that partial assembly can be performed by k-mer extension in error-free reads, this algorithm is unsuccessful with the sequencing error rates found in practice. We present VCAKE (Verified Consensus Assembly by K-mer Extension), a modification of simple k-mer extension that overcomes error by using high depth coverage. Though it is a simple modification of a previous approach, we show significant improvements in assembly results on simulated and experimental datasets that include error. AVAILABILITY: http://152.2.15.114/~labweb/VCAKE  相似文献   

3.
The whole genome shotgun approach to genome sequencing results in a collection of contigs that must be ordered and oriented to facilitate efficient gap closure. We present a new tool OSLay that uses synteny between matching sequences in a target assembly and a reference assembly to layout the contigs (or scaffolds) in the target assembly. The underlying algorithm is based on maximum weight matching. The tool provides an interactive visualization of the computed layout and the result can be imported into the assembly editing tool Consed to support the design of primer pairs for gap closure. MOTIVATION: To enhance efficiency in the gap closure phase of a genome project it is crucial to know which contigs are adjacent in the target genome. Related genome sequences can be used to layout contigs in an assembly. AVAILABILITY: OSLay is freely available from: http://www-ab.informatik.unituebingen.de/software/oslay.  相似文献   

4.
Mixed-model assembly lines are becoming increasingly interesting as the use of just-in-time principles in manufacturing continues to expand. This paper presents, for the first time, complexity results for the underlying problem of product sequencing. It is shown that the problem is intractable for both the single station and the multiple station case. Nevertheless, efficient 1.5-approximation algorithms are developed for the early- and late-start models associated with the former case. Empirical results demonstrate that these algorithms perform extremely well in practice.  相似文献   

5.
Background: Next-generation sequencing (NGS) technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. However, numerous technical or computational challenges in de novo assembly still remain, although many new ideas and solutions have been suggested to tackle the challenges in both experimental and computational settings.Results: In this review, we first briefly introduce some of the major challenges faced by NGS sequence assembly. Then, we analyze the characteristics of various sequencing platforms and their impact on assembly results. After that, we classify de novo assemblers according to their frameworks (overlap graph-based, de Bruijn graph-based and string graph-based), and introduce the characteristics of each assembly tool and their adaptation scene. Next, we introduce in detail the solutions to the main challenges of de novo assembly of next generation sequencing data, single-cell sequencing data and single molecule sequencing data. At last, we discuss the application of SMS long reads in solving problems encountered in NGS assembly.Conclusions: This review not only gives an overview of the latest methods and developments in assembly algorithms, but also provides guidelines to determine the optimal assembly algorithm for a given input sequencing data type.  相似文献   

6.
Although recursion has been hypothesized to be a necessary capacity for the evolution of language, the multiplicity of definitions being used has undermined the broader interpretation of empirical results. I propose that only a definition focused on representational abilities allows the prediction of specific behavioural traits that enable us to distinguish recursion from non-recursive iteration and from hierarchical embedding: only subjects able to represent recursion, i.e. to represent different hierarchical dependencies (related by parenthood) with the same set of rules, are able to generalize and produce new levels of embedding beyond those specified a priori (in the algorithm or in the input). The ability to use such representations may be advantageous in several domains: action sequencing, problem-solving, spatial navigation, social navigation and for the emergence of conventionalized communication systems. The ability to represent contiguous hierarchical levels with the same rules may lead subjects to expect unknown levels and constituents to behave similarly, and this prior knowledge may bias learning positively. Finally, a new paradigm to test for recursion is presented. Preliminary results suggest that the ability to represent recursion in the spatial domain recruits both visual and verbal resources. Implications regarding language evolution are discussed.  相似文献   

7.
Hierarchical shotgun sequencing remains the method of choice for assembling high‐quality reference sequences of complex plant genomes. The efficient exploitation of current high‐throughput technologies and powerful computational facilities for large‐insert clone sequencing necessitates the sequencing and assembly of a large number of clones in parallel. We developed a multiplexed pipeline for shotgun sequencing and assembling individual bacterial artificial chromosomes (BACs) using the Illumina sequencing platform. We illustrate our approach by sequencing 668 barley BACs (Hordeum vulgare L.) in a single Illumina HiSeq 2000 lane. Using a newly designed parallelized computational pipeline, we obtained sequence assemblies of individual BACs that consist, on average, of eight sequence scaffolds and represent >98% of the genomic inserts. Our BAC assemblies are clearly superior to a whole‐genome shotgun assembly regarding contiguity, completeness and the representation of the gene space. Our methods may be employed to rapidly obtain high‐quality assemblies of a large number of clones to assemble map‐based reference sequences of plant and animal species with complex genomes by sequencing along a minimum tiling path.  相似文献   

8.
Flexible Assembly Systems (FASs), which form an important subset of modern manufacturing systems, are finding increasing use in today's industry. In the planning and design phase of these systems, it is useful to have tools that predict system performance for various operating conditions. In this article, we present such a performance analysis tool based on queueing approximation for a class of FASs, namely, closed-loop flexible assembly systems (CL-FASs). For CL-FASs, we describe iterative algorithms for computing steady-state performance measures, including production rate and station utilizations. These algorithms are computationally simple and have a fast convergence rate. We derive a new approximation to correct the mean delay at each queue. This improves the accuracy of performance prediction, especially in the case of small CL-FASs. Comparisons with simulation results indicate that the approximation technique is reasonably accurate for a broad range of parameter values and system sizes. This makes possible efficient (fast and computationally inexpensive) analysis of CL-FASs under various conditions.  相似文献   

9.
A simple protocol for rapid assembly of chemically synthesized deoxyoligonucleotides into double stranded DNA is described. Several parameters of a ligation-free method were investigated to allow efficient assembly of a large number of oligonucleotides into double stranded DNA by polymerase chain reaction. Synthesis of a 701 bp DNA was carried out in a single reaction by assembling 28 oligonucleotides designed with partial overlaps at complementary ends. An estimate of error rate was made by sequencing several independent clones of the synthesized DNA  相似文献   

10.
As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. With well-established methodologies, larger and more complex genomes are being tackled, including polyploid plant genomes. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. Unfortunately, such collapse is often not ideal, as keeping contigs separate can lead both to improved assembly and also insights about how haplotypes influence phenotype. Here, we describe a first step in avoiding inappropriate collapse during assembly. In particular, we describe ConPADE (Contig Ploidy and Allele Dosage Estimation), a probabilistic method that estimates the ploidy of any given contig/scaffold based on its allele proportions. In the process, we report findings regarding errors in sequencing. The method can be used for whole genome shotgun (WGS) sequencing data. We also show applicability of the method for variant calling and allele dosage estimation. Results for simulated and real datasets are discussed and provide evidence that ConPADE performs well as long as enough sequencing coverage is available, or the true contig ploidy is low. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.  相似文献   

11.
Extracting a subset of informative genes from microarray expression data is a critical data preparation step in cancer classification and other biological function analyses. Though many algorithms have been developed, the Support Vector Machine - Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. It assumes that a smaller "filter-out" factor in the SVM-RFE, which results in a smaller number of gene features eliminated in each recursion, should lead to extraction of a better gene subset. Because the SVM-RFE is highly sensitive to the "filter-out" factor, our simulations have shown that this assumption is not always correct and that the SVM-RFE is an unstable algorithm. To select a set of key gene features for reliable prediction of cancer types or subtypes and other applications, a new two-stage SVM-RFE algorithm has been developed. It is designed to effectively eliminate most of the irrelevant, redundant and noisy genes while keeping information loss small at the first stage. A fine selection for the final gene subset is then performed at the second stage. The two-stage SVM-RFE overcomes the instability problem of the SVM-RFE to achieve better algorithm utility. We have demonstrated that the two-stage SVM-RFE is significantly more accurate and more reliable than the SVM-RFE and three correlation-based methods based on our analysis of three publicly available microarray expression datasets. Furthermore, the two-stage SVM-RFE is computationally efficient because its time complexity is O(d*log(2)d}, where d is the size of the original gene set.  相似文献   

12.
Yoon  Byung-Jun  Qian  Xiaoning  Kahveci  Tamer  Pal  Ranadip 《BMC genomics》2020,21(9):1-3
Background

Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.

Results

We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet – a novel assembly algorithm for diploid and ployploid haplotypes which allows both bialleleic and multi-allelic variants.

Conclusions

Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum-Tuberosum (Potato). The results demonstrate the efficacy of the proposed method and that it compares favorably with the existing techniques.

  相似文献   

13.
A divide-and-conquer approach to fragment assembly   总被引:1,自引:0,他引:1  
MOTIVATION: One of the major problems in DNA sequencing is assembling the fragments obtained by shotgun sequencing. Most existing fragment assembly techniques follow the overlap-layout-consensus approach. This framework requires extensive computation in each phase and becomes inefficient with increasing number of fragments. RESULTS: We propose a new algorithm which solves the overlap, layout, and consensus phases simultaneously. The fragments are clustered with respect to their Average Mutual Information (AMI) profiles using the k-means algorithm. This removes the unnecessary burden of considering the collection of fragments as a whole. Instead, the orientation and overlap detection are solved efficiently, within the clusters. The algorithm has successfully reconstructed both artificial and real data. Availability: Available on request from the authors.  相似文献   

14.
A manual Edman technique is described which allows sequential quantitative determination of from 3 to 10 amino terminal residues on quantities of peptides or proteins down to one nanomole. This is achieved by a fast, efficient method of obtaining the anilinothiazolinone or phenylthiohydantoin amino acid, and quantitating by either back hydrolysis and amino acid analysis or by a new, rapid, high resolution, quasi-isocratic, high-pressure liquid chromatographic procedure. The overall method has been extensively tested successfully on both peptides and proteins of known and unknown amino-terminal sequence and the results included here. In addition, a wide variety of applications relevant to primary structure analysis such as sequencing blocked polypeptides, use of denaturing agents as coupling buffers, reduction of protein or peptide losses on consecutive sequencing and peptide mixture analysis are all incorporated in the methodology outlined.  相似文献   

15.
For the identification of novel proteins using MS/MS, de novo sequencing software computes one or several possible amino acid sequences (called sequence tags) for each MS/MS spectrum. Those tags are then used to match, accounting amino acid mutations, the sequences in a protein database. If the de novo sequencing gives correct tags, the homologs of the proteins can be identified by this approach and software such as MS-BLAST is available for the matching. However, de novo sequencing very often gives only partially correct tags. The most common error is that a segment of amino acids is replaced by another segment with approximately the same masses. We developed a new efficient algorithm to match sequence tags with errors to database sequences for the purpose of protein and peptide identification. A software package, SPIDER, was developed and made available on Internet for free public use. This paper describes the algorithms and features of the SPIDER software.  相似文献   

16.
随着新一代测序技术的发展,新的拼接算法应运而生。介绍了目前国际上广泛认可的几种新的拼接算法的基本原理与具体步骤,分析每种算法的优缺点以及适用范围。用Helicobacter acinonychis的Illumina 1G测序数据检测SSAKE,VCAKE,SHARCGS以及velvet的性能,并对未来拼接算法的研究提出展望。  相似文献   

17.
Xing J  Wang H  Oster G 《Biophysical journal》2005,89(3):1551-1563
Two theoretical formalisms are widely used in modeling mechanochemical systems such as protein motors: continuum Fokker-Planck models and discrete kinetic models. Both have advantages and disadvantages. Here we present a "finite volume" procedure to solve Fokker-Planck equations. The procedure relates the continuum equations to a discrete mechanochemical kinetic model while retaining many of the features of the continuum formulation. The resulting numerical algorithm is a generalization of the algorithm developed previously by Fricks, Wang, and Elston through relaxing the local linearization approximation of the potential functions, and a more accurate treatment of chemical transitions. The new algorithm dramatically reduces the number of numerical cells required for a prescribed accuracy. The kinetic models constructed in this fashion retain some features of the continuum potentials, so that the algorithm provides a systematic and consistent treatment of mechanical-chemical responses such as load-velocity relations, which are difficult to capture with a priori kinetic models. Several numerical examples are given to illustrate the performance of the method.  相似文献   

18.
Scaffolding, i.e. ordering and orienting contigs is an important step in genome assembly. We present a method for scaffolding using second generation sequencing reads based on likelihoods of genome assemblies. A generative model for sequencing is used to obtain maximum likelihood estimates of gaps between contigs and to estimate whether linking contigs into scaffolds would lead to an increase in the likelihood of the assembly. We then link contigs if they can be unambiguously joined or if the corresponding increase in likelihood is substantially greater than that of other possible joins of those contigs. The method is implemented in a tool called Swalo with approximations to make it efficient and applicable to large datasets. Analysis on real and simulated datasets reveals that it consistently makes more or similar number of correct joins as other scaffolders while linking very few contigs incorrectly, thus outperforming other scaffolders and demonstrating that substantial improvement in genome assembly may be achieved through the use of statistical models. Swalo is freely available for download at https://atifrahman.github.io/SWALO/.  相似文献   

19.
MOTIVATION: A consensus sequence for a family of related sequences is, as the name suggests, a sequence that captures the features common to most members of the family. Consensus sequences are important in various DNA sequencing applications and are a convenient way to characterize a family of molecules. RESULTS: This paper describes a new algorithm for finding a consensus sequence, using the popular optimization method known as simulated annealing. Unlike the conventional approach of finding a consensus sequence by first forming a multiple sequence alignment, this algorithm searches for a sequence that minimises the sum of pairwise distances to each of the input sequences. The resulting consensus sequence can then be used to induce a multiple sequence alignment. The time required by the algorithm scales linearly with the number of input sequences and quadratically with the length of the consensus sequence. We present results demonstrating the high quality of the consensus sequences and alignments produced by the new algorithm. For comparison, we also present similar results obtained using ClustalW. The new algorithm outperforms ClustalW in many cases.  相似文献   

20.
A streaming assembly pipeline utilising real-time Oxford Nanopore Technology (ONT) sequencing data is important for saving sequencing resources and reducing time-to-result. A previous approach implemented in npScarf provided an efficient streaming algorithm for hybrid assembly but was relatively prone to mis-assemblies compared to other graph-based methods. Here we present npGraph, a streaming hybrid assembly tool using the assembly graph instead of the separated pre-assembly contigs. It is able to produce more complete genome assembly by resolving the path finding problem on the assembly graph using long reads as the traversing guide. Application to synthetic and real data from bacterial isolate genomes show improved accuracy while still maintaining a low computational cost. npGraph also provides a graphical user interface (GUI) which provides a real-time visualisation of the progress of assembly. The tool and source code is available at https://github.com/hsnguyen/assembly.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号