Similar literature
Found 20 similar articles; search took 46 ms.
1.
Advances in sequencing technologies have led to the increased use of high throughput sequencing in characterizing the microbial communities associated with our bodies and our environment. Critical to the analysis of the resulting data are sequence assembly algorithms able to reconstruct genes and organisms from complex mixtures. Metagenomic assembly involves new computational challenges due to the specific characteristics of the metagenomic data. In this survey, we focus on major algorithmic approaches for genome and metagenome assembly, and discuss the new challenges and opportunities afforded by this new field. We also review several applications of metagenome assembly in addressing interesting biological problems.

2.
High quality lane-tracking of gel images is the first task, and thus a prerequisite, for successful trace processing and base-calling of DNA sequencing slab gels. In most approaches, it is based on statistical calculations, for instance variance and co-variance analysis between neighboring pixel columns in the image. On the basis of these statistical calculations, Kohonen's self-organization neural network model was introduced. We have found that, using several well-structured input data, Kohonen's self-organization neural network model can be trained to fulfill our task of lane-tracking. Furthermore, the quality of lane-tracking could be improved compared to algorithmic approaches.

3.
We present an original approach to identifying sequence variants in a mixed DNA population from sequence trace data. The heart of the method is based on parsimony: given a wildtype DNA sequence, a set of observed variations at each position collected from sequencing data, and a complete catalog of all possible mutations, determine the smallest set of mutations from the catalog that could fully explain the observed variations. The algorithmic complexity of the problem is analyzed for several classes of mutations, including block substitutions, single-range deletions, and single-range insertions. The reconstruction problem is shown to be NP-complete for single-range insertions and deletions, while for block substitutions, single character insertion, and single character deletion mutations, polynomial time algorithms are provided. Once a minimum set of mutations compatible with the observed sequence is found, the relative frequency of those mutations is recovered by solving a system of linear equations. Simulation results show the algorithm successfully deconvolving mutations in p53 known to cause cancer. An extension of the algorithm is proposed as a new method of high throughput screening for single nucleotide polymorphisms by multiplexing DNA.
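
The parsimony step described in this abstract can be illustrated with a small brute-force search. This is only a sketch under stated assumptions, not the authors' algorithm: the function name `smallest_explaining_set`, the catalog layout, and the rule that a mutation set "explains" the data only when its combined effects equal the observed variations exactly are all hypothetical choices made for illustration. The search is exponential in the catalog size, so it is suitable for toy inputs only.

```python
from itertools import combinations


def smallest_explaining_set(observed, catalog):
    """Return the smallest list of mutation names that explains `observed`.

    observed: set of (position, base) variations seen in the trace data.
    catalog:  dict mapping a mutation name to the set of (position, base)
              variations that mutation would produce (hypothetical layout).
    Returns None when no subset of the catalog reproduces `observed` exactly.
    """
    names = sorted(catalog)
    # Try subsets in order of increasing size, so the first hit is minimal.
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            explained = set().union(*(catalog[name] for name in subset))
            if explained == observed:
                return list(subset)
    return None


# Toy catalog: one block substitution covers two positions at once.
catalog = {
    "sub_A": {(3, "T")},
    "sub_B": {(7, "G")},
    "block_AB": {(3, "T"), (7, "G")},
    "sub_C": {(9, "C")},
}
observed = {(3, "T"), (7, "G"), (9, "C")}
print(smallest_explaining_set(observed, catalog))
```

In the toy catalog the block substitution replaces two single substitutions, so a two-mutation explanation is preferred over a three-mutation one.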

4.
SUMMARY: We recently developed algorithmic tools for the identification of functionally important regions in proteins of known three dimensional structure by estimating the degree of conservation of the amino-acid sites among their close sequence homologues. Projecting the conservation grades onto the molecular surface of these proteins reveals patches of highly conserved (or occasionally highly variable) residues that are often of important biological function. We present a new web server, ConSurf, which automates these algorithmic tools. ConSurf may be used for high-throughput characterization of functional regions in proteins. AVAILABILITY: The ConSurf web server is available at: http://consurf.tau.ac.il. SUPPLEMENTARY INFORMATION: A set of examples is available at http://consurf.tau.ac.il under 'GALLERY'.
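
To make the idea of a per-site conservation grade concrete, the sketch below scores each column of a multiple sequence alignment with normalized Shannon entropy. This is a swapped-in, simpler measure chosen for illustration; the abstract does not state how the server computes its grades, and the function name `column_conservation` and the gap character `-` are assumptions.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Score each alignment column from 0 (most variable) to 1 (fully conserved).

    alignment: list of equal-length aligned sequences; '-' marks a gap.
    The score is 1 minus the column's Shannon entropy divided by log2(20),
    the maximum entropy for 20 amino acids.
    """
    max_entropy = math.log2(20)
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != "-"]
        if not residues:
            # A column of gaps carries no conservation signal.
            scores.append(0.0)
            continue
        total = len(residues)
        entropy = -sum(
            (count / total) * math.log2(count / total)
            for count in Counter(residues).values()
        )
        scores.append(1.0 - entropy / max_entropy)
    return scores


scores = column_conservation(["ACD", "ACE", "AGF"])
print(scores)
```

Columns with a single residue type score highest; such scores could then be mapped onto the residues of a known structure to look for conserved surface patches.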

5.
Library preparation protocols for most sequencing technologies involve PCR amplification of the template DNA, which opens the possibility that a given template DNA molecule is sequenced multiple times. Reads arising from this phenomenon, known as PCR duplicates, inflate the cost of sequencing and can jeopardize the reliability of affected experiments. Despite the pervasiveness of this artefact, our understanding of its causes and of its impact on downstream statistical analyses remains essentially empirical. Here, we develop a general quantitative model of amplification distortions in sequencing data sets, which we leverage to investigate the factors controlling the occurrence of PCR duplicates. We show that the PCR duplicate rate is determined primarily by the ratio between library complexity and sequencing depth, and that amplification noise (including in its dependence on the number of PCR cycles) only plays a secondary role for this artefact. We confirm our predictions using new and published RAD-seq libraries and provide a method to estimate library complexity and amplification noise in any data set containing PCR duplicates. We discuss how amplification-related artefacts impact downstream analyses, and in particular genotyping accuracy. The proposed framework unites the numerous observations made on PCR duplicates and will be useful to experimenters of all sequencing technologies where DNA availability is a concern.
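
The claim that the duplicate rate depends mainly on the ratio of library complexity to sequencing depth can be illustrated with a deliberately simplified model. The sketch below assumes reads are drawn uniformly and independently from the unique molecules (a Poisson approximation) and ignores amplification noise entirely; it is not the authors' model, and the function name `expected_duplicate_rate` is hypothetical.

```python
import math


def expected_duplicate_rate(library_complexity, sequencing_depth):
    """Expected fraction of reads that are PCR duplicates.

    library_complexity: number of unique template molecules in the library.
    sequencing_depth:   number of reads sequenced.
    Assumes uniform sampling with replacement, so the expected number of
    distinct molecules observed is complexity * (1 - exp(-depth / complexity)).
    """
    if sequencing_depth <= 0:
        return 0.0
    ratio = sequencing_depth / library_complexity
    expected_unique = library_complexity * (1.0 - math.exp(-ratio))
    return 1.0 - expected_unique / sequencing_depth


# The same depth-to-complexity ratio gives the same duplicate rate.
print(expected_duplicate_rate(1_000_000, 2_000_000))
print(expected_duplicate_rate(500_000, 1_000_000))
```

Under this approximation the rate is a function of the ratio alone, which is why the two calls above agree even though the absolute numbers differ.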

6.
Yoon Byung-Jun, Qian Xiaoning, Kahveci Tamer, Pal Ranadip. BMC Genomics, 2020, 21(9): 1-3
Background

Haplotypes, the ordered lists of single nucleotide variations that distinguish chromosomal sequences from their homologous pairs, may reveal an individual’s susceptibility to hereditary and complex diseases and affect how our bodies respond to therapeutic drugs. Reconstructing haplotypes of an individual from short sequencing reads is an NP-hard problem that becomes even more challenging in the case of polyploids. While increasing lengths of sequencing reads and insert sizes helps improve accuracy of reconstruction, it also exacerbates computational complexity of the haplotype assembly task. This has motivated the pursuit of algorithmic frameworks capable of accurate yet efficient assembly of haplotypes from high-throughput sequencing data.

Results

We propose a novel graphical representation of sequencing reads and pose the haplotype assembly problem as an instance of community detection on a spatial random graph. To this end, we construct a graph where each read is a node with an unknown community label associating the read with the haplotype it samples. Haplotype reconstruction can then be thought of as a two-step procedure: first, one recovers the community labels on the nodes (i.e., the reads), and then uses the estimated labels to assemble the haplotypes. Based on this observation, we propose ComHapDet, a novel assembly algorithm for diploid and polyploid haplotypes that allows both biallelic and multi-allelic variants.

Conclusions

Performance of the proposed algorithm is benchmarked on simulated as well as experimental data obtained by sequencing Chromosome 5 of tetraploid biallelic Solanum tuberosum (potato). The results demonstrate the efficacy of the proposed method and show that it compares favorably with existing techniques.
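
The second step of the two-step procedure, assembling haplotypes once each read has a community label, can be sketched as a per-position majority vote. This is an illustrative assumption rather than the ComHapDet implementation: the function name `assemble_haplotypes`, the read representation (a dict from variant position to allele), and the use of plain majority voting are all hypothetical, and the label-recovery step is taken as already done.

```python
from collections import Counter, defaultdict


def assemble_haplotypes(reads, labels):
    """Build one consensus haplotype per community label by majority vote.

    reads:  list of dicts mapping variant position -> observed allele.
    labels: list of community labels, one per read (assumed already estimated).
    Returns a dict: label -> {position: most frequent allele among its reads}.
    """
    votes = defaultdict(lambda: defaultdict(Counter))
    for read, label in zip(reads, labels):
        for position, allele in read.items():
            votes[label][position][allele] += 1
    return {
        label: {
            position: counts.most_common(1)[0][0]
            for position, counts in sorted(by_position.items())
        }
        for label, by_position in votes.items()
    }


# Toy diploid example with three variant positions and five short reads.
reads = [{0: 0, 1: 1}, {1: 1, 2: 0}, {0: 1, 1: 0}, {1: 0, 2: 1}, {0: 0}]
labels = [0, 0, 1, 1, 0]
print(assemble_haplotypes(reads, labels))
```

Because alleles are stored as arbitrary values rather than bits, the same vote works for multi-allelic variants and for more than two labels (polyploids).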


7.
8.

Background

The use of novel algorithmic techniques is pivotal to many important problems in life science. For example, the sequencing of the human genome [1] would not have been possible without advanced assembly algorithms. However, owing to the high speed of technological progress and the urgent need for bioinformatics tools, there is a widening gap between state-of-the-art algorithmic techniques and the actual algorithmic components of tools that are in widespread use.

Results

To remedy this trend we propose the use of SeqAn, a library of efficient data types and algorithms for sequence analysis in computational biology. SeqAn comprises implementations of existing, practical state-of-the-art algorithmic components to provide a sound basis for algorithm testing and development. In this paper we describe the design and content of SeqAn and demonstrate its use by giving two examples. In the first example we show an application of SeqAn as an experimental platform by comparing different exact string matching algorithms. The second example is a simple version of the well-known MUMmer tool rewritten in SeqAn. Results indicate that our implementation is very efficient and versatile to use.

Conclusion

We anticipate that SeqAn greatly simplifies the rapid development of new bioinformatics tools by providing a collection of readily usable, well-designed algorithmic components which are fundamental for the field of sequence analysis. This not only facilitates the implementation of new algorithms, but also enables a sound analysis and comparison of existing algorithms.
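
The first example mentioned in this abstract, comparing exact string matching algorithms, can be sketched in a few lines. The code below is plain Python and does not use SeqAn or imitate its API; it simply contrasts a naive scan with the Horspool algorithm (one well-known exact matcher, chosen here as an assumption), and the function names are hypothetical.

```python
def naive_search(text, pattern):
    """Return all start positions of pattern in text by checking every offset."""
    m = len(pattern)
    if m == 0:
        return []
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def horspool_search(text, pattern):
    """Return all start positions of pattern in text using Horspool's shift rule."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for each character: distance from its rightmost occurrence
    # (excluding the last pattern position) to the end of the pattern.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Skip ahead based on the text character under the pattern's last cell.
        pos += shift.get(text[pos + m - 1], m)
    return hits


text = "GATTACAGATTACA"
print(naive_search(text, "ATTA"), horspool_search(text, "ATTA"))
```

Both functions return the same positions; an experimental comparison would time them on long texts and patterns of varying length.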

9.
Melibiose uptake and hydrolysis in E. coli is performed by the MelB and MelA proteins, respectively. We report the cloning and sequencing of the melA gene. The nucleotide sequence data showed that melA codes for a 450 amino acid long protein with a molecular weight of 50.6 kDa. The sequence data also supported the assumption that the mel locus forms an operon with melA in proximal position. A comparison of MelA with alpha-galactosidase proteins of yeast and human origin showed that these proteins have only limited homology, the yeast and human proteins being more related. However, regions common to all three proteins were found, indicating sequences that might comprise the active site of alpha-galactosidase.

10.
11.
This tutorial article introduces mass spectrometry (MS) for peptide fragmentation and protein identification. The current approaches being used for protein identification include top-down and bottom-up sequencing. Top-down sequencing, a relatively new approach that involves fragmenting intact proteins directly, is briefly introduced. Bottom-up sequencing, a traditional approach that fragments peptides in the gas phase after protein digestion, is discussed in more detail. The most widely used ion activation and dissociation process, gas-phase collision-activated dissociation (CAD), is discussed from a practical point of view. Infrared multiphoton dissociation (IRMPD) and electron capture dissociation (ECD) are introduced as two alternative dissociation methods. For spectral interpretation, the common fragment ion types in peptide fragmentation and their structures are introduced; the influence of instrumental methods on the fragmentation pathways and final spectra is discussed. A discussion is also provided on the complications in sample preparation for MS analysis. The final section of this article provides a brief review of recent research efforts on different algorithmic approaches being developed to improve protein identification searches.

12.
DNA sequencing by hybridization, potentially a powerful alternative to standard wet lab techniques, has received renewed interest after a novel probing scheme was recently proposed whose performance for the first time asymptotically meets the information theory bound. After settlement of the question of asymptotic performance, there remains the issue of algorithmic fine-tuning aimed at improving the performance "constants," with substantial practical implications. In this paper, we show that a probing scheme based on the joint use of direct and reverse spectra (tandem spectra) for a given gapped probing pattern achieves a performance improvement per unit of microarray area of about 5/4 and does not appear to be susceptible to further improvement by increasing the number of cooperating spectra. In other words, tandem-spectrum reconstruction is the best known technique for sequencing by hybridization.

13.
Sequences in public databases may contain a number of sequencing errors. A double binomial model describing the distribution of indel-excluded similarity coefficients (S) among repeatedly sequenced 16S rRNA was previously developed and it produced a confidence interval of S useful for testing sequence identity among sequences of 400-bp length. We characterized patterns in sequencing errors found in nearly complete 16S rRNA sequences of Vibrionaceae as highly variable in reported sequence length and containing a small number of indels. To accommodate these characteristics, a simple binomial model for distribution of the similarity coefficient (H) that included indels was derived from the double binomial model for S. The model showed good fit to empirical data. By using either a pre-determined or bootstrap-estimated standard probability of base matching, we were able to use the exact binomial test to determine the relative level of sequencing error for a given pair of duplicated sequences. A limitation of the method is the requirement that duplicated sequences for the same template sequence be paired, but this can be overcome by using only conserved regions of 16S rRNA sequences and pairing a given sequence with its highest scoring BLAST search hit from the nr database of GenBank.
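
The exact binomial test mentioned in this abstract can be sketched as a lower-tail probability: given a standard per-base matching probability, how likely is it to observe this few matches between two duplicated sequences? The code below is a minimal illustration, not the authors' procedure; the function names, the one-sided form of the test, and the example numbers (1400 aligned positions, matching probability 0.995) are assumptions. Probabilities are computed in log space so that long sequences do not overflow.

```python
import math


def log_binomial_pmf(k, n, p):
    """Natural log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1.0 - p)


def binomial_lower_tail(matches, length, p_match):
    """P(X <= matches): chance of seeing this few matches or fewer.

    matches: number of matching positions between the paired sequences.
    length:  number of aligned positions compared.
    p_match: standard probability that a base matches (assumed known).
    """
    total = sum(
        math.exp(log_binomial_pmf(k, length, p_match)) for k in range(matches + 1)
    )
    return min(1.0, total)


# Hypothetical pair: 1380 matches out of 1400 positions, standard p = 0.995.
print(binomial_lower_tail(1380, 1400, 0.995))
```

A very small tail probability would suggest the pair contains more mismatches than the standard error level accounts for.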

14.
The mouse doublefoot (Dbf) mutant exhibits preaxial polydactyly in association with craniofacial defects. This mutation has previously been mapped to mouse chromosome 1. We have used a positional cloning strategy, coupled with a comparative sequencing approach using available human draft sequence, to identify putative candidates for the Dbf gene in the mouse and in the homologous human region. We have constructed a high-resolution genetic map of the region, localizing the mutation to a 0.4-cM (±0.0061) interval on mouse chromosome 1. Furthermore, we have constructed contiguous BAC/PAC clone maps across the mouse and human Dbf region. Using existing markers and additional sequence tagged sites, which we have generated, we have anchored the physical map to the genetic map. Through the comparative sequencing of these clones we have identified 35 genes within this interval, indicating that the region is gene-rich. From this we have identified several genes that are known to be differentially expressed in the developing mid-gestation mouse embryo, some in the developing embryonic limb buds. These genes include those encoding known developmental signaling molecules such as WNT proteins and IHH, and we provide evidence that these genes are candidates for the Dbf mutation.

15.
Schnyder crystalline corneal dystrophy (SCCD, MIM 121800) is a rare autosomal dominant disease characterized by progressive opacification of the cornea resulting from the local accumulation of lipids, and associated in some cases with systemic dyslipidemia. Although previous studies of the genetics of SCCD have localized the defective gene to a 1.58 Mbp interval on chromosome 1p, exhaustive sequencing of positional candidate genes has thus far failed to reveal causal mutations. We have ascertained a large multigenerational family in Nova Scotia affected with SCCD in which we have confirmed linkage to the same general area of chromosome 1. Intensive fine mapping in our family revealed a 1.3 Mbp candidate interval overlapping that previously reported. Sequencing of genes in our interval led to the identification of five putative causal mutations in the gene UBIAD1, in our family as well as in four other small families of various geographic origins. UBIAD1 encodes a potential prenyltransferase, and is reported to interact physically with apolipoprotein E. UBIAD1 may play a direct role in intracellular cholesterol biochemistry, or may prenylate other proteins regulating cholesterol transport and storage.

16.
Numerous algorithms have been developed to analyze ChIP-Seq data. However, the complexity of analyzing diverse patterns of ChIP-Seq signals, especially for epigenetic marks, still calls for the development of new algorithms and objective comparisons of existing methods. We developed Qeseq, an algorithm to detect regions of increased ChIP read density relative to background. Qeseq employs critical novel elements, such as iterative recalibration and neighbor joining of reads to identify enriched regions of any length. To objectively assess its performance relative to 14 other ChIP-Seq peak finders, we designed a novel protocol based on Validation Discriminant Analysis (VDA) to optimally select validation sites and generated two validation datasets, which are the most comprehensive to date for algorithmic benchmarking of key epigenetic marks. In addition, we systematically explored a total of 315 diverse parameter configurations from these algorithms and found that typically optimal parameters in one dataset do not generalize to other datasets. Nevertheless, default parameters show the most stable performance, suggesting that they should be used. This study also provides a reproducible and generalizable methodology for unbiased comparative analysis of high-throughput sequencing tools that can facilitate future algorithmic development.

17.
18.
19.
MOTIVATION: As more genomic data becomes available there is increased attention on understanding the mechanisms encoded in the genome. New XML dialects like CellML and Systems Biology Markup Language (SBML) are being developed to describe biological networks of all types. In the absence of detailed kinetic information for these networks, stoichiometric data is an especially valuable source of information. Network databases are the next logical step beyond storing purely genomic information. Just as comparison of entries in genomic databases has been a vital algorithmic problem through the course of the sequencing project, comparison of networks in network databases will be a crucial problem as we seek to integrate higher-order network knowledge. RESULTS: We show that comparing the stoichiometric structure of two reaction systems is equivalent to the graph isomorphism problem. This is encouraging because graph isomorphism is, in practice, a tractable problem using heuristics. The analogous problem of searching for a subsystem of a reaction system is NP-complete. We also discuss heuristic issues in implementations for practical comparison of stoichiometric matrices.
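
What it means for two stoichiometric matrices to have the same structure can be shown with a brute-force check: the matrices match if some reordering of rows (species) and columns (reactions) makes them identical. The sketch below is only an illustration for very small systems, since it tries every column order; it is not one of the heuristics the abstract refers to, and the function name `same_stoichiometry` is hypothetical.

```python
from itertools import permutations


def same_stoichiometry(a, b):
    """True if b equals a after permuting rows (species) and columns (reactions).

    a, b: non-empty stoichiometric matrices given as lists of equal-length rows.
    Row order is handled by sorting rows; column order by exhaustive search,
    so the running time grows factorially with the number of reactions.
    """
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return False
    target = sorted(tuple(row) for row in b)
    for column_order in permutations(range(len(a[0]))):
        reordered = sorted(tuple(row[j] for j in column_order) for row in a)
        if reordered == target:
            return True
    return False


# A -> B -> C written with two different species and reaction orderings.
system_1 = [[-1, 0], [1, -1], [0, 1]]
system_2 = [[1, 0], [0, -1], [-1, 1]]
print(same_stoichiometry(system_1, system_2))
```

Practical tools would replace the exhaustive search with graph isomorphism heuristics, as the abstract suggests.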

20.
The mouse doublefoot (Dbf) mutant exhibits preaxial polydactyly in association with craniofacial defects. This mutation has previously been mapped to mouse chromosome 1. We have used a positional cloning strategy, coupled with a comparative sequencing approach using available human draft sequence, to identify putative candidates for the Dbf gene in the mouse and in the homologous human region. We have constructed a high-resolution genetic map of the region, localizing the mutation to a 0.4-cM (+/-0.0061) interval on mouse chromosome 1. Furthermore, we have constructed contiguous BAC/PAC clone maps across the mouse and human Dbf region. Using existing markers and additional sequence tagged sites, which we have generated, we have anchored the physical map to the genetic map. Through the comparative sequencing of these clones we have identified 35 genes within this interval, indicating that the region is gene-rich. From this we have identified several genes that are known to be differentially expressed in the developing mid-gestation mouse embryo, some in the developing embryonic limb buds. These genes include those encoding known developmental signaling molecules such as WNT proteins and IHH, and we provide evidence that these genes are candidates for the Dbf mutation.
