Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Remarkable advances in DNA sequencing technology have created a need for de novo genome assembly methods tailored to work with the new sequencing data types. Many such methods have been published in recent years, but assembling raw sequence data to obtain a draft genome has remained a complex, multi-step process, involving several stages of sequence data cleaning, error correction, assembly, and quality control. Successful application of these steps usually requires intimate knowledge of a diverse set of algorithms and software. We present an assembly pipeline called A5 (Andrew And Aaron's Awesome Assembly pipeline) that simplifies the entire genome assembly process by automating these stages, integrating several previously published algorithms with new algorithms for quality control and automated assembly parameter selection. We demonstrate that A5 can produce assemblies of quality comparable to a leading assembly algorithm, SOAPdenovo, without any prior knowledge of the particular genome being assembled and without the extensive parameter tuning required by the other assembly algorithm. In particular, the assemblies produced by A5 exhibit 50% or more reduction in broken protein coding sequences relative to SOAPdenovo assemblies. The A5 pipeline can also assemble Illumina sequence data from libraries constructed by the Nextera (transposon-catalyzed) protocol, which have markedly different characteristics from mechanically sheared libraries. Finally, A5 has modest compute requirements, and can assemble a typical bacterial genome on current desktop or laptop computer hardware in under two hours, depending on depth of coverage.

2.
We show that a family of prokaryotic repetitive sequences, called REP (repetitive extragenic palindromic; Stern et al., 1984), is involved in the formation of chromosomal rearrangements such as duplications. The join-points of seven RecA+ tandem duplications previously characterized in Salmonella typhimurium, which fuse the hisD gene to distant foreign promoters, were cloned and sequenced. In all seven cases they are shown to have originated by recombination between distant REP sequences. Importantly, several join-points had also occurred at REP sequences even in a RecA- background. Thus, REPs can recombine with each other by a RecA-independent mechanism involved in the generation of chromosomal rearrangements. While all RecA+ duplications analysed resulted from recombination between REP sequences, some RecA- duplications also occurred outside of REP sequences, in one case by recombination within a 7 bp homology. Possible roles for the known interaction between DNA gyrase and REP in chromosomal rearrangements are discussed.

3.
REPs are highly repeated intergenic palindromic sequences often clustered into structures called BIMEs, which include two individual REPs separated by a short linker of variable length. They play a variety of key roles in the cell. REPs also resemble the sub-terminal hairpins of the atypical IS200/605 family of insertion sequences, which encode Y1 transposases (TnpA(IS200/IS605)). These belong to the HUH endonuclease family, carry a single catalytic tyrosine (Y) and promote single strand transposition. Recently, a new clade of Y1 transposases (TnpA(REP)) was found associated with REP/BIME in structures called REPtrons. It has been suggested that TnpA(REP) is responsible for REP/BIME proliferation over genomes. We analysed and compared REP distribution and REPtron structure in numerous available E. coli and Shigella strains. Phylogenetic analysis clearly indicated that tnpA(REP) was acquired early in the species radiation and was lost later in some strains. To understand REP/BIME behaviour within the host genome, we also studied E. coli K12 TnpA(REP) activity in vitro and demonstrated that it catalyses cleavage and recombination of BIMEs. While TnpA(REP) shared the same general organization and similar catalytic characteristics with TnpA(IS200/IS605) transposases, it exhibited distinct properties potentially important in the creation of BIME variability and in their amplification. TnpA(REP) may therefore be one of the first examples of transposase domestication in prokaryotes.

4.
Repetitive extragenic palindromic (REP) sequences are highly conserved inverted repeat sequences originally discovered in Escherichia coli and Salmonella typhimurium. We have physically mapped these sequences in the E. coli genome by using Southern hybridization of an ordered phage bank of E. coli (Y. Kohara, K. Akiyama, and K. Isono, Cell 50:495-508, 1987) with generic REP probes derived from the REP consensus sequence. The set of REP probe-hybridizing clones was correlated with a set of clones expected to contain REP sequences on the basis of computer searches. We also show that a generic REP probe can be used in Southern hybridization to analyze genomic DNA digested with restriction enzymes to determine genetic relatedness among natural isolates of E. coli. A search for these sequences in other members of the family Enterobacteriaceae shows a consistent correlation between both the number of occurrences and the hybridization strength and genealogical relationship.
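
For readers unfamiliar with the term, an "inverted repeat" or DNA palindrome is a sequence whose two arms are reverse complements of each other. The following is a minimal Python sketch of that definition only; it is not the method used in any of the papers listed here. The function names and the toy sequence in the usage note are hypothetical and chosen purely for illustration.

```python
# Illustrative sketch: checks the textbook definition of a DNA inverted repeat.
# Function names are hypothetical; no real REP consensus sequence is used.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True if the sequence equals its own reverse complement."""
    return seq.upper() == reverse_complement(seq)


def longest_inverted_repeat_arm(seq: str) -> int:
    """Length of the longest prefix that is the reverse complement of the
    equally long suffix, i.e. the stem of a potential hairpin whose loop is
    the unpaired middle part. Returns 0 if no arm pairs."""
    seq = seq.upper()
    for arm in range(len(seq) // 2, 0, -1):
        if seq[:arm] == reverse_complement(seq[-arm:]):
            return arm
    return 0
```

As a usage example, `is_dna_palindrome("GAATTC")` holds because GAATTC reads the same on both strands, while a made-up sequence such as `"GCCGGAT" + "TTTT" + "ATCCGGC"` has two 7-base arms that pair around a 4-base loop.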

5.
Automated correction of genome sequence errors   Total citations: 3, self-citations: 0, citations by others: 3
By using information from an assembly of a genome, a new program called AutoEditor significantly improves base calling accuracy over that achieved by previous algorithms. This in turn improves the overall accuracy of genome sequences and facilitates the use of these sequences for polymorphism discovery. We describe the algorithm and its application in a large set of recent genome sequencing projects. The number of erroneous base calls in these projects was reduced by 80%. In an analysis of over one million corrections, we found that AutoEditor made just one error per 8828 corrections. By substantially increasing the accuracy of base calling, AutoEditor can dramatically accelerate the process of finishing genomes, which involves closing all gaps and ensuring minimum quality standards for the final sequence. It also greatly improves our ability to discover single nucleotide polymorphisms (SNPs) between closely related strains and isolates of the same species.

6.
Repetitive extragenic palindromic (REP) sequences are highly conserved inverted repeats present in up to 1000 copies on the Escherichia coli chromosome. We have shown both in vivo and in vitro that REP sequences can stabilize upstream mRNA by blocking the processive action of 3'→5' exonucleases. In a number of operons, mRNA stabilization by REP sequences plays an important role in the control of gene expression. Furthermore, differential mRNA stability mediated by the REP sequences can be responsible for differential gene expression within polycistronic operons. Despite the key role of REP sequences in mRNA stability and gene expression in a number of operons, several lines of evidence suggest that this is unlikely to be the primary reason for the exceptionally high degree of sequence conservation between REP sequences. Other possible functions for REP sequences are discussed. We propose that REP sequences may be a prokaryotic equivalent of 'selfish DNA' and that gene conversion may play a role in the evolution and maintenance of REP sequences.

7.
Bertels F, Rainey PB. PLoS Genetics, 2011, 7(6): e1002132
Repetitive sequences are a conserved feature of many bacterial genomes. While first reported almost thirty years ago, and frequently exploited for genotyping purposes, little is known about their origin, maintenance, or processes affecting the dynamics of within-genome evolution. Here, beginning with analysis of the diversity and abundance of short oligonucleotide sequences in the genome of Pseudomonas fluorescens SBW25, we show that over-represented short sequences define three distinct groups (GI, GII, and GIII) of repetitive extragenic palindromic (REP) sequences. Patterns of REP distribution suggest that closely linked REP sequences form a functional replicative unit: REP doublets are over-represented, randomly distributed in extragenic space, and more highly conserved than singlets. In addition, doublets are organized as inverted repeats, which together with intervening spacer sequences are predicted to form hairpin structures in ssDNA or mRNA. We refer to these newly defined entities as REPINs (REP doublets forming hairpins) and identify short reads from population sequencing that reveal putative transposition intermediates. The proximal relationship between GI, GII, and GIII REPINs and specific REP-associated tyrosine transposases (RAYTs), combined with features of the putative transposition intermediate, suggests a mechanism for within-genome dissemination. Analysis of the distribution of REPs in a range of RAYT-containing bacterial genomes, including Escherichia coli K-12 and Nostoc punctiforme, shows that REPINs are a widely distributed, but hitherto unrecognized, family of miniature non-autonomous mobile DNA.

8.
Becerril B, Valle F, Merino E, Riba L, Bolivar F. Gene, 1985, 37(1-3): 53-62
Deletions of the 3' flanking DNA region of the glutamate dehydrogenase (GDH) structural gene from Escherichia coli K-12 have been produced on a plasmid that carries the complete gdhA gene. Those deletions include part of the repetitive extragenic palindromic (REP) sequences proposed by Stern et al. [Cell 37 (1984) 1015-1026] as a novel and major feature of the bacterial genome. The effect of these deletions on the final GDH level in the cell has been determined. A broader compilation and analysis of the REP sequences, together with possible alternative functions, is also presented.

9.
Palindromic Units (PU or REP) were defined as DNA sequences of 40 nucleotides highly repeated on the genome of Escherichia coli and other Enterobacteriaceae. PU are found in clusters of up to six occurrences, always localized in extragenic regions. By sorting the DNA sequences of the known PU-containing regions into different classes, we show here for the first time that, besides the PU themselves, each PU cluster contains a number of other conserved sequence motifs. Seven such motifs were identified with the present list of PU regions. Remarkably, each PU cluster is exclusively composed of a mosaic combination of PU and of these other sequence motifs. We demonstrate directly by hybridization experiments that one of these motifs (called L) is indeed present at a large number of copies on the Escherichia coli chromosome and that its distribution follows the same species specificity as PU sequences themselves. We propose that the mosaic pattern of motif combination in PU clusters reveals a new type of bacterial genetic element, which we propose to call BIME for Bacterial Interspersed Mosaic Element. The Escherichia coli genome contains about 500 BIME.

10.
An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assembler. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
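
To make the two features named in this abstract concrete, here is a minimal Python sketch of a de Bruijn graph built from short reads, with k-mer coverage recorded on each edge and a crude coverage threshold used to separate edges. This is an illustration of the general idea only and is not MetaVelvet's actual algorithm; the function names, the fixed threshold, and the toy reads in the usage note are all assumptions.

```python
# Illustrative sketch only: NOT MetaVelvet's implementation.
# Nodes are (k-1)-mers, edges are k-mers, and each edge carries its coverage
# (the number of times the k-mer was seen in the reads).
from collections import Counter, defaultdict


def build_de_bruijn(reads, k):
    """Return {source (k-1)-mer: [(target (k-1)-mer, coverage), ...]}."""
    kmer_counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    graph = defaultdict(list)
    for kmer, coverage in kmer_counts.items():
        graph[kmer[:-1]].append((kmer[1:], coverage))
    return dict(graph)


def split_by_coverage(graph, threshold):
    """Partition edges into high- and low-coverage lists.

    A fixed threshold is a hypothetical simplification: it stands in for the
    coverage (abundance) difference between species in a mixed sample.
    """
    high, low = [], []
    for source, edges in graph.items():
        for target, coverage in edges:
            bucket = high if coverage >= threshold else low
            bucket.append((source, target, coverage))
    return high, low
```

For example, with the made-up reads `["ACGTA", "ACGTA", "CGTAC"]` and `k=3`, the edge from `CG` to `GT` has coverage 3, and a threshold of 2 places the single-copy edge from `TA` to `AC` in the low-coverage list.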

11.
A strategy for assembling the maize (Zea mays L.) genome   Total citations: 2, self-citations: 0, citations by others: 2
Because the bulk of the maize (Zea mays L.) genome consists of repetitive sequences, sequencing efforts are being targeted to its 'gene-rich' fraction. Traditional assembly programs are inadequate for this approach because they are optimized for a uniform sampling of the genome and inherently lack the ability to differentiate highly similar paralogs. RESULTS: We report the development of bioinformatics tools for the accurate assembly of the maize genome. This software, which is based on innovative parallel algorithms to ensure scalability, assembled 730,974 genomic survey sequence fragments in 4 h using 64 Pentium III 1.26 GHz processors of a commodity cluster. Algorithmic innovations are used to reduce the number of pairwise alignments significantly without sacrificing quality. Clone pair information was used to estimate the error rate for improved differentiation of polymorphisms versus sequencing errors. The assembly was also used to evaluate the effectiveness of various filtering strategies and thereby provide information that can be used to focus subsequent sequencing efforts.

12.
The whole genome shotgun approach to genome sequencing results in a collection of contigs that must be ordered and oriented to facilitate efficient gap closure. We present a new tool, OSLay, that uses synteny between matching sequences in a target assembly and a reference assembly to lay out the contigs (or scaffolds) in the target assembly. The underlying algorithm is based on maximum weight matching. The tool provides an interactive visualization of the computed layout, and the result can be imported into the assembly editing tool Consed to support the design of primer pairs for gap closure. MOTIVATION: To enhance efficiency in the gap closure phase of a genome project it is crucial to know which contigs are adjacent in the target genome. Related genome sequences can be used to lay out contigs in an assembly. AVAILABILITY: OSLay is freely available from: http://www-ab.informatik.unituebingen.de/software/oslay.

13.
As a result of improvements in genome assembly algorithms and the ever decreasing costs of high-throughput sequencing technologies, new high quality draft genome sequences are published at a striking pace. With well-established methodologies, larger and more complex genomes are being tackled, including polyploid plant genomes. Given the similarity between multiple copies of a basic genome in polyploid individuals, assembly of such data usually results in collapsed contigs that represent a variable number of homoeologous genomic regions. Unfortunately, such collapse is often not ideal, as keeping contigs separate can lead both to improved assembly and also insights about how haplotypes influence phenotype. Here, we describe a first step in avoiding inappropriate collapse during assembly. In particular, we describe ConPADE (Contig Ploidy and Allele Dosage Estimation), a probabilistic method that estimates the ploidy of any given contig/scaffold based on its allele proportions. In the process, we report findings regarding errors in sequencing. The method can be used for whole genome shotgun (WGS) sequencing data. We also show applicability of the method for variant calling and allele dosage estimation. Results for simulated and real datasets are discussed and provide evidence that ConPADE performs well as long as enough sequencing coverage is available, or the true contig ploidy is low. We show that ConPADE may also be used for related applications, such as the identification of duplicated genes in fragmented assemblies, although refinements are needed.

14.
We developed a new platform for genome-wide gene expression analysis in any eukaryotic organism, which we called SuperSAGE array. The SuperSAGE array is a microarray onto which 26-bp oligonucleotides corresponding to SuperSAGE tag sequences are directly synthesized. A SuperSAGE array combines the advantages of the highly quantitative SuperSAGE expression analysis with the high-throughput microarray technology. We demonstrated highly reproducible gene expression profiling by the SuperSAGE array for 1,000 genes (tags) in rice. We also applied this technology to the detailed study of expressed genes identified by SuperSAGE in Nicotiana benthamiana, an organism for which sufficient genome sequence information is not available. We propose that the SuperSAGE array system represents a new paradigm for microarray construction, as no genomic or cDNA sequence data are required for its preparation.

15.
16.
17.
18.
The parasitic nematode, Brugia malayi, causes lymphatic filariasis in humans, which in severe cases leads to the condition known as elephantiasis. The parasite contains an endosymbiotic alpha-proteobacterium of the genus Wolbachia that is required for normal worm development and fecundity and is also implicated in the pathology associated with infections by these filarial nematodes. Bacterial artificial chromosome libraries were constructed from B. malayi DNA and provide over 11-fold coverage of the nematode genome. Wolbachia genomic fragments were simultaneously cloned into the libraries giving over 5-fold coverage of the 1.1 Mb bacterial genome. A physical framework for the Wolbachia genome was developed by construction of a plasmid library enriched for Wolbachia DNA as a source of sequences to hybridise to high-density bacterial artificial chromosome colony filters. Bacterial artificial chromosome end sequencing provided additional Wolbachia probe sequences to facilitate assembly of a contig that spanned the entire genome. The Wolbachia sequences provided a marker approximately every 10 kb. Four rare-cutting restriction endonucleases were used to restriction map the genome to a resolution of approximately 60 kb and demonstrate concordance between the bacterial artificial chromosome clones and native Wolbachia genomic DNA. Comparison of Wolbachia sequences to public databases using BLAST algorithms under stringent conditions allowed confident prediction of 69 Wolbachia peptide functions and two rRNA genes. Comparison to closely related complete genomes revealed that while most sequences had orthologs in the genome of the Wolbachia endosymbiont from Drosophila melanogaster, there was no evidence for long-range synteny. Rather, there were a few cases of short-range conservation of gene order extending over regions of less than 10 kb. The molecular scaffold produced for the genome of the Wolbachia from B. malayi forms the basis of a genomic sequencing effort for this bacterium, circumventing the difficult challenge of purifying sufficient endosymbiont DNA from a tropical parasite for a whole genome shotgun sequencing strategy.

19.
The generation of a 7.5x dog genome assembly provides exciting new opportunities to interpret tumor-associated chromosome aberrations at the biological level. We present a genomic microarray for array comparative genomic hybridization (aCGH) analysis in the dog, comprising 275 bacterial artificial chromosome (BAC) clones spaced at intervals of approximately 10 Mb. Each clone has been positioned accurately within the genome assembly and assigned to a unique chromosome location by fluorescence in situ hybridization (FISH) analysis, both individually and as chromosome-specific BAC pools. The microarray also contains clones representing the dog orthologues of 31 genes implicated in human cancers. FISH analysis of the 10-Mb BAC clone set indicated excellent coverage of each dog chromosome by the genome assembly. The order of clones was consistent with the assembly, but the cytogenetic intervals between clones were variable. We demonstrate the application of the BAC array for aCGH analysis to identify both whole and partial chromosome imbalances using a canine histiocytic sarcoma case. Using BAC clones selected from the array as probes, multicolor FISH analysis was used to further characterize these imbalances, revealing numerous structural chromosome rearrangements. We outline the value of a combined aCGH/FISH approach, together with a well-annotated dog genome assembly, in canine and comparative cancer studies.

20.
The assembly of a reference genome sequence of bread wheat is challenging due to its specific features such as the genome size of 17 Gbp, polyploid nature and prevalence of repetitive sequences. BAC‐by‐BAC sequencing based on chromosomal physical maps, adopted by the International Wheat Genome Sequencing Consortium as the key strategy, reduces problems caused by the genome complexity and polyploidy, but the repeat content still hampers the sequence assembly. Availability of a high‐resolution genomic map to guide sequence scaffolding and validate physical map and sequence assemblies would be highly beneficial to obtaining an accurate and complete genome sequence. Here, we chose the short arm of chromosome 7D (7DS) as a model to demonstrate for the first time that it is possible to couple chromosome flow sorting with genome mapping in nanochannel arrays and create a de novo genome map of a wheat chromosome. We constructed a high‐resolution chromosome map composed of 371 contigs with an N50 of 1.3 Mb. Long DNA molecules achieved by our approach facilitated chromosome‐scale analysis of repetitive sequences and revealed a ~800‐kb array of tandem repeats intractable to current DNA sequencing technologies. Anchoring 7DS sequence assemblies obtained by clone‐by‐clone sequencing to the 7DS genome map provided a valuable tool to improve the BAC‐contig physical map and validate sequence assembly on a chromosome‐arm scale. Our results indicate that creating genome maps for the whole wheat genome in a chromosome‐by‐chromosome manner is feasible and that they will be an affordable tool to support the production of improved pseudomolecules.
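
Several abstracts in this list report an N50 value. As a reminder of what that statistic measures, the following minimal Python sketch computes N50 under its conventional definition: the length of the contig at which the running total, taken from longest to shortest, first reaches half of the summed contig length. The function name and the example lengths in the usage note are hypothetical and are not data from any of the papers.

```python
# Illustrative sketch of the conventional N50 definition.
# The function name is hypothetical; no published contig lengths are used.

def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length of
    the contig at which the cumulative length first reaches at least half
    of the total assembly length.
    """
    if not lengths:
        raise ValueError("n50() requires at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
```

For instance, for made-up contig lengths `[2, 2, 2, 3, 3, 4, 8, 8]` the total is 32, and the two longest contigs already sum to 16, so the N50 is 8.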
