Similar articles
 20 similar articles found
1.
Correcting errors in shotgun sequences
Sequencing errors in combination with repeated regions cause major problems in shotgun sequencing, mainly due to the failure of assembly programs to distinguish single base differences between repeat copies from erroneous base calls. In this paper, a new strategy designed to correct errors in shotgun sequence data using defined nucleotide positions, DNPs, is presented. The method distinguishes single base differences from sequencing errors by analyzing multiple alignments consisting of a read and all its overlaps with other reads. The construction of multiple alignments is performed using a novel pattern matching algorithm, which takes advantage of the symmetry between indices that can be computed for similar words of the same length. This allows for rapid construction of multiple alignments, with no previous pair-wise matching of sequence reads required. Results from a C++ implementation of this method show that up to 99% of sequencing errors can be corrected, while up to 87% of the single base differences remain and up to 80% of the corrected reads contain at most one error. The results also show that the method outperforms the error correction method used in the EULER assembler. The prototype software, MisEd, is freely available from the authors for academic use.
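The idea of separating sequencing errors from genuine single base differences by inspecting alignment columns can be illustrated with a deliberately simplified sketch. This is not the MisEd algorithm or its pattern matching step: the function name `correct_read`, the `min_support` threshold, and the assumption that all reads are gap-free and already aligned column by column are hypothetical choices made only for illustration.

```python
from collections import Counter


def correct_read(target, overlaps, min_support=2):
    """Correct `target` against pre-aligned overlapping reads (toy model).

    Assumption: every string has the same length and is already aligned
    column by column without gaps. A base that disagrees with the column
    majority is rewritten only if fewer than `min_support` reads carry it;
    otherwise it is kept as a possible difference between repeat copies.
    """
    corrected = []
    kept_differences = []
    for col, base in enumerate(target):
        column = Counter(read[col] for read in overlaps)
        column[base] += 1  # the target read contributes its own base
        majority, _ = column.most_common(1)[0]
        if base != majority and column[base] < min_support:
            corrected.append(majority)  # isolated mismatch: treat as an error
        else:
            if base != majority:
                kept_differences.append(col)  # supported minority base: keep
            corrected.append(base)
    return "".join(corrected), kept_differences
```

With `min_support=2`, a mismatch seen in only the target read is corrected, while a minority base shared with at least one other read is left untouched and its column is reported.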

2.

Background

Most eukaryotic genomes include a substantial repeat-rich fraction termed heterochromatin, which is concentrated in centric and telomeric regions. The repetitive nature of heterochromatic sequence makes it difficult to assemble and analyze. To better understand the heterochromatic component of the Drosophila melanogaster genome, we characterized and annotated portions of a whole-genome shotgun sequence assembly.

Results

WGS3, an improved whole-genome shotgun assembly, includes 20.7 Mb of draft-quality sequence not represented in the Release 3 sequence spanning the euchromatin. We annotated this sequence using the methods employed in the re-annotation of the Release 3 euchromatic sequence. This analysis predicted 297 protein-coding genes and six non-protein-coding genes, including known heterochromatic genes, and regions of similarity to known transposable elements. Bacterial artificial chromosome (BAC)-based fluorescence in situ hybridization analysis was used to correlate the genomic sequence with the cytogenetic map in order to refine the genomic definition of the centric heterochromatin; on the basis of our cytological definition, the annotated Release 3 euchromatic sequence extends into the centric heterochromatin on each chromosome arm.

Conclusions

Whole-genome shotgun assembly produced a reliable draft-quality sequence of a significant part of the Drosophila heterochromatin. Annotation of this sequence defined the intron-exon structures of 30 known protein-coding genes and 267 protein-coding gene models. The cytogenetic mapping suggests that an additional 150 predicted genes are located in heterochromatin at the base of the Release 3 euchromatic sequence. Our analysis suggests strategies for improving the sequence and annotation of the heterochromatic portions of the Drosophila and other complex genomes.

3.
Correcting errors in synthetic DNA through consensus shuffling
Although efficient methods exist to assemble synthetic oligonucleotides into genes and genomes, these suffer from the presence of 1–3 random errors/kb of DNA. Here, we introduce a new method termed consensus shuffling and demonstrate its use to significantly reduce random errors in synthetic DNA. In this method, errors are revealed as mismatches by re-hybridization of the population. The DNA is fragmented, and mismatched fragments are removed upon binding to an immobilized mismatch binding protein (MutS). PCR assembly of the remaining fragments yields a new population of full-length sequences enriched for the consensus sequence of the input population. We show that two iterations of consensus shuffling improved a population of synthetic green fluorescent protein (GFPuv) clones from ~60 to >90% fluorescent, and decreased errors 3.5- to 4.3-fold to final values of ~1 error per 3500 bp. In addition, two iterations of consensus shuffling corrected a population of GFPuv clones where all members were non-functional, to a population where 82% of clones were fluorescent. Consensus shuffling should facilitate the rapid and accurate synthesis of long DNA sequences.

4.
5.

Background  

Many genome projects are left unfinished due to complex, repeated regions. Finishing is the most time-consuming step in sequencing, and current finishing tools are not designed with particular attention to the repeat problem.

6.
Two-dimensional imaging with a single camera assumes that the motion occurs in a calibrated plane perpendicular to the camera axis. It is well known that kinematic errors result if the object fails to remain in this plane and that if both the distance to the calibration plane from the camera and the distance out-of-plane are known, an analytical correction for the out-of-plane error can be made. Less well appreciated is that out-of-plane distance can frequently be acquired from other, nonimage-related information. In the two examples given, the mediolateral center of pressure coordinate of the foot measured from a force plate and the measured landing point of a shot put throw were used. In both cases, the resulting out-of-plane correction improved the accuracy of the 2-D kinematic data dramatically. These examples also demonstrate that the use of nonimage-related data can increase the accuracy of kinematic data without an increase in the complexity of the experiment.
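The abstract does not state the correction formula, so the following is only a sketch of the standard similar-triangles relation for an ideal pinhole camera. It assumes coordinates are measured from the optical axis and that the out-of-plane distance is positive toward the camera; the function and parameter names are hypothetical.

```python
def correct_out_of_plane(x_apparent, camera_to_plane, out_of_plane):
    """Rescale a coordinate measured in the calibration plane (pinhole model).

    x_apparent      -- coordinate from the calibrated image, relative to the
                       optical axis, in calibration-plane units
    camera_to_plane -- distance from the camera to the calibration plane
    out_of_plane    -- displacement of the object from the calibration plane,
                       positive toward the camera (assumed sign convention)

    An object closer to the camera looks larger by a factor of
    camera_to_plane / (camera_to_plane - out_of_plane), so the true
    coordinate is the apparent one multiplied by the inverse of that factor.
    """
    if out_of_plane >= camera_to_plane:
        raise ValueError("object must lie in front of the camera")
    scale = (camera_to_plane - out_of_plane) / camera_to_plane
    return x_apparent * scale
```

For example, with the camera 10 m from the calibration plane and the object 0.5 m nearer the camera, an apparent coordinate of 1.0 m is scaled by 0.95. The out-of-plane distance itself would come from a nonimage source such as a force plate reading.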

7.
"Molecular signatures" are the qualitative and quantitative patterns of groups of biomolecules (e.g., mRNA, proteins, peptides, or metabolites) in a cell, tissue, biological fluid, or an entire organism. To apply this concept to biomarker discovery, the measurements should ideally be noninvasive and performed in a single read-out. We have therefore developed a peptidomics platform that couples magnetics-based, automated solid-phase extraction of small peptides with a high-resolution MALDI-TOF mass spectrometric readout (Villanueva, J.; Philip, J.; Entenberg, D.; Chaparro, C. A.; Tanwar, M. K.; Holland, E. C.; Tempst, P. Anal. Chem. 2004, 76, 1560-1570). Since hundreds of peptides can be detected in microliter volumes of serum, the platform makes it possible to search for disease signatures, for instance in the presence of cancer. We have now evaluated, optimized, and standardized a number of clinical and analytical chemistry variables that are major sources of bias, ranging from blood collection and clotting, to serum storage and handling, automated peptide extraction, crystallization, spectral acquisition, and signal processing. In addition, proper alignment of spectra and user-friendly visualization tools are essential for meaningful, certifiable data mining. We introduce a minimal entropy algorithm, "Entropycal", that simplifies alignment and subsequent statistical analysis and increases the percentage of the highly distinguishing spectral information being retained after feature selection of the datasets. Using the improved analytical platform and tools, and a commercial statistics program, we found that sera from thyroid cancer patients can be distinguished from healthy controls based on an array of 98 discriminant peptides. With adequate technological and computational methods in place, and using rigorously standardized conditions, potential sources of patient related bias (e.g., gender, age, genetics, environmental, dietary, and other factors) may now be addressed.

8.
Finishing, i.e. gap closure and editing, is the most time-consuming part of genome sequencing. Repeated sequences together with sequencing errors complicate the assembly and often result in misassemblies that are difficult to correct. Repeat Discrepancy Tagger (ReDiT) is a tool designed to aid in the finishing step. This software processes assembly results produced by any fragment assembly program that outputs ace files. The input sequences are analyzed to determine possible differences between repeated sequences. The output is written as tags in an ace file that can be viewed by, e.g. the Consed sequence editor. AVAILABILITY: The ReDiT program is freely available at http://web.cgb.ki.se/redit

9.
Tau is a neuronal microtubule-associated protein that promotes microtubule assembly, stability, and bundling in axons. Two distinct regions of tau are important for the tau-microtubule interaction, a relatively well-characterized repeat region in the carboxyl terminus (containing either three or four imperfect 18-amino acid repeats separated by 13- or 14-amino acid long inter-repeats) and a more centrally located, relatively poorly characterized proline-rich region. By using amino-terminal truncation analyses of tau, we have localized the microtubule binding activity of the proline-rich region to Lys215-Asn246 and identified a small sequence within this region, 215KKVAVVR221, that exerts a strong influence on microtubule binding and assembly in both three- and four-repeat tau isoforms. Site-directed mutagenesis experiments indicate that these capabilities are derived largely from Lys215/Lys216 and Arg221. In marked contrast to synthetic peptides corresponding to the repeat region, peptides corresponding to Lys215-Asn246 and Lys215-Thr222 alone possess little or no ability to promote microtubule assembly, and the peptide Lys215-Thr222 does not effectively suppress in vitro microtubule dynamics. However, combining the proline-rich region sequences (Lys215-Asn246) with their adjacent repeat region sequences within a single peptide (Lys215-Lys272) enhances microtubule assembly by 10-fold, suggesting intramolecular interactions between the proline-rich and repeat regions. Structural complexity in this region of tau also is suggested by sequential amino-terminal deletions through the proline-rich and repeat regions, which reveal an unusual pattern of loss and gain of function. Thus, these data lead to a model in which efficient microtubule binding and assembly activities by tau require intramolecular interactions between its repeat and proline-rich regions. 
This model, invoking structural complexity for the microtubule-bound conformation of tau, is fundamentally different from previous models of tau structure and function, which viewed tau as a simple linear array of independently acting tubulin-binding sites.

10.
Double-barreled (DB) data have been widely used for the assembly of large genomes. Based on the experience of building the whole-genome working draft of Oryza sativa L. ssp. indica, we present here the prevailing and improved uses of DB data in the assembly procedure and report on novel applications during the following data-mining processes, such as acquiring precise insert fragment information of each clone across the genome, and a new kind of low-cost whole-genome microarray. With the increasing number of organisms being sequenced, we believe that DB data will play an important role both in other assembly procedures and in future genomic studies.

11.
12.
Spatio-temporal regulation of the cell death machinery is essential for normal development and homeostasis of multicellular organisms. While the molecular basis for the central cell death machinery driven by caspases is now well documented, its regulatory mechanisms, especially in the context of living animals, remain to be clarified. The c-Jun N-terminal kinase (JNK) pathway is an evolutionarily conserved kinase cascade that regulates the apoptotic machinery. In mammals, JNK signaling has been implicated in stress-induced apoptosis. Drosophila genetics has now provided evidence of a novel role for JNK-mediated cell death signaling in eliminating developmentally aberrant cells from a tissue. The JNK-dependent cell-elimination system is orchestrated by cell-cell communication between normal and aberrant cells and is essential for ensuring developmental robustness, as well as for protecting organisms against fatal abnormalities such as neoplastic development. These processes are mediated by cell competition, morphogenetic apoptosis, and intrinsic tumor suppression. A combinatorial approach using both genetic and live-imaging systems in Drosophila will be extremely powerful to decipher how JNK-mediated apoptosis works within multicellular communities.

13.
In this paper, we propose an efficient, reliable shotgun sequence assembly algorithm based on a fingerprinting scheme that is robust to both noise and repetitive sequences in the data, two primary roadblocks to effective whole-genome shotgun sequencing. Our algorithm uses exact matches of short patterns randomly selected from fragment data to identify fragment overlaps, construct an overlap map, and deliver a consensus sequence. We show how statistical clues made explicit in our approach can easily be exploited to correctly assemble results even in the presence of extensive repetitive sequences. Our approach is both accurate and exceptionally fast in practice: e.g., we have correctly assembled the whole Mycoplasma genitalium genome (approximately 580 kbp) in roughly 8 minutes of 64-MB 200-MHz Pentium Pro CPU time from real shotgun data, where most existing algorithms can be expected to run for several hours to a day on the same data. Moreover, experiments with artificially-shotgunned data prepared from real DNA sequences from a wide range of organisms (including human DNA) and containing complex repeating regions demonstrate our algorithm's robustness to input noise and the presence of repetitive sequences. For example, we have correctly assembled a 238-kbp human DNA sequence in less than 3 min of 64-MB 200-MHz Pentium Pro CPU time.
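The overlap-detection step described above (exact matches of short, randomly selected patterns) can be sketched roughly as follows. This is not the authors' implementation: the function names, the pattern length `k`, the seeding, and the vote-counting scheme are assumptions made for illustration, and the sketch omits the overlap map, the consensus step, and any handling of noise or repeats.

```python
import random
from collections import defaultdict


def sample_patterns(fragments, k, n_patterns, seed=0):
    """Draw up to n_patterns distinct k-mers at random from the fragments."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    patterns = set()
    for _ in range(n_patterns):
        frag = rng.choice(fragments)
        start = rng.randrange(len(frag) - k + 1)  # assumes len(frag) >= k
        patterns.add(frag[start:start + k])
    return patterns


def candidate_overlaps(fragments, patterns):
    """Count shared exact pattern hits per (fragment a, fragment b, offset).

    The offset is pos_in_a - pos_in_b, i.e. where fragment b would start
    relative to fragment a if the shared pattern marks a true overlap.
    """
    votes = defaultdict(int)
    for pattern in patterns:
        hits = []
        for idx, frag in enumerate(fragments):
            pos = frag.find(pattern)
            while pos != -1:  # record every occurrence, including overlapping ones
                hits.append((idx, pos))
                pos = frag.find(pattern, pos + 1)
        for a_idx, a_pos in hits:
            for b_idx, b_pos in hits:
                if a_idx < b_idx:
                    votes[(a_idx, b_idx, a_pos - b_pos)] += 1
    return dict(votes)
```

Fragment pairs whose offset collects many independent pattern votes are stronger overlap candidates than pairs supported by a single shared pattern, which is one simple way such statistical clues could be used to discount repeat-induced matches.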

14.

Background

Tandem repeat variation in protein-coding regions will alter protein length and may introduce frameshifts. Tandem repeat variants are associated with variation in pathogenicity in bacteria and with human disease. We characterized tandem repeat polymorphism in human proteins, using the UniGene database, and tested whether these were associated with host defense roles.

Results

Protein-coding tandem repeat copy-number polymorphisms were detected in 249 tandem repeats found in 218 UniGene clusters; observed length differences ranged from 2 to 144 nucleotides, with unit copy lengths ranging from 2 to 57. This corresponded to 1.59% (218/13,749) of proteins investigated carrying detectable polymorphisms in the copy-number of protein-coding tandem repeats. We found no evidence that tandem repeat copy-number polymorphism was significantly elevated in defense-response proteins (p = 0.882). An association with the Gene Ontology term 'protein-binding' remained significant after covariate adjustment and correction for multiple testing. Combining this analysis with previous experimental evaluations of tandem repeat polymorphism, we estimate the approximate mean frequency of tandem repeat polymorphisms in human proteins to be 6%. Because 13.9% of the polymorphisms were not a multiple of three nucleotides, up to 1% of proteins may contain frameshifting tandem repeat polymorphisms.

Conclusion

Around 1 in 20 human proteins are likely to contain tandem repeat copy-number polymorphisms within coding regions. Such polymorphisms are not more frequent among defense-response proteins; their prevalence among protein-binding proteins may reflect lower selective constraints on their structural modification. The impact of frameshifting and longer copy-number variants on protein function and disease merits further investigation.
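The percentages quoted in the Results of this abstract can be checked with a few lines of arithmetic. The first calculation uses only numbers stated in the abstract. The second is one plausible reading of how the "up to 1%" frameshift figure follows from the 6% estimate and the 13.9% non-triplet share; the abstract does not spell out that derivation, so treat it as an assumption.

```python
# Figures taken from the abstract.
polymorphic_clusters = 218
proteins_investigated = 13_749
estimated_mean_frequency = 0.06   # estimated share of proteins with polymorphisms
non_triplet_share = 0.139         # polymorphisms not a multiple of three nucleotides

# Share of investigated proteins with a detected polymorphism (about 1.59%).
detected_fraction = polymorphic_clusters / proteins_investigated

# Assumed derivation of the frameshift bound: non-triplet share of the 6% estimate.
frameshift_fraction = estimated_mean_frequency * non_triplet_share

print(f"detected: {detected_fraction:.2%}")
print(f"possible frameshifts: {frameshift_fraction:.2%}")
```

Under this reading the frameshift estimate comes to a little under 1% of proteins, consistent with the abstract's "up to 1%".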

16.
17.
18.
MapLinker is an analysis tool, as well as a browsing interface, that facilitates integration of whole genome sequence assembly with a clone-based physical map. Using the locations of sequence markers on the physical map, MapLinker generates a tentative sequence map of the genome that serves to verify the map and to guide genome-wide finishing.

19.
20.
The nuclear pore complex, through the interaction of its proteins with transport receptors, controls the transport of large molecules into and out of the cell's nucleus. There is ample evidence for proteins with FG sequence repeats playing an essential role in this control. Previous studies have elucidated binding spots for FG sequence repeats on the surface of the transport receptor importin-beta by X-ray crystallography and mutational studies. Molecular dynamics simulations have been performed to characterize the interaction of FG sequence repeats with the transport receptor. Observed binding spots have been verified and novel sites discovered, suggesting that importin-beta features many more binding spots than suspected so far. The observed binding spots are in accord with several models of nucleocytoplasmic transport, and the large number of binding spots on importin-beta may be necessary for the pore complex to distinguish between importin-beta and inert proteins, and to allow for its passage through the pore.
