Similar literature
1.
2.
In pairwise end sequencing, sequences are determined from both ends of random subclones derived from a DNA target. Sufficiently similar overlapping end sequences are identified and grouped into contigs. When a clone's paired end sequences fall in different contigs, the contigs are connected together to form scaffolds. Increasingly, the goals of pairwise strategies are large and highly repetitive genomic targets. Here, we consider large-scale pairwise strategies that employ mixtures of subclone sizes. We explore the properties of scaffold formation within a hybrid theory/simulation mathematical model of a genomic target that contains many repeat families. Using this model, we evaluate problems that may arise, such as falsely linked end sequences (due either to random matches or to homologous repeats) and scaffolds that terminate without extending the full length of the target. We illustrate our model with an exploration of a strategy for sequencing the human genome. Our results show that, for a strategy that generates 10-fold sequence coverage derived from the ends of clones ranging in length from 2 to 150 kb, using an appropriate rule for detecting overlaps, we expect few false links while obtaining a single scaffold extending the length of each chromosome.
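The scaffolding step described in this abstract can be sketched roughly as follows. This is only an illustration, not the model used in the paper: it assumes contigs have already been built and that each clone is reduced to a pair of contig indices holding its two end sequences. The names `build_scaffolds` and `mate_links` are hypothetical.

```python
def build_scaffolds(num_contigs, mate_links):
    """Group contigs into scaffolds using a simple union-find.

    mate_links is a list of (contig_a, contig_b) pairs, one per clone,
    giving the contigs that contain the clone's two end sequences
    (an assumed input format, chosen for illustration).
    """
    parent = list(range(num_contigs))

    def find(x):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in mate_links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # The clone spans two different contigs, so link them.
            parent[root_b] = root_a

    scaffolds = {}
    for contig in range(num_contigs):
        scaffolds.setdefault(find(contig), []).append(contig)
    return sorted(scaffolds.values())
```

A clone whose two ends fall in the same contig adds no link, and a contig with no spanning clones remains a scaffold of its own. False links of the kind the abstract discusses would show up here as spurious entries in `mate_links`.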

3.
MOTIVATION: Investigators utilize gap estimates for DNA sequencing projects. Standard theories assume sequences are independently and identically distributed, leading to appreciable under-prediction of gaps. RESULTS: Using a statistical scaling factor and data from 20 representative whole genome shotgun projects, we construct regression equations that relate coverage to a normalized gap measure. Prokaryotic genomes do not correlate to sequence coverage, while eukaryotes show strong correlation if the chaff is ignored. Gaps decrease at an exponential rate of only about one-third of that predicted via theory alone. Case studies suggest that departure from theory can largely be attributed to assembly difficulties for repeat-rich genomes, but bias and coverage anomalies are also important when repeats are sparse. Such factors cannot be readily characterized a priori, suggesting upper limits on the accuracy of gap prediction. We also find that diminishing coverage probability discussed in other studies is a theoretical artifact that does not arise for the typical project.
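To give a feel for what "one-third of the predicted exponential rate" could mean in practice, here is a minimal sketch. It assumes the idealized Lander-Waterman form, in which the expected number of gaps is roughly N·exp(−c) for N reads at coverage c. The abstract does not state its regression equations, so the `rate_scale` parameter and the function name `expected_gaps` are illustrative assumptions, not the authors' model.

```python
import math

def expected_gaps(num_reads, coverage, rate_scale=1.0):
    """Idealized expected gap count: N * exp(-rate_scale * c).

    rate_scale=1.0 corresponds to the textbook i.i.d. assumption;
    rate_scale=1/3 mimics the slower decay reported in the abstract
    (an assumed reading, used here only for illustration).
    """
    return num_reads * math.exp(-rate_scale * coverage)

# Compare the two decay rates over a few coverage levels.
for c in (2, 4, 6, 8):
    theory = expected_gaps(100_000, c)
    slowed = expected_gaps(100_000, c, rate_scale=1 / 3)
    print(f"coverage {c}x: theory {theory:.0f} gaps, one-third rate {slowed:.0f} gaps")
```

Under these assumptions the gap between the two curves widens quickly with coverage, which is consistent with the abstract's point that standard theory under-predicts gaps.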

4.
Tandem repeats often confound large genome assemblies. A survey of tandemly arrayed repetitive sequences was carried out in whole genome sequences of the green alga Chlamydomonas reinhardtii, the moss Physcomitrella patens, the monocots rice and sorghum, and the dicots Arabidopsis thaliana, poplar, grapevine, and papaya, in order to test how these assemblies deal with this fraction of DNA. Our results suggest that plant genome assemblies preferentially include tandem repeats composed of shorter monomeric units (especially dinucleotide and 9–30-bp repeats), while higher repetitive units pose more difficulties to assemble. Nevertheless, although currently available sequencing technologies struggle with higher arrays of repeated DNA, major well-known repetitive elements, including centromeric and telomeric repeats as well as high copy-number genes, were found to be reasonably well represented. A database including all tandem repeat sequences characterized here was created to benefit future comparative genomic analyses.

5.
Genomic V exons from whole genome shotgun data in reptiles   Total citations: 1 (self-citations: 0, citations by others: 1)
Reptiles and mammals diverged over 300 million years ago, creating two parallel evolutionary lineages amongst terrestrial vertebrates. In reptiles, two main evolutionary lines emerged: one gave rise to Squamata, while the other gave rise to Testudines, Crocodylia, and Aves. In this study, we determined the genomic variable (V) exons from whole genome shotgun sequencing (WGS) data in reptiles corresponding to the three main immunoglobulin (IG) loci and the four main T cell receptor (TR) loci. We show that Squamata lack the TRG and TRD genes, and snakes lack the IGKV genes. In representative species of Testudines and Crocodylia, the seven major IG and TR loci are maintained. As in mammals, genes of the IG loci can be grouped into well-defined IMGT clans through a multi-species phylogenetic analysis. We show that the reptilian IGHV and IGLV genes are distributed amongst the established mammalian clans, while their IGKV genes are found within a single clan, nearly exclusive from the mammalian sequences. The reptilian and mammalian TRAV genes cluster into six common evolutionary clades (since IMGT clans have not been defined for TR). In contrast, the reptilian TRBV genes cluster into three clades, which have few mammalian members. In this locus, the V exon sequences from mammals appear to have undergone different evolutionary diversification processes that occurred outside these shared reptilian clans. These sequences can be obtained in a freely available public repository (http://vgenerepertoire.org).

6.
Circular genome visualization and exploration using CGView   Total citations: 1 (self-citations: 0, citations by others: 1)
SUMMARY: CGView (Circular Genome Viewer) is a Java application and library for generating high-quality, zoomable maps of circular genomes. It converts XML or tab-delimited input into a graphical map (PNG, JPG or Scalable Vector Graphics format), complete with sequence features, labels, legends and footnotes. In addition to the default full view map, the program can generate a series of hyperlinked maps showing expanded views. The linked maps can be explored using any Web browser, allowing rapid genome browsing and facilitating data sharing. AVAILABILITY: CGView (the standalone application, library or applet), sample input, sample maps and documentation can be obtained from http://wishart.biology.ualberta.ca/cgview/ CONTACT: david.wishart@ualberta.ca.

7.
MapLinker is an analysis tool, as well as a browsing interface, that facilitates integration of whole genome sequence assembly with a clone-based physical map. Using the locations of sequence markers on the physical map, MapLinker generates a tentative sequence map of the genome that serves to verify the map and to guide genome-wide finishing.

8.
An increasingly important problem in genome sequencing is the failure of the commonly used shotgun assembly programs to correctly assemble repetitive sequences. The assembly of non-repetitive regions or regions containing repeats considerably shorter than the average read length is in practice easy to solve, while longer repeats have been a difficult problem. Here we present a statistical method to separate arbitrarily long, almost identical repeats, which makes it possible to correctly assemble complex repetitive sequence regions. The differences between repeat units may be as low as 1% and the sequencing error may be up to ten times higher. The method is based on the realization that a comparison of only a part of all overlapping sequences at a time in a data set does not generate enough information for a conclusive analysis. Our method uses optimal multi-alignments consisting of all the overlaps of each read. This makes it possible to determine defined nucleotide positions (DNPs), which constitute the differences between the repeat units. Differences between repeats are distinguished from sequencing errors using statistical methods, where the probabilities of obtaining certain combinations of candidate DNPs are calculated using the information from the multi-alignments. The use of DNPs and combinations of DNPs will allow for optimal and rapid assemblies of repeated regions. This method can resolve repeats that differ in only two positions in a read length, which is the theoretical limit for repeat separation. We predict that this method will be highly useful in shotgun sequencing in the future.

9.
10.
Whole genome DNA microarrays were constructed and used to investigate genomic diversity in 18 Campylobacter jejuni strains from diverse sources. New algorithms were developed that dynamically determine the boundary between the conserved and variable genes. Seven hypervariable plasticity regions (PR) were identified in the genome (PR1 to PR7) containing 136 genes (50%) of the variable gene pool. When comparisons were made with the sequenced strain NCTC11168, the number of absent or divergent genes ranged from 2.6% (40 genes) to 10.2% (163) and in total 16.3% (269) of the genes were variable. PR1 contains genes important in the utilisation of alternative electron acceptors for respiration and may confer a selective advantage to strains in restricted oxygen environments. PR2, 3 and 7 contain many outer membrane and periplasmic proteins and hypothetical proteins of unknown function that might be linked to phenotypic variation and adaptation to different ecological niches. PR4, 5 and 6 contain genes involved in the production and modification of antigenic surface structures.

11.
12.
We describe an algorithm, ReAS, to recover ancestral sequences for transposable elements (TEs) from the unassembled reads of a whole genome shotgun. The main assumptions are that these TEs must exist at high copy numbers across the genome and must not be so old that they are no longer recognizable in comparison to their ancestral sequences. Tested on the japonica rice genome, ReAS was able to reconstruct all of the high copy sequences in the Repbase repository of known TEs, and increase the effectiveness of RepeatMasker in identifying TEs from genome sequences.

13.

Background  

The ability to visualize genomic features and design experimental assays that can target specific regions of a genome is essential for modern biology. To assist in these tasks, we present Genomorama, a software program for interactively displaying multiple genomes and identifying potential DNA hybridization sites for assay design.

14.
Hubisz MJ, Lin MF, Kellis M, Siepel A. PLoS ONE, 2011, 6(2): e17034
The recent release of twenty-two new genome sequences has dramatically increased the data available for mammalian comparative genomics, but twenty of these new sequences are currently limited to ~2× coverage. Here we examine the extent of sequencing error in these 2× assemblies, and its potential impact in downstream analyses. By comparing 2× assemblies with high-quality sequences from the ENCODE regions, we estimate the rate of sequencing error to be 1-4 errors per kilobase. While this error rate is fairly modest, sequencing error can still have surprising effects. For example, an apparent lineage-specific insertion in a coding region is more likely to reflect sequencing error than a true biological event, and the length distribution of coding indels is strongly distorted by error. We find that most errors are contributed by a small fraction of bases with low quality scores, in particular, by the ends of reads in regions of single-read coverage in the assembly. We explore several approaches for automatic sequencing error mitigation (SEM), making use of the localized nature of sequencing error, the fact that it is well predicted by quality scores, and information about errors that comes from comparisons across species. Our automatic methods for error mitigation cannot replace the need for additional sequencing, but they do allow substantial fractions of errors to be masked or eliminated at the cost of modest amounts of over-correction, and they can reduce the impact of error in downstream phylogenomic analyses. Our error-mitigated alignments are available for download.

15.
16.
Two different strategies for determining the human genome are currently being pursued: one is the "clone-by-clone" approach, employed by the publicly funded project, and the other is the "whole genome shotgun assembler" approach, favored by researchers at Celera Genomics. An interim strategy employed at Celera, called compartmentalized shotgun assembly, makes use of preliminary data produced by both approaches. In this paper we describe the design, implementation and operation of the "compartmentalized shotgun assembler".

17.
18.
We report a general method for the detection of restriction fragment length alterations associated with mutations or polymorphisms using whole genomic DNA rather than specific cloned DNA probes. We utilized a modified Southern Cross hybridization to display the hybridization pattern of all size-separated restriction fragments from wild-type Caenorhabditis elegans to all the corresponding fragments in a particular mutant strain and in a distinct C. elegans variety. In this analysis, almost all homologous restriction fragments are the same size in both strains and result in an intense diagonal of hybridization, whereas homologous fragments that differ in size between the two strains generate an off-diagonal spot. To attenuate the contribution of repeated sequences in the genome to spurious off-diagonal spots, restriction fragments from each genome were partially resected with a 3' or 5' exonuclease and not denatured, so that only the DNA sequences at the ends of these fragments could hybridize. Off-diagonal hybridization spots were detected at the expected locations when genomic DNA from wild-type was compared to an unc-54 mutant strain containing a 1.5 kb deletion or to a C. elegans variety that contains dispersed transposon insertions. We suggest that this modified Southern Cross hybridization technique could be used to identify restriction fragment length alterations associated with mutations or genome rearrangements in organisms with DNA complexities as large as 10^8 base pairs and, using rare-cutting enzymes and pulsed-field gel electrophoresis, perhaps as large as mammalian genomes. This information could be used to clone fragments associated with such DNA alterations.

19.
Rates of genome evolution and branching order from whole genome analysis   Total citations: 2 (self-citations: 0, citations by others: 2)
Accurate estimation of any phylogeny is important as a framework for evolutionary analysis of form and function at all levels of organization from sequence to whole organism. Using alignments of nonrepetitive components of opossum, human, mouse, rat, and dog genomes, we evaluated two alternative tree topologies for eutherian evolution. We show with very high confidence that there is a basal split between rodents (as represented by the mouse and rat) and a branch joining primates (as represented by humans) and carnivores (as represented by dogs), consistent with some but not the most widely accepted mammalian phylogenies. The result was robust to substitution model choice, with equivalent inference returned from a spectrum of models ranging from a general time-reversible model to a model that treated nucleotides as either purines or pyrimidines, and variants of these that incorporated rate heterogeneity among sites. By determining this particular branching order we are able to show that the rate of molecular evolution is almost identical in rodent and carnivore lineages and that sequences evolve approximately 11%-14% faster in these lineages than in the primate lineage. In addition, with the chicken as outgroup, the analyses suggested that the rate of evolution in all eutherian lineages is approximately 30% slower than in the opossum lineage. This pattern of relative rates is inconsistent with the hypothesis that generation time is an important determinant of substitution rates and, by implication, mutation rates. Possible factors causing rate differences between the lineages include differences in DNA repair and replication enzymology, and shifts in nucleotide pools. Our analysis demonstrates the importance of using multiple sequences from across the genome to estimate phylogeny and relative evolutionary rate in order to reduce the influence of distorting local effects evident even in relatively long sequences.

20.
Current state-of-the-art experimental and computational proteomic approaches were integrated to obtain a comprehensive protein profile of Populus vascular tissue. This featured: (1) a large sample set consisting of two genotypes grown under normal and tension stress conditions, (2) bioinformatics clustering to effectively handle gene duplication, and (3) an informatics approach to track and identify single amino acid polymorphisms (SAAPs). By applying a clustering algorithm to the Populus database, the number of protein entries decreased from 64,689 proteins to a total of 43,069 protein groups, thereby reducing 7505 identified proteins to a total of 4226 protein groups, in which 2016 were singletons. This reduction implies that ~50% of the measured proteins shared extensive sequence homology. Using conservative search criteria, we were able to identify 1354 peptides containing a SAAP and 201 peptides that become tryptic due to a K or R substitution. These newly identified peptides correspond to 502 proteins, including 97 previously unidentified proteins. In total, the integration of deep proteome measurements on an extensive sample set with protein clustering and peptide sequence variants provided an exceptional level of proteome characterization for Populus, allowing us to spatially resolve the vascular tissue proteome.
