Similar literature
 20 similar articles found
1.
A Bayesian approach to DNA sequence segmentation   Total citations: 3, self-citations: 0, citations by others: 3
Boys RJ, Henderson DA. Biometrics, 2004, 60(3): 573-581
Many deoxyribonucleic acid (DNA) sequences display compositional heterogeneity in the form of segments of similar structure. This article describes a Bayesian method that identifies such segments by using a Markov chain governed by a hidden Markov model. Markov chain Monte Carlo (MCMC) techniques are employed to compute all posterior quantities of interest and, in particular, allow inferences to be made regarding the number of segment types and the order of Markov dependence in the DNA sequence. The method is applied to the segmentation of the bacteriophage lambda genome, a common benchmark sequence used for the comparison of statistical segmentation algorithms.

2.
A finite-context (Markov) model of order k yields the probability distribution of the next symbol in a sequence of symbols, given the recent past up to depth k. Markov modeling has long been applied to DNA sequences, for example to find gene-coding regions. With the first studies came the discovery that DNA sequences are non-stationary: distinct regions require distinct model orders. Since then, Markov and hidden Markov models have been extensively used to describe the gene structure of prokaryotes and eukaryotes. However, to our knowledge, a comprehensive study of the potential of Markov models to describe complete genomes is still lacking. We address this gap in this paper. Our approach relies on (i) multiple competing Markov models of different orders, (ii) careful programming techniques that allow orders as large as sixteen, (iii) adequate inverted repeat handling, and (iv) probability estimates suited to the wide range of context depths used. To measure how well a model fits the data at a particular position in the sequence, we use the negative logarithm of the probability estimate at that position. The measure yields information profiles of the sequence, which are of independent interest. The average over the entire sequence, which amounts to the average number of bits per base needed to describe the sequence, is used as a global performance measure. Our main conclusion is that, from the probabilistic or information-theoretic point of view and according to this performance measure, multiple competing Markov models explain entire genomes almost as well as, or even better than, state-of-the-art DNA compression methods, such as XM, which rely on very different statistical models. This is surprising, because Markov models are local (short-range), in contrast with the statistical models underlying other methods, which exploit the extensive data repetitions in DNA sequences and therefore have a non-local character.
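The per-position measure and the bits-per-base average described in this abstract can be illustrated with a minimal sketch. It assumes a single adaptive order-k model with additive smoothing; the function names and the smoothing parameter `alpha` are hypothetical, and the competing models and inverted-repeat handling mentioned in the abstract are not reproduced here.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def information_profile(seq, k, alpha=1.0):
    """Per-base information content (bits) under an adaptive order-k model.

    Each symbol is scored before the counts are updated, so every estimate
    depends only on the past. alpha is an additive-smoothing parameter
    (an assumption made for this sketch, not taken from the abstract).
    """
    counts = defaultdict(lambda: dict.fromkeys(ALPHABET, 0))
    profile = []
    for i, symbol in enumerate(seq):
        context = seq[max(0, i - k):i]          # up to k preceding symbols
        ctx_counts = counts[context]
        total = sum(ctx_counts.values())
        p = (ctx_counts[symbol] + alpha) / (total + alpha * len(ALPHABET))
        profile.append(-math.log2(p))           # negative log of the estimate
        ctx_counts[symbol] += 1
    return profile

def bits_per_base(seq, k, alpha=1.0):
    """Average of the information profile: the global performance measure."""
    profile = information_profile(seq, k, alpha)
    return sum(profile) / len(profile)
```

With no history at all, the first base costs exactly 2 bits (a uniform guess over four symbols); a highly repetitive sequence drives the average well below that.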

3.
Nucleotide sequence analysis revealed that a DNA length polymorphism 5' to the human antithrombin III gene is due to the presence of 32 bp or 108 bp nonhomologous nucleotide sequences (variable segments) 345 bp upstream from the translation initiation codon. Sequences at the 3' borders of both variable segments can form intrastrand inverted repeat structures with sequences further downstream. An inverted repeat is also found immediately 5' to the site where the variable segments are located. Thus, cruciform structures may form flanking the variable segments of both alleles of this DNA length polymorphism. DNA secondary structure may be detected with single-strand-specific nucleases. S1 nuclease sensitive sites were mapped in recombinant plasmids containing the cloned alleles of the ATIII length polymorphism. The site most sensitive to S1 is located upstream from the variable segments in an AT-rich segment flanked by 6 bp direct repeats. A region of lesser nuclease sensitivity was also observed in the AT-rich loops formed between the inverted repeats 5' to the variable segments.

4.
Eucaryotic transposable genetic elements with inverted terminal repeats   Total citations: 22, self-citations: 0, citations by others: 22
S Potter, M Truett, M Phillips, A Maher. Cell, 1980, 20(3): 639-647
DNA carrying inverted repeats was tested for transposition within the Drosophila genome. Five Bam HI segments containing related inverted repeats were isolated from D. melanogaster and analyzed by electron microscopy and restriction mapping. Southern blot experiments using single-copy flanking sequences as probes allowed the study of DNA arrangements at specific sites in the genomes of five closely related strains. We found that in some genomes the sequences with inverted repeats were present at a particular site, whereas in other genomes they were absent from this site. These results indicated that three of the sequences are transposable genetic elements. In one case we have purified the two corresponding DNA segments, with and without the sequence containing inverted repeats, thereby confirming the mobility of this sequence. These DNA elements were found to be distinct in two ways from copia and others previously described: first, they contain inverted terminal repeats, and second, they have a more heterogeneous construction.

5.
6.
The shufflon of plasmid R64 consists of four DNA segments separated and flanked by seven sfx recombination sites. Rci-mediated recombination between any inverted sfx sequences causes inversion of the DNA segments independently or in groups. The R64 shufflon selects one of seven pilV genes encoding type IV pilus adhesins, in which the N-terminal region is constant, while the C-terminal regions are variable. The R64 sfx sequences are asymmetric. The sfx central region and right arm sequences are conserved, but left arm sequences are not. Here we constructed a symmetric sfx sequence, in which the sfx left arm sequence was changed to the inverted repeat of the right arm sequence, and made artificial shufflon segments carrying symmetric sfx sequences in inverted or direct orientations. The symmetric sfx sequence exhibited the highest inversion frequency in a shufflon segment flanked by two inverted sfx sequences. Rci-dependent deletion of a shufflon segment flanked by two direct symmetric sfx sequences was observed, suggesting that asymmetry of R64 sfx sequences inhibits recombination between direct sfx sequences. In addition, intermolecular recombination between symmetric sfx sequences was also observed. The extra C-terminal domain of Rci was shown to be essential for inversion of the R64 shufflon using asymmetric sfx sequences but not essential for recombination using symmetric sfx sequences, suggesting that the Rci C-terminal segment helps the binding of Rci to asymmetric sfx sequences. Rci protein lacking the C-terminal domain bound to both arms of the symmetric sfx sequence but only to the right arm of the asymmetric sfx sequence.

7.
Tandem repeats occur frequently in biological sequences. They are important for studying genome evolution and human disease. A number of methods have been designed to detect a single tandem repeat in a sliding window. In this article, we focus on the case in which an unknown number of tandem repeat segments of the same pattern are dispersed throughout a sequence. We construct a probabilistic generative model for the tandem repeats, where the sequence pattern is represented by a motif matrix. A Bayesian approach is adopted to compute this model. Markov chain Monte Carlo (MCMC) algorithms are used to explore the posterior distribution in order to infer both the motif matrix of tandem repeats and the location of repeat segments. Reversible jump Markov chain Monte Carlo (RJMCMC) algorithms are used to address the transdimensional model selection problem raised by the variable number of repeat segments. Experiments on both synthetic data and real data show that this new approach is powerful in detecting dispersed short tandem repeats. As far as we know, it is the first work to adopt RJMCMC algorithms in the detection of tandem repeats.

8.
We describe here a family of foldback transposons found in the genome of the higher eucaryote, the sea urchin Strongylocentrotus purpuratus. Two major classes of TU elements have been identified by analysis of genomic DNA and TU element clones. One class consists of largely similar elements with long terminal inverted repeats (IVRs) containing outer and inner domains and sharing a common middle segment that can undergo deletions. Some of these elements contain insertions. The second class is highly heterogeneous, with many different middle segments nonhomologous to those of the first class and variable-sized inverted repeats that contain only an outer domain. The middle and insertion segments of both classes carry sequences that are also found unassociated with the inverted repeats at many other genomic locations. We conclude that the TU elements are modular structures composed of inverted repeats plus other sequence domains that are themselves members of different families of dispersed repetitive sequences. Such modular elements may have a role in the dispersion and rearrangement of genomic DNA segments.

9.
The Ac-specific ORFa protein, overexpressed in a baculovirus system, specifically binds to several subterminal fragments of Ac. The 11 bp long inverted repeats of the transposable element are not bound by the ORFa protein. Major ORFa protein-binding sites were delineated on 60 and 70 bp long sequence segments that lie 100 bp inside of the 5' Ac terminus and 40 bp inside of the 3' terminus respectively. Within all strongly bound fragments, and particularly in these 60 or 70 bp long segments, the hexamer motif AAACGG is repeated several times in direct or inverted orientation. The ORFa protein binds to synthetic concatemers of this motif, whereas the mutant motif AAAGGG is not complexed. Methylation of the cytosine residues in the AAACGG motif and/or its complementary strand has pronounced effects: whereas one of the two hemimethylated sequences has a higher affinity to the ORFa protein than both unmethylated and holomethylated DNAs, the other hemimethylated DNA is virtually not complexed at all. The native ORFa protein binding sites are more complex than the AAACGG sequence: certain Ac and Ds1 fragments devoid of AAACGG motifs (but containing several similar sequences) are weakly bound by the ORFa protein.
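To make "repeated in direct or inverted orientation" concrete, the following is a minimal sketch of scanning one strand for the hexamer AAACGG and for its reverse complement (CCGTTT). The function names are hypothetical, and this is a plain string scan, not the binding assay reported in the abstract.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def find_motif_sites(seq, motif="AAACGG"):
    """List (0-based position, orientation) for each exact motif occurrence.

    "direct" means the motif itself appears on the given strand;
    "inverted" means its reverse complement appears there instead.
    """
    inverted = reverse_complement(motif)
    sites = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        if window == motif:
            sites.append((i, "direct"))
        elif window == inverted:
            sites.append((i, "inverted"))
    return sites
```

Because only exact matches are reported, the mutant hexamer AAAGGG yields no sites, which mirrors the distinction the abstract draws between the two motifs.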

10.
MOTIVATION: Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods run into difficulties when faced with heterogeneous groups of proteins. However, even many families of proteins that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can analyze a group of proteins into the sequence domains it contains is therefore highly desirable. RESULTS: We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. A Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure for each group of segments. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. We show that regions of similar statistics correlate well with protein sequence domains, by matching a unique signature to each domain. This is done in a fully automated manner, and does not require or attempt an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences. CONTACT: jill@cs.huji.ac.il; tishby@cs.huji.ac.il.

11.
Y Cai. Journal of Bacteriology, 1991, 173(18): 5771-5777
IS892, one of the several insertion sequence (IS) elements discovered in Anabaena sp. strain PCC 7120 (Y. Cai and C. P. Wolk, J. Bacteriol. 172:3138-3145, 1990), is 1,675 bp with 24-bp near-perfect inverted terminal repeats and has two open reading frames (ORFs) that could code for proteins of 233 and 137 amino acids. Upon insertion into target sites, this IS generates an 8-bp directly repeated target duplication. A 32-bp sequence in the region between ORF1 and ORF2 is similar to the sequence of the inverted termini. Similar inverted repeats are found within each of those three segments, and the sequences of these repeats bear some similarity to the 11-bp direct repeats flanking the 11-kb insertion interrupting the nifD gene of this strain (J. W. Golden, S. J. Robinson, and R. Haselkorn, Nature [London] 314:419-423, 1985). A sequence similar to that of a binding site for the Escherichia coli integration host factor is found about 120 bp from the left end of IS892. Partial nucleotide sequences of active IS elements IS892N and IS892T, members of the IS892 family from the same Anabaena strain, were shown to be very similar to the sequence of IS892.
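What "near-perfect inverted terminal repeats" means computationally can be shown with a small sketch: compare the first bases of an element with the reverse complement of its last bases and count the mismatches. The function name is hypothetical, the default length of 24 follows the figure quoted in the abstract, and no real IS892 sequence is used here.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def terminal_repeat_mismatches(element, repeat_length=24):
    """Count mismatches between the left end and the reverse-complemented right end.

    0 indicates perfect inverted terminal repeats of the given length;
    a small positive count indicates near-perfect repeats.
    """
    left = element[:repeat_length]
    right_rc = element[-repeat_length:].translate(COMPLEMENT)[::-1]
    return sum(1 for a, b in zip(left, right_rc) if a != b)
```

For a toy element such as `"AACCGGTTAC" + "TTTTTTTT" + "GTAACCGGTT"`, calling the function with `repeat_length=10` gives a count of zero because the right end is the exact reverse complement of the left end.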

12.
Genetic diversity and relationships among 3 Sicilian horse breeds were investigated using 16 microsatellite markers and a 397-bp mitochondrial D-loop sequence. The analysis of autosomal DNA was performed on 191 horses (80 Siciliano [SIC], 61 Sanfratellano [SAN], and 50 Sicilian Oriental Purebred [SOP]). SIC and SAN breeds were notably higher in genetic variability than the SOP. Genetic distances and cluster analysis showed a close relationship between SIC and SAN breeds, as expected according to the breeds' history. Sequencing of the hypervariable mitochondrial DNA region was performed on a subset of 60 mares (20 for each breed). Overall, 20 haplotypes with 31 polymorphic sites were identified. A higher haplotype diversity was detected in SIC and SAN breeds, with 13 and 11 haplotypes respectively, whereas only one haplotype was found in SOP. These were compared with 118 sequences from GenBank. BLAST showed that 17 of the 20 haplotypes had been reported previously in other breeds. One haplotype, found in SIC, traces back to a Bronze Age archaeological site (Inner Mongolia). The 3 Sicilian breeds are now rare, and 2 of them are officially endangered. Our results represent a valuable tool for management strategies as well as for conservation purposes.

13.
Summary: We cloned and sequenced a 402 bp DNA segment containing the origin of conjugal transfer (oriT) of the IncW plasmid R388. Progressive deletions from each end of the sequence were assayed for oriT activity. Stepwise reductions in mobilization frequencies, representing the loss of functional elements, correlated with deletion of structural motifs in the sequence. A sequence of 330 bp of oriT was sufficient for efficient mobilization. The first 86 bp of the sequence contains five tandemly repeated DNA sequences of 11 bp, followed by a 10 bp perfect inverted repeat. Deletion of the first 95 bp reduced the frequency of transfer by a hundred-fold. The sequence between bp 183 and 218 was necessary and sufficient for low frequency mobilization and, thus, it was assumed to contain the nick site. This basic core was cloned as a 60 bp segment (from bp 176–236) that could be mobilized at low frequency. It includes two inverted repeats and a perfect integration host factor (IHF) consensus binding site. A third functionally important segment in oriT was located between bp 260 and 330. The DNA sequence of the oriT of R388 could be aligned with that of the broad-host-range IncN plasmid R46. Moreover, the relative positions of the three inverted repeats are also conserved. Overall sequence similarity was 52%, but was significantly higher in particular regions, which coincided with the functionally important segments mapped by deletion analysis. Conservation of these segments provided independent support for their essential role in oriT function.

14.
The H circle of Leishmania species contains a 30 kb inverted duplication separated by two unique DNA segments, a and b. The corresponding H region of chromosomal DNA has only one copy of the duplicated DNA. We show here that the chromosomal segments a and b are flanked by inverted repeats (198 and 1241 bp) and we discuss how these repeats could lead to formation of H circles from chromosomal DNA. Selection of Leishmania tarentolae for methotrexate resistance indeed resulted in the de novo formation of circles with long inverted duplications, but two mutants selected for arsenite resistance contained new H region plasmids without such duplications. One of these plasmids appears to be due to a homologous recombination between two P-glycoprotein genes with a high degree of sequence homology. Our results show how the same DNA region in Leishmania may be amplified to give plasmids with or without long inverted duplications, apparently by different mechanisms.

15.
We applied a hidden Markov model segmentation method to the human mitochondrial genome to identify patterns in the sequence, to compare these patterns to the gene structure of mtDNA and to see whether these patterns reveal additional characteristics important for our understanding of genome evolution, structure and function. Our analysis identified three segmentation categories based upon the sequence transition probabilities. Category 2 segments corresponded to the tRNA and rRNA genes, with a greater strand-symmetry in these segments. Category 1 and 3 segments covered the protein-coding genes and almost all of the non-coding D-loop. Compared to category 1, the mtDNA segments assigned to category 3 had much lower guanine abundance. A comparison to two independent databases of mitochondrial mutations and polymorphisms showed that the high substitution rate of guanine in human mtDNA is largest in the category 3 segments. Analysis of synonymous mutations showed the same pattern. This suggests that this heterogeneity in the mutation rate is partly independent of respiratory chain function and is a direct property of the genome sequence itself. This has important implications for our understanding of mtDNA evolution and its use as a ‘molecular clock’ to determine the rate of population and species divergence.
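As a rough illustration of hidden Markov model segmentation in general, the sketch below labels each base with its most likely hidden state using the Viterbi algorithm. It is a simplification under stated assumptions: each state emits bases independently (zero-order), whereas the abstract describes categories defined by sequence transition probabilities, and the emission tables and the `stay` probability are hypothetical inputs rather than values fitted to mtDNA.

```python
import math

def viterbi_segment(seq, emissions, stay=0.99):
    """Return the most likely state index for every position of seq.

    emissions: one dict per state, mapping base -> emission probability.
    stay: probability of remaining in the current state (assumed value);
          the remainder is split evenly among the other states.
    Requires a non-empty seq and at least two states.
    """
    n_states = len(emissions)
    switch = (1.0 - stay) / (n_states - 1)
    log_trans = [[math.log(stay if a == b else switch) for b in range(n_states)]
                 for a in range(n_states)]
    # Uniform initial distribution over states.
    scores = [math.log(1.0 / n_states) + math.log(e[seq[0]]) for e in emissions]
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = [], []
        for b in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda a: scores[a] + log_trans[a][b])
            pointers.append(best_prev)
            new_scores.append(scores[best_prev] + log_trans[best_prev][b]
                              + math.log(emissions[b][base]))
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path
```

Runs of identical labels in the returned path are the segments. A larger `stay` value penalizes switching more heavily and therefore yields fewer, longer segments.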

16.
The chloroplast genome of a marine centric diatom, Odontella sinensis, was cloned and sequenced. The circular genome is 119,704 bp in length (AC = Z67753). It contains an inverted repeat sequence of 7,725 bp separating two single-copy regions of 38,908 and 65,346 bp, respectively, and 174 genes and open reading frames, of which nine are duplicated within the inverted repeat segments.
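The figures in this abstract are internally consistent if the inverted repeat is counted twice, once per copy, alongside the two single-copy regions. A short check, with variable names chosen here for illustration:

```python
# Lengths in bp, as quoted in the abstract.
total_length = 119_704
inverted_repeat = 7_725
single_copy_regions = (38_908, 65_346)

# A circular genome with one inverted repeat pair: two repeat copies
# plus the two single-copy regions they separate.
reconstructed = 2 * inverted_repeat + sum(single_copy_regions)
print(reconstructed == total_length)
```

The sum is 15,450 + 104,254 = 119,704 bp, matching the reported genome length.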

17.
18.
When phage lambda lysogenizes a cell that lacks the primary bacterial attachment site, integrase catalyzes insertion of the phage chromosome into one of many secondary sites. Here, we characterize the secondary sites that are preferred by wild-type lambda and by lambda int mutants with altered insertion specificity. The sequences of these secondary sites resembled that of the primary site: they contained two imperfect inverted repeats flanking a short spacer. The imperfect inverted repeats of the primary site bind integrase, while the 7 bp spacer, or overlap region, swaps strands with a complementary sequence in the phage attachment site during recombination. We found substantial sequence conservation in the imperfect inverted repeats of secondary sites, and nearly perfect conservation in the leftmost three bases of the overlap region. By contrast, the rightmost bases of the overlap region were much more variable. A phage with an altered overlap region preferred to insert into secondary sites with the corresponding bases. We suggest that this difference between the left and right segments is a result of the defined order of strand exchanges during integrase-promoted recombination. This suggestion accounts for the unexpected segregation pattern of the overlap region observed after insertion into several secondary sites. Some of the altered specificity int mutants differed from wild-type in secondary site preference, but we were unable to identify simple sequence motifs that account for these differences. We propose that insertion into secondary sites is a step in the evolutionary change of phage insertion specificity and present a model of how this might occur.

19.
Inverted repeat sequences, capable of forming stable intra-chain foldback duplexes, are shown using electron microscopy to be located in over 90% of fragments of nuclear DNA from Physarum polycephalum. A statistical treatment of the data indicates that, on average, foldback sequence foci are spaced every 7,000 nucleotides and that they are distributed uniformly amongst the DNA chains. The majority of inverted repeat sequences give rise to the simple types of foldback structure observed in DNA from other eukaryotic species, but a significant proportion of the DNA fragments also contain novel foldback structures with a more complex appearance, referred to as 'bubbled' hairpins. The latter structures appear to be formed by the annealing of several distinct segments of homologous inverted repeat sequence, each separated by interspersed non-foldback sequences of variable sizes up to 15,000 nucleotides in length. The sizes of both the foldback duplexes and the intervening single-chain segments of DNA are not random. Instead, they appear to form a regular, arithmetic series of lengths. These observations suggest that the different segments of Physarum DNA from which foldback structures are derived contain nucleotide sequences that share a highly ordered and uniform pattern of structural organisation. These regular units of organisation in Physarum DNA in some cases extend over distances up to 50,000 nucleotides in length.

20.
The sequence organization of cloned segments of human DNA carrying unusual domains of alphoid satellite was studied by restriction mapping, electron microscopy and base sequence analysis. In some cases restriction mapping revealed the absence of the typical 340 bp EcoRI dimer, although blot hybridizations showed the extensive presence of alphoid satellite. A variant monomeric construction was demonstrated by DNA sequencing. Furthermore, inverted repeats within these domains were detected by electron microscopy. In one case these were shown to be the result of interruptions in the satellite sequence by members of a family of repetitive, conserved elements.
