Similar literature
Found 20 similar documents; search took 187 ms
1.
Selected and counterselected oligodeoxynucleotide sequences were identified in the total sequence of bacteriophage T7 DNA using a statistical criterion derived for a probability model of the Markov chain type. All extremely rare tetra- and pentadeoxynucleotides are (or contain) recognition sequences for the Escherichia coli DNA methylases dam or dcm. Most of the 37 hexadeoxynucleotides absent from T7 DNA are recognition sequences for type II modification/restriction enzymes of E. coli or related species. In contrast to most restriction sites counterselected during evolution, the EcoP1 site GGTCT occurs 126 times in the T7 genome, and phage T7 replication is severely repressed in P1-lysogenic host cells. We demonstrate that the frequency of the EcoP1 site is determined by that of the overlapping recognition sites for T7 primase, an essential phage enzyme. The recognition site of a type III enzyme, EcoP15, is also not counterselected. In T7 DNA all 36 EcoP15 sites are arranged in such a manner that the sequence CAGCAG is confined to the H strand, the complementary sequence CTGCTG to the L strand. This "strand bias" is highly significant and, therefore, very probably selected. A functional relation between this strand bias and the refractory behaviour of phage T7 to EcoP15 restriction is suspected.
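The Markov-chain criterion mentioned in this abstract can be illustrated with a small sketch. The abstract does not state the exact statistic, so the estimator below is an assumption: the expected count of a word is taken from a Markov chain of maximal order, E(w) = N(prefix) * N(suffix) / N(middle). The function names are hypothetical.

```python
from collections import Counter


def count_words(seq, k):
    """Count all overlapping words of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def markov_expected(seq, word):
    """Expected count of `word` under a Markov chain of order len(word) - 2.

    Assumed estimator (not taken from the abstract):
    N(word[:-1]) * N(word[1:]) / N(word[1:-1]).
    """
    k = len(word)
    if k < 3:
        raise ValueError("word must have at least 3 letters")
    shorter = count_words(seq, k - 1)
    prefix = shorter[word[:-1]]
    suffix = shorter[word[1:]]
    middle = count_words(seq, k - 2)[word[1:-1]]
    return prefix * suffix / middle if middle else 0.0


def observed_expected(seq, word):
    """Return (observed count, expected count) for `word` in `seq`."""
    return count_words(seq, len(word))[word], markov_expected(seq, word)
```

A word whose observed count is far below its expected count would be a candidate for counterselection; a significance test would still be needed on top of this ratio.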

2.
I show that the recognition sequences of Type II restriction systems are correlated with the G + C content of the host bacterial DNA. Almost all restriction systems with G + C rich tetranucleotide recognition sequences are found in species with A + T rich genomes, whereas G + C rich hexanucleotide and octanucleotide recognition sequences are found almost exclusively in species with G + C rich genomes. Most hexanucleotide recognition sequences found in species with A + T rich genomes are A + T rich. This distribution eliminates a substantial proportion of the potential variance in the frequency of restriction recognition sequences in the host genomes. As a consequence, almost all restriction recognition sequences, including those eight base pairs in length (Not I and Sfi I), are predicted to occur with a frequency ranging from once every 300 to once every 5,000 base pairs in the host genome. Since the G + C content of bacteriophage DNA and of the host genome are also correlated, the data presented is evidence that most Type II "restriction systems" are indeed involved in phage restriction.
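The dependence of site frequency on G + C content can be made concrete with a rough calculation. This sketch assumes the simplest possible model (independent bases, with G and C equally frequent and A and T equally frequent); it is not the author's method, and the function names are hypothetical.

```python
def site_probability(site, gc):
    """Probability of `site` at a given position under an independent-base model.

    gc is the genomic G + C fraction (0 < gc < 1).
    """
    if not 0.0 < gc < 1.0:
        raise ValueError("gc must be strictly between 0 and 1")
    base_prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}
    prob = 1.0
    for base in site.upper():
        prob *= base_prob[base]
    return prob


def expected_spacing(site, gc):
    """Expected number of base pairs between occurrences of `site`."""
    return 1.0 / site_probability(site, gc)
```

Under this model a G + C rich tetranucleotide such as GGCC is expected once every 256 bp at 50% G + C, and becomes rarer as the genome becomes more A + T rich, which is the direction of the effect the abstract describes.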

3.
Ab initio gene identification in metagenomic sequences   Total citations: 1; self-citations: 0; citations by others: 1
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. The original version of the method was proposed in 1999 and has been used since for (i) reconstructing the codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With the advent of new prokaryotic genomes en masse it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. Also, we show that as a result of application of the new method, several thousands of new genes could be added to existing annotations of several human and mouse gut metagenomes.

4.
Restriction enzymes produced by bacteria serve as a defense against invading bacteriophages, and so phages without other protection would be expected to undergo selection to eliminate recognition sites for these enzymes from their genomes. The observed frequencies of all restriction sites in the genomes of all completely sequenced DNA phages (T7, lambda, phi X174, G4, M13, f1, fd, and IKe) have been compared to expected frequencies derived from trinucleotide frequencies. Attention was focused on 6-base palindromes since they comprise the typical recognition sites for type II restriction enzymes. All of these coliphages, with the exception of lambda and G4, exhibit significant avoidance of the particular sequences that are enterobacterial restriction sites. As expected, the sequenced fraction of the genome of phi 29, a Bacillus subtilis phage, lacks Bacillus restriction sites. By contrast, the RNA phage MS2, several viruses that infect eukaryotes (EBV, adenovirus, papilloma, and SV40), and three mitochondrial genomes (human, mouse, and cow) were found not to lack restriction sites. Because the particular palindromes avoided correspond closely with the recognition sites for host enzymes and because other viruses and small genomes do not show this avoidance, it is concluded that the effect indeed results from natural selection.
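The set of 6-base palindromes this abstract focuses on is small enough to enumerate directly. The sketch below lists all 64 hexamers that equal their own reverse complement and counts occurrences of one of them; the function names are hypothetical and the counting is a plain overlapping scan, not the paper's statistical procedure.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def palindromic_hexamers():
    """All 64 hexamers that equal their own reverse complement."""
    halves = ("".join(h) for h in product("ACGT", repeat=3))
    return [half + reverse_complement(half) for half in halves]


def count_sites(seq, site):
    """Count overlapping occurrences of `site` in `seq`."""
    width = len(site)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == site)
```

Comparing `count_sites` for each palindrome with an expected count from a lower-order model would give an observed/expected table of the kind the abstract describes.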

5.
Fast algorithms for large-scale genome alignment and comparison   Total citations: 35; self-citations: 5; citations by others: 30
We describe a suffix-tree algorithm that can align the entire genome sequences of eukaryotic and prokaryotic organisms with minimal use of computer time and memory. The new system, MUMmer 2, runs three times faster while using one-third as much memory as the original MUMmer system. It has been used successfully to align the entire human and mouse genomes to each other, and to align numerous smaller eukaryotic and prokaryotic genomes. A new module permits the alignment of multiple DNA sequence fragments, which has proven valuable in the comparison of incomplete genome sequences. We also describe a method to align more distantly related genomes by detecting protein sequence homology. This extension to MUMmer aligns two genomes after translating the sequence in all six reading frames, extracts all matching protein sequences and then clusters together matches. This method has been applied to both incomplete and complete genome sequences in order to detect regions of conserved synteny, in which multiple proteins from one organism are found in the same order and orientation in another. The system code is being made freely available by the authors.

6.
In sequenced genomes of prokaryotes, anomalous DNA (aDNA) can be recognized, among others, by atypical clustering of dinucleotides. We hypothesized that atypical clustering of hexameric endonuclease recognition sites in aDNA allows the specific isolation of anomalous sequences in vitro. Clustering of endonuclease recognition sites in aDNA regions of eight published prokaryotic genome sequences was demonstrated. In silico digestion of the Neisseria meningitidis MC58 genome, using four selected endonucleases, revealed that of the 27 small fragments predicted (<5 kb), 21 were located in known genomic islands. Of the 24 calculated fragments (>300 bp and <5 kb), 22 met our criteria for aDNA, i.e. a high dinucleotide dissimilarity and/or aberrant GC content. The four enzymes also allowed the identification of aDNA fragments from the related Z2491 strain. Similarly, the sequenced genomes of three strains of Escherichia coli assessed by in silico digestion using XbaI yielded strain-specific sets of fragments of anomalous composition. In vitro applicability of the method was demonstrated by using adaptor-linked PCR, yielding the predicted fragments from the N. meningitidis MC58 genome. In conclusion, this strategy allows the selective isolation of aDNA from prokaryotic genomes by a simple restriction digest–amplification–cloning–sequencing scheme.
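In silico digestion of the kind described here can be sketched in a few lines. The example assumes a linear sequence (bacterial chromosomes are usually circular, which this toy version ignores) and uses XbaI, which recognizes TCTAGA and cuts after the first T. The function names and the size window are illustrative only.

```python
def digest(seq, site, cut_offset):
    """Cut a linear sequence at every occurrence of `site`.

    cut_offset is the position of the cut within the site,
    for example 1 for XbaI (T^CTAGA).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]


def fragments_in_range(fragments, min_len, max_len):
    """Keep fragments whose length lies strictly between min_len and max_len."""
    return [f for f in fragments if min_len < len(f) < max_len]
```

With a real genome, `fragments_in_range(digest(genome, "TCTAGA", 1), 300, 5000)` would select fragments in the >300 bp and <5 kb window mentioned in the abstract.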

7.
Studies of neutrally evolving sequences suggest that differences in eukaryotic genome sizes result from different rates of DNA loss. However, very few pseudogenes have been identified in microbial species, and the processes whereby genes and genomes deteriorate in bacteria remain largely unresolved. The typhus-causing agent, Rickettsia prowazekii, is exceptional in that as much as 24% of its 1.1-Mb genome consists of noncoding DNA and pseudogenes. To test the hypothesis that the noncoding DNA in the R. prowazekii genome represents degraded remnants of ancestral genes, we systematically examined all of the identified pseudogenes and their flanking sequences in three additional Rickettsia species. Consistent with the hypothesis, we observe sequence similarities between genes and pseudogenes in one species and intergenic DNA in another species. We show that the frequencies and average sizes of deletions are larger than those of insertions in neutrally evolving pseudogene sequences. Our results suggest that inactivated genetic material in the Rickettsia genomes deteriorates spontaneously due to a mutation bias for deletions and that the noncoding sequences represent DNA in the final stages of this degenerative process.

8.
MOTIVATION: One of the major features of genomic DNA sequences, distinguishing them from texts in most spoken or artificial languages, is their high repetitiveness. Variation in the repetitiveness of genomic texts reflects the presence and density of different biologically important messages. Thus, deviation from an expected number of repeats in both directions indicates a possible presence of a biological signal. Linguistic complexity corresponds to repetitiveness of a genomic text, and potential regulatory sites may be discovered through construction of typical patterns of complexity distribution. RESULTS: We developed software for fast calculation of linguistic sequence complexity of DNA sequences. Our program utilizes suffix trees to compute the number of subwords present in genomic sequences, thereby allowing calculation of linguistic complexity in time linear in genome size. The measure of linguistic complexity was applied to the complete genome of Haemophilus influenzae. Maps of complexity along the entire genome were obtained using sliding windows of 40, 100, and 2000 nucleotides. This approach provided an efficient way to detect simple sequence repeats in this genome. In addition, local profiles of complexity distribution around the starts of translation were constructed for 21 complete prokaryotic genomes. We hypothesize that complexity profiles correspond to evolutionary relationships between organisms. We found principal differences in profiles of the GC-rich and other (non-GC-rich) genomes. We also found characteristic differences in profiles of AT genomes, which probably reflect individual species variations in translational regulation. AVAILABILITY: The program is available upon request from Alexander Bolshoy or at http://csweb.haifa.ac.il/library/#complex.
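The complexity measure can be illustrated with a naive version. This sketch assumes the definition "number of distinct subwords divided by the maximum possible number for a sequence of that length", and it enumerates subwords with Python sets in quadratic time rather than with the suffix trees the authors use. Function names are hypothetical.

```python
def linguistic_complexity(seq, alphabet_size=4):
    """Ratio of distinct subwords in seq to the maximum possible number.

    Naive O(n^2) enumeration; fine for short windows only.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("sequence must not be empty")
    observed = 0
    maximum = 0
    for k in range(1, n + 1):
        # Distinct words of length k actually present in the sequence.
        observed += len({seq[i:i + k] for i in range(n - k + 1)})
        # At most alphabet_size**k different words, and at most n-k+1 positions.
        maximum += min(alphabet_size ** k, n - k + 1)
    return observed / maximum


def complexity_profile(seq, window, step=1):
    """Complexity in sliding windows of the given size along seq."""
    return [
        linguistic_complexity(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, step)
    ]
```

Low values in the profile mark repetitive stretches, which is how such a map can flag simple sequence repeats.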

9.
10.
Pairwise local sequence alignment methods have been the prevailing technique to identify homologous nucleotides between related species. However, existing methods that identify and align all homologous nucleotides in one or more genomes have suffered from poor scalability and limited accuracy. We propose a novel method that couples a gapped extension heuristic with an efficient filtration method for identifying interspersed repeats in genome sequences. During gapped extension, we use the MUSCLE implementation of progressive global multiple alignment with iterative refinement. The resulting gapped extensions potentially contain alignments of unrelated sequence. We detect and remove such undesirable alignments using a hidden Markov model (HMM) to predict the posterior probability of homology. The HMM emission frequencies for nucleotide substitutions can be derived from any time-reversible nucleotide substitution matrix. We evaluate the performance of our method and previous approaches on a hybrid data set of real genomic DNA with simulated interspersed repeats. Our method outperforms a related method in terms of sensitivity, positive predictive value, and localizing boundaries of homology. The described methods have been implemented in freely available software, Repeatoire, available from: http://wwwabi.snv.jussieu.fr/public/Repeatoire.

11.
In this paper, we review developments in probabilistic methods of gene recognition in prokaryotic genomes with an emphasis on connections to the general theory of hidden Markov models (HMM). We show that the Bayesian method implemented in GeneMark, a frequently used gene-finding tool, can be augmented and reintroduced as a rigorous forward-backward (FB) algorithm for local posterior decoding described in the HMM theory. Another earlier developed method, prokaryotic GeneMark.hmm, uses a modification of the Viterbi algorithm for HMM with duration to identify the most likely global path through hidden functional states given the DNA sequence. GeneMark and GeneMark.hmm programs are worth using in concert for analysing prokaryotic DNA sequences that arguably do not follow any exact mathematical model. The new extension of GeneMark using the FB algorithm was implemented in the software program GeneMark.fba. Given the DNA sequence, this program determines an a posteriori probability for each nucleotide to belong to a coding or non-coding region. Also, for any open reading frame (ORF), it assigns a score defined as a probabilistic measure of all paths through hidden states that traverse the ORF as a coding region. The prediction accuracy of GeneMark.fba determined in our tests compared favourably to the accuracy of the initial (standard) GeneMark program. Comparison to the prokaryotic GeneMark.hmm has also demonstrated a certain, yet species-specific, degree of improvement in raw gene detection, i.e. detection of the correct reading frame (and stop codon). Exact gene prediction, which concerns precise prediction of the gene start (which in a prokaryotic genome unambiguously defines the reading frame and stop codon, thus, the whole protein product), still remains more accurate in GeneMarkS, which uses a more elaborate HMM to specifically address this task.
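Posterior decoding with the forward-backward algorithm can be shown on a toy model. The sketch below is a generic two-state HMM with made-up parameters (the `TOY_MODEL` values and all names are hypothetical); it is not GeneMark.fba's model, which works with far richer coding and non-coding states.

```python
# Hypothetical toy parameters: a GC-rich "coding" state and an AT-rich "noncoding" state.
TOY_MODEL = {
    "states": ("coding", "noncoding"),
    "start": {"coding": 0.5, "noncoding": 0.5},
    "trans": {
        "coding": {"coding": 0.9, "noncoding": 0.1},
        "noncoding": {"coding": 0.1, "noncoding": 0.9},
    },
    "emit": {
        "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
        "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    },
}


def forward_backward(obs, states, start, trans, emit):
    """Return, for each position, the posterior probability of each state."""
    n = len(obs)
    fwd, scales = [], []
    prev = None
    for t, symbol in enumerate(obs):
        cur = {}
        for s in states:
            prior = start[s] if t == 0 else sum(prev[r] * trans[r][s] for r in states)
            cur[s] = prior * emit[s][symbol]
        scale = sum(cur.values())  # rescale to avoid numerical underflow
        cur = {s: v / scale for s, v in cur.items()}
        fwd.append(cur)
        scales.append(scale)
        prev = cur
    bwd = [None] * n
    bwd[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        bwd[t] = {
            s: sum(trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in states)
            / scales[t + 1]
            for s in states
        }
    posteriors = []
    for t in range(n):
        raw = {s: fwd[t][s] * bwd[t][s] for s in states}
        total = sum(raw.values())
        posteriors.append({s: v / total for s, v in raw.items()})
    return posteriors
```

Calling `forward_backward("GCGCGCATATAT", **TOY_MODEL)` gives a per-nucleotide probability of being in the "coding" state, the local quantity that posterior decoding reports, in contrast to the single best path returned by the Viterbi algorithm.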

12.
Whole genome shotgun sequence analysis has become the standard method for beginning to determine a genome sequence. The preparation of the shotgun sequence clones is, in fact, a biological experiment. It determines which segments of the genome can be cloned into Escherichia coli and which cannot. By analyzing the complete set of sequences from such an experiment, it is possible to identify genes lethal to E. coli. Among this set are genes encoding restriction enzymes which, when active in E. coli, lead to cell death by cleaving the E. coli genome at the restriction enzyme recognition sites. By analyzing shotgun sequence data sets we show that this is a reliable method to detect active restriction enzyme genes in newly sequenced genomes, thereby facilitating functional annotation. Active restriction enzyme genes have been identified, and their activity demonstrated biochemically, in the sequenced genomes of Methanocaldococcus jannaschii, Bacillus cereus ATCC 10987 and Methylococcus capsulatus.

13.
Statistical analysis of nucleotide sequences.   Total citations: 5; self-citations: 4; citations by others: 1
In order to scan nucleic acid databases for potentially relevant but as yet unknown signals, we have developed an improved statistical model for pattern analysis of nucleic acid sequences by modifying previous methods based on Markov chains. We demonstrate the importance of selecting the appropriate parameters in order for the method to function at all. The model allows the simultaneous analysis of several short sequences with unequal base frequencies and Markov order k not equal to 0, as is usually the case in databases. As a test of these modifications, we show that in E. coli sequences there is a bias against palindromic hexamers which correspond to known restriction enzyme recognition sites.

14.
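Type IIB restriction endonucleases recognize specific sites and cleave the DNA at fixed distances on both sides of the site, producing DNA fragments of uniform length (about 30 bp) with cohesive ends. Combining this property with restriction-site associated DNA sequencing (RAD) gave rise to 2b-RAD, a reduced-representation genome sequencing technique that has been applied to genetic map construction, population genetic structure analysis, trait mapping, bacterial typing, and other research areas. Before a 2b-RAD sequencing library is constructed, the type IIB restriction sites in the genome need to be predicted and statistically analysed so that an effective library construction plan can be drawn up. In this work, a pipeline written in Python was built to analyse type IIB restriction sites in genomes. It was used to predict and count the sites of 8 commercially available type IIB restriction endonucleases in the genomes of 6 representative lepidopteran species, and to compare, across genomes and enzymes, the relationships among the total number of restriction sites, the number of repetitive sequences, and the spacing between sites, providing a reference for further trials of 2b-RAD in insect genomes.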
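The site-prediction step of such a pipeline might look like the following sketch. It is not the authors' code. It converts a degenerate recognition sequence to a regular expression and scans both strands; BcgI's commonly cited recognition sequence CGANNNNNNTGC is used for illustration and should be checked against the enzyme supplier or REBASE. All function names are hypothetical.

```python
import re

# IUPAC codes used by the example; extend as needed.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "N": "[ACGT]",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]", "M": "[AC]", "K": "[GT]",
}
COMPLEMENT = str.maketrans("ACGTNRYWSMK", "TGCANYRWSKM")


def reverse_complement(site):
    """Reverse complement of a (possibly degenerate) recognition sequence."""
    return site.translate(COMPLEMENT)[::-1]


def to_regex(site):
    """Compile a lookahead pattern so that overlapping sites are all reported."""
    return re.compile("(?=(" + "".join(IUPAC[b] for b in site) + "))")


def find_sites(genome, site):
    """Sorted 0-based start positions of `site` on either strand of `genome`."""
    hits = set()
    for pattern in {site, reverse_complement(site)}:
        hits.update(m.start() for m in to_regex(pattern).finditer(genome))
    return sorted(hits)


def site_spacings(positions):
    """Distances between consecutive site starts."""
    return [b - a for a, b in zip(positions, positions[1:])]
```

The length of `find_sites(...)` gives the total site count for an enzyme, and `site_spacings` gives the interval distribution, two of the quantities the abstract compares across genomes.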

15.
A Markov analysis of DNA sequences   Total citations: 12; self-citations: 0; citations by others: 12
We present a model by which we look at the DNA sequence as a Markov process. It has been suggested by several workers that some basic biological or chemical features of nucleic acids stand behind the frequencies of dinucleotides (doublets) in these chains. Comparing patterns of doublet frequencies in DNA of different organisms was shown to be a fruitful approach to some phylogenetic questions (Russel & Subak-Sharpe, 1977). Grantham (1978) formulated mRNA sequence indices, some of which involve certain doublet frequencies. He suggested that using these indices may provide indications of the molecular constraints existing during gene evolution. Nussinov (1981) has shown that a set of dinucleotide preference rules holds consistently for eukaryotes, and suggested a strong correlation between these rules and degenerate codon usage. Gruenbaum, Cedar & Razin (1982) found that methylation in eukaryotic DNA occurs exclusively at C-G sites. Important biological information thus seems to be contained in the doublet frequencies. One of the basic questions to be asked (the "correlation question") is to what extent are the 64 trinucleotide (triplet) frequencies measured in a sequence determined by the 16 doublet frequencies in the same sequence. The DNA is described here as a Markov process, with the nucleotides being outcomes of a sequence generator. Answering the correlation question mentioned above means finding the order of the Markov process. The difficulty is that natural sequences are of finite length, and statistical noise is quite strong. We show that even for a 16000 nucleotide long sequence (like that of the human mitochondrial genome) the finite length effect cannot be neglected. Using the Markov chain model, the correlation between doublet and triplet frequencies can, however, be determined even for finite sequences, taking proper account of the finite length. 
Two natural DNA sequences, the human mitochondrial genome and the SV40 DNA, are analysed as examples of the method.

16.
M Rosenberg, S Segal, E L Kuff, M F Singer. Cell 1977, 11(4): 845-857
DNA fragments containing monkey DNA sequences have been isolated from defective SV40 genomes that carry host sequences in place of portions of the SV40 genome. The fragments were isolated by restriction endonuclease cleavage and contain segments homologous to sequences in both the highly repetitive and unique (or less repetitive) classes of monkey DNA. The complete nucleotide sequence of one such fragment [151 base pairs (bp)] predominantly homologous to the highly reiterated class of monkey DNA was determined using both RNA and DNA sequencing methods. The nucleotide sequence of this homogeneous DNA segment does not contain discernible multiple internal repeating units but only a few short oligonucleotide repeats. The reiteration frequency of the sequence in the monkey genome is >10^6. Digestion of total monkey DNA (from uninfected cells) with endonuclease R Hind III produces relatively large amounts of discrete DNA fragments that contain extensive regions homologous to the fragment isolated from the defective SV40 DNA. A second fragment, also containing monkey sequences, was isolated from the same defective substituted SV40 genome. The nucleotide sequence of the 33 bp of this second fragment that are contiguous to the 151 bp fragment has also been determined. The sequences in both fragments are also present in other, independently derived, defective substituted SV40 genomes.

17.
The amino acid sequence of mammalian DNA methyltransferase has been deduced from the nucleotide sequence of a cloned cDNA. It appears that the mammalian enzyme arose during evolution via fusion of a prokaryotic restriction methyltransferase gene and a second gene of unknown function. Mammalian DNA methyltransferase currently comprises an N-terminal domain of about 1000 amino acids that may have a regulatory role and a C-terminal 570 amino acid domain that retains similarities to bacterial restriction methyltransferases. The sequence similarities among mammalian and bacterial DNA cytosine methyltransferases suggest a common evolutionary origin. DNA methylation is uncommon among those eukaryotes having genomes of less than 10^8 base pairs, but nearly universal among large-genome eukaryotes. This and other considerations make it likely that sequence inactivation by DNA methylation has evolved to compensate for the expansion of the genome that has accompanied the development of higher plants and animals. As methylated sequences are usually propagated in the repressed, nuclease-insensitive state, it is likely that DNA methylation compartmentalizes the genome to facilitate gene regulation by reducing the total amount of DNA sequence that must be scanned by DNA-binding regulatory proteins. DNA methylation is involved in immune recognition in bacteria but appears to regulate the structure and expression of the genome in complex higher eukaryotes. I suggest that the DNA-methylating system of mammals was derived from that of bacteria by way of a hypothetical intermediate that carried out selective de novo methylation of exogenous DNA and propagated the methylated DNA in the repressed state within its own genome. (ABSTRACT TRUNCATED AT 250 WORDS)

18.

Background

The increasing number of sequenced prokaryotic genomes contains a wealth of genomic data that needs to be effectively analysed. A set of statistical tools exists for such analysis, but their strengths and weaknesses have not been fully explored. The statistical methods we are concerned with here are mainly used to examine similarities between archaeal and bacterial DNA from different genomes. These methods compare observed genomic frequencies of fixed-sized oligonucleotides with expected values, which can be determined by genomic nucleotide content, smaller oligonucleotide frequencies, or be based on specific statistical distributions. Advantages of these statistical methods include measurements of phylogenetic relationship with relatively small pieces of DNA sampled from almost anywhere within genomes, detection of foreign/conserved DNA, and homology searches. Our aim was to explore the reliability and best-suited applications for some popular methods, which include relative oligonucleotide frequencies (ROF), di- to hexanucleotide zeroth-order Markov methods (ZOM), and the second-order Markov chain method (MCM). Tests were performed on distant homology searches with large DNA sequences, detection of foreign/conserved DNA, and plasmid-host similarity comparisons. Additionally, the reliability of the methods was tested by comparing both real and random genomic DNA.

Results

Our findings show that the optimal method is context dependent. ROFs were best suited for distant homology searches, whilst the hexanucleotide ZOM and MCM measures were more reliable measures in terms of phylogeny. The dinucleotide ZOM method produced high correlation values when used to compare real genomes to an artificially constructed random genome with similar %GC, and should therefore be used with care. The tetranucleotide ZOM measure was a good measure to detect horizontally transferred regions, and when used to compare the phylogenetic relationships between plasmids and hosts, significant correlation (R^2 = 0.4) was found with genomic GC content and intra-chromosomal homogeneity.

Conclusion

The statistical methods examined are fast, easy to implement, and powerful for a number of different applications involving genomic sequence comparisons. However, none of the measures examined were superior in all tests, and therefore the choice of the statistical method should depend on the task at hand.
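A zeroth-order Markov comparison of the kind evaluated in this entry can be sketched as follows. The sketch assumes that the expected frequency of an oligonucleotide is the product of its single-base frequencies and that two sequences are compared by Pearson correlation of their observed/expected profiles; the paper's exact formulas may differ, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product


def zom_profile(seq, k=2):
    """Observed/expected ratio for every k-mer, expected from base composition."""
    n = len(seq)
    base_freq = {b: seq.count(b) / n for b in "ACGT"}
    total = n - k + 1
    counts = Counter(seq[i:i + k] for i in range(total))
    profile = []
    for word in map("".join, product("ACGT", repeat=k)):
        expected = math.prod(base_freq[b] for b in word)
        observed = counts[word] / total
        profile.append(observed / expected if expected else 0.0)
    return profile


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Correlating `zom_profile(genome_a)` with `zom_profile(genome_b)` gives a single similarity score. Running the same comparison against a shuffled sequence with matching %GC is one way to check how much of a high score is a composition effect, the concern the Results raise for the dinucleotide measure.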

19.
A new method to improve the efficiency of flanking sequence identification by genome walking was developed based on an expanded, sequential list of criteria for selecting candidate enzymes, plus several other optimization steps. These criteria include: step (1) initially choosing the most appropriate restriction enzyme according to the average fragment size produced by each enzyme determined using in silico digestion of genomic DNA, step (2) evaluating the in silico frequency of fragment size distribution between individual chromosomes, step (3) selecting those enzymes that generate fragments with the majority between 100 bp and 3,000 bp, step (4) weighing the advantages and disadvantages of blunt-end sites vs. cohesive-end sites, step (5) elimination of methylation sensitive enzymes with methylation-insensitive isoschizomers, and step (6) elimination of enzymes with recognition sites within the binary vector sequence (T-DNA and plasmid backbone). Step (7) includes the selection of a second restriction enzyme with highest number of recognition sites within regions not covered by the first restriction enzyme. Step (8) considers primer and adapter sequence optimization, selecting the best adapter-primer pairs according to their hairpin/dimers and secondary structure. In step (9), the efficiency of genomic library development was improved by column-filtration of digested DNA to remove restriction enzyme and phosphatase enzyme, and most important, to remove small genomic fragments (<100 bp) lacking the T-DNA insertion, hence improving the chance of ligation between adapters and fragments harbouring a T-DNA. Two enzymes, NsiI and NdeI, fit these criteria for the Arabidopsis thaliana genome. Their efficiency was assessed using 54 T3 lines from an Arabidopsis SK enhancer population. Over 70% success rate was achieved in amplifying the flanking sequences of these lines.
This strategy was also tested with Brachypodium distachyon to demonstrate its applicability to other larger genomes.

20.
Eucaryotic transposable genetic elements with inverted terminal repeats   Total citations: 22; self-citations: 0; citations by others: 22
S Potter, M Truett, M Phillips, A Maher. Cell 1980, 20(3): 639-647
DNA carrying inverted repeats was tested for transposition within the Drosophila genome. Five Bam HI segments containing related inverted repeats were isolated from D. melanogaster and analyzed by electron microscopy and restriction mapping. Southern blot experiments using single-copy flanking sequences as probes allowed the study of DNA arrangements at specific sites in the genomes of five closely related strains. We found that in some genomes the sequences with inverted repeats were present at a particular site, whereas in other genomes they were absent from this site. These results indicated that three of the sequences are transposable genetic elements. In one case we have purified the two corresponding DNA segments, with and without the sequence containing inverted repeats, thereby confirming the mobility of this sequence. These DNA elements were found to be distinct in two ways from copia and others previously described: first, they contain inverted terminal repeats, and second, they have a more heterogeneous construction.
