Similar literature
1.
The nucleotide sequence of the bacteriophage φX174 contains surprisingly few restriction endonuclease recognition sites. The observed frequency of those sites which consist of a six-nucleotide palindromic sequence would occur by chance with a probability of less than 8 × 10^-5. The genome of φX174 does not contain the four-nucleotide palindromic recognition site for the enzyme MboI, and this finding has a probability of only 1.7 × 10^-9. A further analysis of the nucleotide sequence revealed that there is a marked scarcity of palindromic sequences with a length of four or six nucleotides, while all other palindromes occur with a frequency close to that dictated by chance. A preliminary analysis of the genomes of other DNA viruses indicated that this palindrome avoidance is associated with single-stranded viruses only. The reasons for the paucity of these short even-numbered palindromes remain to be determined.
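
The kind of count this abstract relies on can be sketched as follows. This is a minimal illustration, not the authors' procedure: it assumes a "palindrome" means a site equal to its own reverse complement, and it estimates the chance expectation from independent bases with the observed composition. The function names are hypothetical.

```python
from collections import Counter

# Translation table for complementing A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_palindromic_sites(seq: str, k: int) -> Counter:
    """Count every length-k window that equals its own reverse complement."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        site = seq[i:i + k]
        if site == reverse_complement(site):
            counts[site] += 1
    return counts


def expected_count(site: str, seq: str) -> float:
    """Expected occurrences of `site`, assuming independent bases
    drawn with the base composition observed in `seq` (an assumption)."""
    n = len(seq)
    freq = {base: seq.count(base) / n for base in "ACGT"}
    p = 1.0
    for base in site:
        p *= freq[base]
    return p * (n - len(site) + 1)
```

Comparing `count_palindromic_sites(genome, 4)["GATC"]` with `expected_count("GATC", genome)` gives a rough sense of whether the MboI site is under-represented; a proper significance test would need the sampling distribution, which this sketch does not provide.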

2.
3.
This paper analyzes the nucleotide sequences of three viruses: Kunjin, west Nile, and yellow fever. Each virus has one long open reading frame of greater than 10,200 nucleotides that codes for four structural and seven nonstructural genes. The Kunjin and west Nile viruses are the most closely related pair, when assessed on the basis of matches between their nucleotide sequences. As would be expected, the matching is least for bases at third-position codon sites and is greatest for second-position sites. Statistics are presented for the numbers of mismatches that are transitions or transversions. Nucleotide base usage is also reported. To each of the 33 virus-gene segments, nonhomogeneous Markov chain models have been fitted to describe the sequences of nucleotide bases. The models allow for different transition probabilities ("transition" is used in the mathematical sense here) and for different degrees of dependency, at the three sites in the codons. Reasonably satisfactory fits can be obtained for many of the genes by using models that are first order for both first- and second-position sites in the codon but that are second order for third-position sites. One consequence of such a model is that the correlation between one amino acid and the next is limited to the correlation of the last base of the former with the first base of the latter. Other consequences are that the model can (and does) prohibit the occurrence of stop codons within a gene and that subsequences of only first-position bases, or only third-position bases, are also first-order Markov chains. In theory, second-position subsequences may not be Markov chains at all. In practice, the data suggest that each of these subsequences is effectively a zero-order Markov chain, i.e., bases spaced three apart are statistically independent. Stationarity of nucleotide base distributions can be interpreted in either of two ways: (1) spatially along the sites or (2) temporally at each site. 
These interpretations must often be inconsistent, when the former allows for Markov dependence between adjacent sites whereas the latter assumes independence between sites. The inconsistency can be overcome, for these viruses, if subsequences at different codon positions are analyzed separately.
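
A nonhomogeneous Markov chain of the kind described above can be illustrated with a small estimator that keeps a separate transition table for each codon position. This is a simplified sketch: it is first order at all three positions (the abstract uses second order at third-position sites), it assumes the sequence starts in frame, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def site_specific_transitions(seq: str) -> list:
    """Estimate P(base_i | base_{i-1}) separately for each codon position.

    Index 0, 1, 2 of the returned list holds the table used when the
    *destination* base sits at the first, second, or third codon site.
    Assumes `seq` begins at the first base of a codon.
    """
    counts = [defaultdict(Counter) for _ in range(3)]
    for i in range(1, len(seq)):
        counts[i % 3][seq[i - 1]][seq[i]] += 1

    tables = []
    for position_counts in counts:
        table = {}
        for prev_base, row in position_counts.items():
            total = sum(row.values())
            # Maximum-likelihood estimate: relative frequency of each successor.
            table[prev_base] = {base: n / total for base, n in row.items()}
        tables.append(table)
    return tables
```

Analyzing the codon positions separately, as the abstract suggests, amounts to applying an ordinary estimator to `seq[0::3]`, `seq[1::3]`, and `seq[2::3]` in turn.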

4.
MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. The calculation of a Z-score for the comparison gives a length and composition corrected measure of the similarity between the sequences. However, the Z-score alone, does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. RESULTS: The relationship between Z-score and probability varied little over the matrix/gap penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally-aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally-aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give similar values to SSEARCH and BLAST, providing confidence in the significance estimation. 
AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. CONTACT: geoff@ebi.ac.uk

5.
W Saurin, P Marlière. Biochimie, 1985, 67(5): 517-521
A set of sequences can be defined by their common subsequences, and the length of these is a measure of the overall resemblance of the set. Each subsequence corresponds to a succession of symbols embedded in every sequence, following the same order but not necessarily contiguous. Determining the longest common subsequence (LCS) requires the exhaustive testing of all possible common subsequences, which sum up to about 2^L, if L is the length of the shortest sequence. We present a polynomial algorithm (O(n × L^4), where n is the number of sequences) for generating strings related to the LCS and constructed with the sequence alphabet and an indetermination symbol. Such strings are iteratively improved by deleting indetermination symbols and concomitantly introducing the greatest number of alphabet symbols. Processed accordingly, nucleic acid and protein sequences lead to key-words encompassing the salient positions of homologous chains, which can be used for aligning or classifying them, as well as for finding related sequences in data banks.
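
For orientation, the LCS of just two sequences can be found with the textbook dynamic-programming table shown below. This is the standard pairwise method, swapped in for illustration; it is not the O(n × L^4) multi-sequence algorithm with indetermination symbols that the abstract describes.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of `a` and `b`.

    Standard O(len(a) * len(b)) dynamic programming for a pair of
    sequences; symbols keep their order but need not be contiguous.
    """
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Trace back from the bottom-right corner to recover the symbols.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))
```

Extending this table to n sequences makes the cost grow exponentially in n, which is one reason heuristic or approximate approaches such as the one in the abstract are attractive for sets of sequences.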

6.
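In previous work we isolated and purified a fertility-related glycoprotein from human sperm and named it BS-17. Here, using its polyclonal antiserum, we cloned a cDNA fragment encoding BS-17 from a human testis λgt11 cDNA expression library. Sequence analysis showed that the BS-17 cDNA fragment is 791 bp long, with an open reading frame of 558 bp that can encode 186 amino acids. A database search showed that the cDNA fragment shares 99.7% homology with the 3' end of the gene for human calpastatin (an inhibitor of calpain, the Ca2+-dependent cysteine protease) and 99.5% homology with the carboxyl terminus of the calpastatin protein. In situ tissue hybridization with cRNA showed that the BS-17 gene is expressed at the post-meiotic haploid spermatid stage of human spermatogenesis.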

7.
Overlapping subsequences in a DNA sequence are not independent even if independence is supposed for the single nucleotides. Therefore the often used geometric distribution for the length of restriction fragments is not exact. The exact distribution of this random variable is derived for non-overlapping restriction sites in a DNA sequence with an infinite (or very large) number of nucleotides. Correction to the finite case is easy. It is shown that the simple geometric distribution is a good approximation as long as the basic probability for the occurrence of the recognition sequence at a given site is small.
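
The geometric approximation mentioned above can be written down in a few lines. The sketch below covers only that approximation, under the assumption that each position independently starts a recognition site with probability p; it does not reproduce the exact distribution derived in the paper. Function names and the default uniform base frequencies are assumptions.

```python
def site_probability(site: str, base_freq=None) -> float:
    """Probability that a recognition sequence starts at a given position,
    assuming independent bases (uniform frequencies by default)."""
    if base_freq is None:
        base_freq = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    p = 1.0
    for base in site:
        p *= base_freq[base]
    return p


def geometric_pmf(length: int, p: float) -> float:
    """P(fragment length = length) under the geometric approximation."""
    return p * (1.0 - p) ** (length - 1)


def mean_fragment_length(p: float) -> float:
    """Mean of the geometric distribution, 1 / p."""
    return 1.0 / p
```

For a six-base site such as GAATTC with uniform base frequencies, `site_probability` gives 4^-6, so the approximate mean fragment length is 4096 nucleotides. Because p is small here, this is the regime in which the abstract reports the approximation to be good.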

8.
The temperature dependence of the T4 DNA ligase-catalyzed joining of plasmid DNA linearized by the action of HaeIII, EcoRI and PstI restriction endonucleases has been investigated by electron microscopy analysis. The extent of joining is maximal at 4 degrees and decreases with increasing temperature, following sigmoid-like curves. The temperature at which 50% of the maximal reaction is still observable increases going from DNA termini without single-stranded overlaps (produced by HaeIII), to termini with a four-nucleotide overlap composed only of two A and two T residues (produced by EcoRI), to termini with a four-nucleotide overlap composed of A, T, G and C (produced by PstI).

9.
The complete nucleotide sequence of the genome of the circular single-stranded DNA (isometric) phage alpha 3 has been determined and compared with that of the related phages phi X174 and G4. The alpha 3 genome consists of 6087 nucleotides, which is 701 nucleotides longer than the nucleotide sequence of the phi X174 genome and 510 nucleotides longer than that of the G4 genome. The results demonstrated that the three phage species have 11 homologous genes (A, A*, B, C, K, D, E, J, F, G and H), the order of which is fundamentally identical, suggesting that they have evolved from a common ancestor. The sequence of some genes and untranslated intergenic regions, however, differs significantly from phage to phage: for example, the degree of amino acid sequence homology of the gene products averages 47.7% between alpha 3 and phi X174 and 46.9% between alpha 3 and G4, and alpha 3 has a remarkably longer intergenic region, composed of 758 nucleotides, between the genes H and A compared with the counterparts of phi X174 and G4. Meanwhile, in vivo genetic complementation experiments showed that alpha 3 can use none of the gene products of phi X174 and G4, whereas the related phage phi K can rescue alpha 3 nonsense mutants of the genes B, C, D and J. These sequencing and in vivo rescue results indicated that alpha 3 is closely related to phi K but only distantly related to phi X174 or G4, and supported the previously proposed evolutionary hypothesis that the isometric phages are classified into three main groups, whose generic representatives are phi X174, G4 and alpha 3.

10.
Diversity of T cell receptor (TCR) genes is primarily generated by nucleotide insertions upon rearrangement from their germ line-encoded V, D and J segments. Nucleotide insertions at V-D and D-J junctions are random, but some small subsets of these insertions are exceptional, in that one to three base pairs inversely repeat the sequence of the germline DNA. These short complementary palindromic sequences are called P nucleotides. We apply the ImmunoSeq deep-sequencing assay to the third complementarity determining region (CDR3) of the β chain of T cell receptors, and use the resulting data to study P nucleotides in the repertoire of naïve and memory CD8+ and CD4+ T cells. We estimate P nucleotide distributions in a cross section of healthy adults and different T cell subtypes. We show that P nucleotide frequency in all T cell subtypes ranges from 1% to 2%, and that the distribution is highly biased with respect to the coding end of the gene segment. Classification of observed palindromic sequences into P nucleotides using a maximum conditional probability model shows that single base P nucleotides are very rare in VDJ recombination; P nucleotides are primarily two bases long. To explore the role of P nucleotides in thymic selection, we compare P nucleotides in productive and non-productive sequences of CD8+ naïve T cells. The naïve CD8+ T cell clones with P nucleotides are more highly expanded.

11.
We investigate a statistical model for multidimensional epistasis. The genotype is divided into subsequences, and within each subsequence mutations which occur in a prescribed order are beneficial. The bit-string model used to represent the genotype may be cast in the form of a ferromagnetic Ising model with a staggered field. We describe the actual correlations between mutations at different sites, within an equilibrium population at a given tolerance, which we define to be the temperature of the statistical ensemble.

12.
The 3-base periodicity, identified as a pronounced peak at the frequency N/3 (N is the length of the DNA sequence) of the Fourier power spectrum of protein coding regions, is used as a marker in gene-finding algorithms to distinguish protein coding regions (exons) and noncoding regions (introns) of genomes. In this paper, we reveal the explanation of this phenomenon which results from a nonuniform distribution of nucleotides in the three coding positions. There is a linear correlation between the nucleotide distributions in the three codon positions and the power spectrum at the frequency N/3. Furthermore, this study indicates the relationship between the length of a DNA sequence and the variance of nucleotide distributions and the average Fourier power spectrum, which is the noise signal in gene-finding methods. The results presented in this paper provide an efficient way to compute the Fourier power spectrum at N/3 and the noise signal in gene-finding methods by calculating the nucleotide distributions in the three codon positions.
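
The link between codon-position counts and the N/3 peak can be checked numerically. The sketch below assumes the common convention of four binary indicator sequences (one per base) whose power spectra are summed; the paper's exact normalization may differ, and the function names are hypothetical. Both functions should return the same value, because at frequency N/3 the Fourier phase depends only on the position modulo 3.

```python
import cmath


def power_at_one_third_direct(seq: str) -> float:
    """Sum over bases of |DFT of the indicator sequence|^2 at frequency N/3,
    computed term by term from the definition."""
    total = 0.0
    for base in "ACGT":
        s = sum(cmath.exp(-2j * cmath.pi * i / 3)
                for i, ch in enumerate(seq) if ch == base)
        total += abs(s) ** 2
    return total


def power_at_one_third_from_counts(seq: str) -> float:
    """Same quantity computed from the counts of each base at the three
    codon positions, which needs only 12 numbers per sequence."""
    phases = [cmath.exp(-2j * cmath.pi * j / 3) for j in range(3)]
    total = 0.0
    for base in "ACGT":
        counts = [seq[j::3].count(base) for j in range(3)]
        total += abs(sum(c * w for c, w in zip(counts, phases))) ** 2
    return total
```

If a base occurs x, y, and z times at the three positions, its contribution simplifies to x² + y² + z² − xy − yz − zx, which vanishes when the base is evenly spread over the positions and grows as the distribution becomes nonuniform.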

13.
Cytological evidence indicates that genetic interference can be partitioned into two empirical components: nonrandomness in the number of chiasmata that occur and nonrandomness in the locations of multiple chiasmata. Previous studies have incorporated the first effect into genetic models for analyzing multipoint data. An extension to this approach is presented which allows for the second component of interference by modeling the probability density function of the locations of multiple crossovers. Results of reanalyses of multilocus data for the Drosophila X chromosome show that models that incorporate only the first effect give a better fit to these data than do standard mapping functions and that the extended model significantly improved the fit by decreasing the predicted frequency of multiple crossovers in nearby regions. Our results demonstrate that chiasma-based models of multilocus recombination, which are unique in incorporating direct estimates of the frequency of multiple crossovers for a chromosome region, can provide a powerful and realistic means of accounting for genetic interference when applied to the problems of gene localization, locus ordering, and exclusion mapping.

14.
One of the major goals of comparative genomics is to understand the evolutionary history of each nucleotide in the human genome sequence, and the degree to which it is under selective pressure. Ascertainment of selective constraint at nucleotide resolution is particularly important for predicting the functional significance of human genetic variation and for analyzing the sequence substructure of cis-regulatory sequences and other functional elements. Current methods for analysis of sequence conservation are focused on delineation of conserved regions comprising tens or even hundreds of consecutive nucleotides. We therefore developed a novel computational approach designed specifically for scoring evolutionary conservation at individual base-pair resolution. Our approach estimates the rate at which each nucleotide position is evolving, computes the probability of neutrality given this rate estimate, and summarizes the result in a Sequence CONservation Evaluation (SCONE) score. We computed SCONE scores in a continuous fashion across 1% of the human genome for which high-quality sequence information from up to 23 genomes are available. We show that SCONE scores are clearly correlated with the allele frequency of human polymorphisms in both coding and noncoding regions. We find that the majority of noncoding conserved nucleotides lie outside of longer conserved elements predicted by other conservation analyses, and are experiencing ongoing selection in modern humans as evident from the allele frequency spectrum of human polymorphism. We also applied SCONE to analyze the distribution of conserved nucleotides within functional regions. These regions are markedly enriched in individually conserved positions and short (<15 bp) conserved “chunks.” Our results collectively suggest that the majority of functionally important noncoding conserved positions are highly fragmented and reside outside of canonically defined long conserved noncoding sequences. 
A small subset of these fragmented positions may be identified with high confidence.

15.
Patchwork structure of a bovine satellite DNA
M Pech, R E Streeck, H G Zachau. Cell, 1979, 18(3): 883-893
According to a previous restriction nuclease analysis, bovine 1.706 satellite DNA (density 1.706 g/cm^3 in CsCl) is organized in an unusual structure of superimposed long- and short-range repeats (Streeck and Zachau, 1978). We have now determined the nucleotide sequence of this satellite DNA in both cloned fragments and fragments from the total satellite DNA. Each long-range repeat unit (about 2350 bp) is divided into four segments. Each segment consists of different variants of a basic 23 bp sequence which is itself composed of a dodecanucleotide and a related undecanucleotide. A total of 2400 nucleotides have been sequenced. Detailed analysis of the sequence divergence reveals that both the overall extent of divergence and the frequency of base changes at individual positions of the 23 bp repeats are characteristically different in the various segments. Preferentially methylated sites and a high incidence of symmetry elements are found. In two of the four segments, 22 of 23 bp of the prototype sequence are included in six overlapping elements of dyad symmetry and in a palindrome. A scheme for the evolution of the satellite DNA from a basic dodecanucleotide is proposed which is based on the different degrees of divergence for the various repeats superimposed in this satellite DNA.

16.
17.
MOTIVATION: Analysis of statistical properties of DNA sequences is important for evolutionary biology as well as for DNA probe and PCR technologies. These technologies, in turn, can be used for organism identification, which implies applications in the diagnosis of infectious diseases, environmental studies, etc. RESULTS: We present results of the correlation analysis of distributions of the presence/absence of short nucleotide subsequences of different length ('n-mers', n = 5-20) in more than 1500 microbial and virus genomes, together with five genomes of multicellular organisms (including human). We calculate whether a given n-mer is present or absent (frequency of presence) in a given genome, which is not the usually calculated number of appearances of n-mers in one or more genomes (frequency of appearance). For organisms that are not close relatives of each other, the presence/absence of different 7-20mers in their genomes is not correlated. For close biological relatives, some correlation of the presence of n-mers in this range appears, but is not as strong as expected. Suppressed correlations among the n-mers present in different genomes lead to the possibility of using random sets of n-mers (with appropriately chosen n) to discriminate genomes of different organisms, and possibly individual genomes of the same species, including human, with a low probability of error.
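
The distinction between presence/absence and frequency of appearance can be made concrete with sets. The sketch below is an illustration under stated assumptions, not the authors' pipeline: genomes are plain strings over A, C, G, T on a single strand, and the correlation is the phi coefficient (Pearson correlation of two binary variables) taken over all 4^k possible k-mers. Function names are hypothetical.

```python
import math


def kmer_presence(genome: str, k: int) -> set:
    """Set of distinct k-mers present in `genome` (multiplicity is ignored)."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def presence_correlation(genome_a: str, genome_b: str, k: int) -> float:
    """Phi coefficient between the presence/absence vectors of two genomes.

    Assumes both genomes contain only A, C, G, T, so that there are
    4**k possible k-mers. Returns 0.0 when the coefficient is undefined
    (a genome contains none or all of the possible k-mers).
    """
    a = kmer_presence(genome_a, k)
    b = kmer_presence(genome_b, k)
    n11 = len(a & b)                  # present in both
    n10 = len(a - b)                  # present only in a
    n01 = len(b - a)                  # present only in b
    n00 = 4 ** k - n11 - n10 - n01    # absent from both
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return 0.0
    return (n11 * n00 - n10 * n01) / denom
```

Enumerating sets this way is practical for short k-mers and small genomes; for n near 20 and large genomes, a compact index or hashing scheme would be needed, which is beyond this sketch.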

18.
Operation rules for the high-dimensional-space numerical coding of DNA sequences
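High-dimensional-space binary numerical coding of DNA sequences can encode properties such as base structure, functional groups, base complementarity, and hydrogen-bond strength, and it also makes mathematical and logical operations convenient. The operation rules for this coding are as follows. (1) From the parity properties of the numerical code of a DNA sequence, its correspondence with the last base can be derived: when the value of a DNA sequence S is X(S) = 4n, 4n+1, 4n+2, or 4n+3, the last base is C, T, A, or G, respectively (n = 0, 1, 2, …). (2) The concepts of the apparent dimension Nv, the numerical dimension Nx, and the difference dimension Nd of the high-dimensional space of a DNA sequence are introduced. When Nd = 0, the first base is A or G; when Nd = 2n or 2n+1 (n = 1, 2, …), the leading bases are (C)^n or (C)^nT. (3) Operation rules for point mutations (single nucleotide polymorphisms, SNPs) are derived. (4) Operation rules for DNA tandem repeats are derived. (5) The concept of a DNA subsequence is introduced, and the digital value Xi and the location value Qi of a subsequence are defined, together with formulas for computing them. (6) Operation rules are derived for extension, removal, deletion, insertion, transposition, position exchange, and substitution of DNA sequences. (7) Using bitwise addition, the Hamming distance dh, the base distance dh', the group distance dh″, and the conjugate distance dG of DNA sequences are obtained, along with the meaning of these distances and the relationships among them. (8) The analysis shows that numerical coding of DNA sequences has clear advantages over conventional character coding for mathematical operations.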
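
Rules (1), (2), and (7) can be tried out with a two-bit code per base. The mapping C = 00, T = 01, A = 10, G = 11 used below is inferred from rule (1) (remainders 0, 1, 2, 3 correspond to C, T, A, G) and is an assumption about the paper's scheme, as are the function names and the reading of Nv as twice the sequence length and Nx as the bit length of X(S). The paper's separate distances dh', dh″, and dG are not reproduced here.

```python
# Assumed two-bit code, inferred from rule (1): C=0, T=1, A=2, G=3.
BASE_TO_DIGIT = {"C": 0, "T": 1, "A": 2, "G": 3}
DIGIT_TO_BASE = "CTAG"


def encode(seq: str) -> int:
    """Numerical value X(S): the sequence read as a base-4 number."""
    value = 0
    for base in seq:
        value = value * 4 + BASE_TO_DIGIT[base]
    return value


def last_base(value: int) -> str:
    """Rule (1): the remainder of X(S) modulo 4 identifies the last base."""
    return DIGIT_TO_BASE[value % 4]


def dimension_difference(seq: str) -> int:
    """Rule (2), under the assumed reading Nd = Nv - Nx, with
    Nv = 2 * len(seq) and Nx = number of significant bits of X(S)."""
    return 2 * len(seq) - encode(seq).bit_length()


def bit_hamming_distance(s1: str, s2: str) -> int:
    """Rule (7): Hamming distance from bitwise addition (XOR) of the codes."""
    if len(s1) != len(s2):
        raise ValueError("sequences must have equal length")
    return bin(encode(s1) ^ encode(s2)).count("1")
```

For example, `encode("GA")` is 3 × 4 + 2 = 14, and 14 mod 4 = 2 recovers the last base A. A sequence starting with CT loses three leading bits, so `dimension_difference("CTA")` is 3, which matches the (C)^nT case with n = 1.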

19.
Haneda T, Okada N, Miki T, Danbara H. Plasmid, 2004, 52(3): 218-224
The nucleotide sequence of a small plasmid, designated pRF-1, isolated from Salmonella enterica serovar Choleraesuis, was determined. We identified seven open reading frames (ORFs) encoded by 6066 nucleotides with a total G + C content of 53.6%. Analysis of the complete nucleotide sequence revealed a replicon of pRF-1 to have high similarity to the p15A origin of replication, with a possible cer-like region. ORF1, which is composed of 816 nucleotides, shows a high degree of similarity to dihydropteroate synthetase encoded by the sulII gene from plasmids in several enteropathogenic bacteria, which functions as the sulfonamide resistance determinant. In fact, Salmonella and Escherichia coli strains carrying pRF-1 were found to show strong resistance to sulfathiazole, suggesting that orf1 is a functional gene. Four of seven ORFs were found to encode putative proteins of unknown function.

20.
The observed frequency of folded rings has been determined as a function of fragment length and degree of resection for DNA from mouse and Necturus. The thermal stability of the ring closure and the kinetics of ring formation have been studied. As seen in the case of Drosophila DNA, mouse and Necturus DNA display a decreasing frequency of folded rings as fragment length increases. We interpret this to mean that repetitious sequences of a given type are clustered into many thousands of characteristic regions, called g-regions. The present paper focuses on the interior organization of g-regions. Variations of two competing models may be entertained: "tandem repetition" and "intermittent repetition". If the g-regions were composed of exact, tandemly-repeating sequences, all observations can be easily explained. In order to maintain the idea that the g-regions contain repetitious blocks located at regular or irregular intervals, one must suppose that such repetitious blocks are long (>200 nucleotide pairs), not internally repetitious, and represent perhaps 80% of the nucleotides in the g-region. Such a sequence can be thought of as a fractional-tandem repeat. For example: HIJXXXABC … HIJXXXABC … HIJXXX, where the X's stand for nucleotides composing sequences that are unrelated to each other, and the letters (ABC … HIJ) represent nucleotides in the non-internally-repetitive repeating sequence. We feel that debate can now be profitably devoted to the question of whether approximately 80 or 100% of the tandemly-repetitious unit is in fact tandem.
