Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
2.
A collection of random Drosophila melanogaster DNA fragments cloned individually in Escherichia coli was screened for the presence of sequences complementary to the 4 S, 5 S and 5.8 S RNA species produced in the D. melanogaster Kc tissue culture line. Four D. melanogaster DNA fragments were found which possessed sequences complementary to the 4 S RNA species but not complementary to the 5 S or 5.8 S RNA. One such cloned fragment (6.81 kilobases in length) was characterized further. It hybridizes in situ to region 22A-C of the left arm of chromosome 2 and does not contain repetitive sequences detectable by renaturation (cot) analysis. This same region was reported earlier by Steffensen and Wimber (Genetics (1971) 69, 163-178) to hybridize in situ to bulk tRNA extracted from D. melanogaster.

3.
The problem tackled here concerns the feasibility of DNA sequencing using hybridization methods. We establish algorithms for and computational limitations to the reconstruction of a sequence from all its subsequences having the same length: in other words, the building of a string that contains all the words of a given set, and only those words. Generally there are several possible strings. We refer to graph theory and propose an algorithm to enumerate all the strings that are solutions. We then carried out simulations using real DNA sequences. They provided some necessary conditions and gave some upper bounds on the length of the sequence to recover in relation to the length of the oligonucleotides. To avoid limiting ourselves to problems that admit a unique solution, we introduce another algorithm that produces a signature for each solution string. Each signature can be tested to determine which one belongs to the correct sequence.
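
The graph-theoretic enumeration described in this abstract can be illustrated with a small sketch. The abstract does not give the authors' algorithm, so the code below is only one plausible reading: it assumes every word (k-mer) occurs exactly once in the target sequence, treats each word as an edge of a de Bruijn graph, and enumerates all paths that use every edge once by plain backtracking. The function names `spectrum` and `reconstruct_all` are hypothetical.

```python
from collections import defaultdict


def spectrum(seq, k):
    """Return the set of all length-k substrings (words) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reconstruct_all(kmers):
    """Enumerate every string that uses each k-mer exactly once.

    Assumption (not from the abstract): no k-mer repeats in the target.
    Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix, and
    a solution is a path that traverses every edge exactly once.
    """
    kmers = sorted(set(kmers))
    edges = defaultdict(list)
    for word in kmers:
        edges[word[:-1]].append(word)

    results = set()

    def extend(node, used, text):
        if len(used) == len(kmers):
            results.add(text)
            return
        for word in edges.get(node, ()):
            if word not in used:
                used.add(word)
                extend(word[1:], used, text + word[-1])
                used.remove(word)

    # Try every k-mer as the first word; dead ends are simply abandoned.
    for start in kmers:
        extend(start[1:], {start}, start)
    return sorted(results)
```

For example, the 3-mer spectrum of `ATGCTGGTGA` admits two reconstructions, because the two loops through the node `TG` can be traversed in either order. Backtracking is exponential in the worst case, so this sketch is only suitable for small spectra.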

4.
Myers' elegant and powerful bit-parallel dynamic programming algorithm for approximate string matching has a restriction that the query length should be within the word size of the computer, typically 64. We propose a modification of Myers' algorithm, in which the modification has a restriction not on the query length but on the maximum number of mismatches (substitutions, insertions, or deletions), which should be less than half of the word size. The time complexity is O(m log |Σ|), where m is the query length and |Σ| is the size of the alphabet Σ. Thus, it is particularly suited for sequences on a small alphabet such as DNA sequences. In particular, it is useful in quickly extending a large number of seed alignments against a reference genome for high-throughput short-read data produced by next-generation DNA sequencers.
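
For orientation, the sketch below shows the original bit-parallel scheme that this abstract builds on, not the proposed modification, whose details the abstract does not give. It follows the commonly cited reformulation of Myers' algorithm with vertical delta vectors `pv` and `mv`. Python integers have arbitrary precision, so the 64-bit query-length limit discussed in the abstract does not arise here; an explicit `mask` stands in for the machine word. The function name `myers_search` is hypothetical.

```python
def myers_search(pattern, text, k):
    """Return end positions j such that some substring of text ending at j
    matches pattern with at most k edits (substitutions, insertions, deletions).

    Sketch of Myers' original bit-vector algorithm; one bit per pattern
    position encodes the +1/-1 differences of a dynamic-programming column.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    mask = (1 << m) - 1          # emulates an m-bit machine word
    high = 1 << (m - 1)          # bit of the last pattern position

    # peq[c] has bit i set when pattern[i] == c.
    peq = {}
    for i, c in enumerate(pattern):
        peq[c] = peq.get(c, 0) | (1 << i)

    pv, mv, score = mask, 0, m   # first column: distances 0, 1, ..., m
    hits = []
    for j, c in enumerate(text):
        eq = peq.get(c, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 differences
        mh = pv & xh                    # horizontal -1 differences
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        ph = (ph << 1) & mask           # no carry-in: a match may start anywhere
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
        if score <= k:
            hits.append(j)
    return hits
```

Each text character costs a constant number of word operations, which is where the speed of the bit-parallel approach comes from when the pattern fits in one word.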

5.
Monte Carlo simulations are useful for verifying the significance of data. Genomic regularities, such as nucleotide correlations or the non-uniform distribution of motifs throughout genomic or mature mRNA sequences, exist, and their significance can be checked by means of a Monte Carlo test. The test needs good-quality random sequences in order to work; moreover, they should have the same nucleotide distribution as the sequences in which the regularities have been found. Random DNA sequences are also useful for estimating the background score of an alignment, that is, a threshold below which the resulting score is merely due to chance. We have developed RANDNA, free software that produces random DNA or RNA sequences of a specified length and percentage nucleotide composition. Sequences having the same nucleotide distribution as exonic, intronic or intergenic sequences can be generated. Its graphical interface makes it easy to set the parameters that characterize the sequences, which are produced and saved in a text file. The pseudo-random number generator function of Borland Delphi 6 is used, since it guarantees good randomness, a long cycle length and high speed. We have checked the quality of the sequences generated by the software by means of well-known tests, both by themselves and versus genuine random sequences. We show the good quality of the generated sequences. The software, complete with examples and documentation, is freely available to users from: http://www.introni.it/en/software.
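
The core operation described here, drawing a random sequence with a prescribed length and nucleotide composition, can be sketched in a few lines. This is not RANDNA's code: RANDNA uses the Borland Delphi 6 generator, whereas this sketch assumes Python's standard `random` module, and the function name `random_sequence` and its parameters are hypothetical.

```python
import random


def random_sequence(length, composition, seed=None):
    """Draw a random sequence whose letters follow the given composition.

    composition maps each letter to a non-negative weight (fractions or
    percentages both work, since only the ratios matter). Letters are
    drawn independently, so the realized composition only approximates
    the requested one for short sequences.
    """
    rng = random.Random(seed)            # a seed makes the output reproducible
    letters = list(composition)
    weights = [composition[b] for b in letters]
    return "".join(rng.choices(letters, weights=weights, k=length))
```

A call such as `random_sequence(500, {"A": 30, "C": 20, "G": 20, "T": 30}, seed=1)` would give a 500-base sequence with roughly 40% GC content. For RNA, the key `"U"` can be used in place of `"T"`.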

6.
An accurate approximation is derived to the distribution of the length of the longest matching word present between two random DNA sequences of finite length, using only elementary probability arguments. The distribution is shown to be consistent with previous asymptotic results for the mean and variance of longest common words. The application of the distribution to assessing the statistical significance of sequence similarities is considered. It is shown how the distribution can be modified to take account of non-independence of neighbouring bases in real sequences.

7.
We have isolated four repetitive DNA fragments from maize DNA. Only one of these sequences showed homology to sequences within the EMBL database, despite each having an estimated copy number of between 3 × 10^4 and 5 × 10^4 per haploid genome. Hybridization of the four repeats to maize mitotic chromosomes showed that the sequences are evenly dispersed throughout most, but not all, of the maize genome, whereas hybridization to yeast colonies containing random maize DNA fragments inserted into yeast artificial chromosomes (YACs) indicated that there was considerable clustering of the repeats at a local level. We have exploited the distribution of the repeats to produce repetitive sequence fingerprints of individual YAC clones. These fingerprints not only provide information about the occurrence and organization of the repetitive sequences within the maize genome, but they can also be used to determine the organization of overlapping maize YAC clones within a contiguous fragment (contigs). Key words: maize, repetitive DNA, YACs.

8.
This report deals with the study of compositional properties of human gene sequences, evaluating similarities and differences among functionally distinct sectors of the gene independently of the reading frame. To retrieve the compositional information of DNA, we present a neighbor base dependent coding system in which the alphabet of 64 letters (DNA triplets) is compressed to an alphabet of 14 letters, here termed triplet composons. The triplets containing the same set of distinct bases in whatever order and number form a triplet composon. The reading of the DNA sequence is performed starting at any letter of the initial triplet and then moving, triplet-to-triplet, until the end of the sequence. The readings were made in an overlapping way along the length of the sequences. The analysis of the compositional content in terms of the composon usage frequencies of the gene sequences shows that: (i) the compositional content of the sequences is far from that of random sequences, even in the case of non-protein coding sequences; (ii) coding sequences can be classified as components of compositional clusters; and (iii) intron sequences in a cluster have the same composon usage frequencies, even as their base composition differs notably from that of their home coding sequences. A comparison of the composon usage frequencies between human and mouse homologous genes indicated that two clusters found in humans do not have their counterpart in mouse, whereas the other clusters are stable in both species with respect to their composon usage frequencies in both coding and noncoding sequences.
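
The compression from 64 triplets to 14 composons follows directly from the definition: a composon is the set of distinct bases in a triplet, and a four-letter alphabet has 4 one-base, 6 two-base and 4 three-base subsets. The sketch below illustrates this mapping. The reading scheme is an assumption: the abstract's description of overlapping readings is interpreted here as a window sliding one base at a time. The names `composon` and `composon_usage` are hypothetical.

```python
from collections import Counter


def composon(triplet):
    """Map a triplet to its composon: the sorted set of distinct bases.

    For example AAG, AGA and GAG all map to "AG".
    """
    return "".join(sorted(set(triplet.upper())))


def composon_usage(seq):
    """Return composon usage frequencies over overlapping triplets.

    Assumption: the window advances one base at a time, so every
    reading frame contributes to the counts.
    """
    seq = seq.upper()
    counts = Counter(composon(seq[i:i + 3]) for i in range(len(seq) - 2))
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}
```

Because order and multiplicity within the triplet are discarded, the composon alphabet captures which bases co-occur locally rather than which codon is present, which is consistent with the frame-independent analysis described in the abstract.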

9.
A method is developed to study the periodic properties of nucleotide sequences, allowing the favoured pattern of the repeating unit, as well as the length and localization of this periodic segment, to be determined simultaneously. The degree of periodicity is evaluated by calculating the probabilities for random occurrence of the maximal deviations of the nucleotide composition in each phase, making use of the binomial formula. The nucleotide sequence of the tobacco mosaic virus (TMV) RNA responsible for recognition of the homologous protein (“assembly origin”, AO) (Zimmern & Butler, 1977) was investigated in order to find periodic regions of primary structure which might be essential in the recognition process. As a result, the most periodic segments of the AO, consisting of 31 and 17 nucleotides corresponding to the schemes GAU or GA1, have been found. However, the periodicities in these regions do not exceed that expected for random sequences. This can be considered as evidence that, in addition to peculiarities of primary structure, some other features such as RNA secondary or tertiary structure are essential in this interaction. For comparison, the nucleotide sequences of the other fragments of TMV RNA as well as MS2 RNA, TYMV RNA, 16S rRNA and phage fd DNA were investigated by the same method.

10.
Efficient detection of unusual words.   Total citations: 3, self-citations: 0, citations by others: 3
Words that are, by some measure, over- or underrepresented in the context of larger sequences have been variously implicated in biological functions and mechanisms. In most approaches to such anomaly detections, the words (up to a certain length) are enumerated more or less exhaustively and are individually checked in terms of observed and expected frequencies, variances, and scores of discrepancy and significance thereof. Here we take the global approach of annotating the suffix tree of a sequence with some such values and scores, having in mind to use it as a collective detector of all unexpected behaviors, or perhaps just as a preliminary filter for words suspicious enough to undergo a more accurate scrutiny. We consider in depth the simple probabilistic model in which sequences are produced by a random source emitting symbols from a known alphabet independently and according to a given distribution. Our main result consists of showing that, within this model, full tree annotations can be carried out in a time-and-space optimal fashion for the mean, variance and some of the adopted measures of significance. This result is achieved by an ad hoc embedding in statistical expressions of the combinatorial structure of the periods of a string. Specifically, we show that the expected value and variance of all substrings in a given sequence of n symbols can be computed and stored in (optimal) O(n^2) overall worst-case, O(n log n) expected time and space. The O(n^2) time bound constitutes an improvement by a linear factor over direct methods. Moreover, we show that under several accepted measures of deviation from expected frequency, the candidate over- or underrepresented words are restricted to the O(n) words that end at internal nodes of a compact suffix tree, as opposed to the Θ(n^2) possible substrings. This surprising fact is a consequence of properties in the form that if a word that ends in the middle of an arc is, say, overrepresented, then its extension to the nearest node of the tree is even more so. Based on this, we design global detectors of favored and unfavored words for our probabilistic framework in overall linear time and space, discuss related software implementations and display the results of preliminary experiments.
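
To make the observed-versus-expected comparison concrete, the sketch below computes both quantities for every word of a fixed length under the independent-symbol model the abstract describes. It is a direct, brute-force illustration only: it does not use a suffix tree, does not compute the variance (which depends on the periods of each word), and therefore has none of the efficiency properties claimed in the abstract. Under independence, a word w of length m has expected count (n - m + 1) times the product of its symbol probabilities. The name `word_scores` is hypothetical.

```python
from collections import Counter
from math import prod


def word_scores(seq, m, probs=None):
    """Return {word: (observed_count, expected_count)} for words of length m.

    probs maps each symbol to its probability; if omitted, the symbol
    frequencies of seq itself are used (an assumption of this sketch).
    Expected counts assume symbols are emitted independently.
    """
    n = len(seq)
    if probs is None:
        counts = Counter(seq)
        probs = {c: counts[c] / n for c in counts}
    positions = n - m + 1
    observed = Counter(seq[i:i + m] for i in range(positions))
    return {
        word: (obs, positions * prod(probs[c] for c in word))
        for word, obs in observed.items()
    }
```

A word whose observed count is far above or below its expected count is a candidate for the kind of scrutiny the abstract describes; a proper significance score would also need the variance.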

11.
We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Σ and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(D N^(3 pow(ε) − 1) log N), where ε = D/P and pow(ε) is an increasing, concave function that is 0 when ε = 0 and about 0.9 for DNA and protein sequences.

12.
A new method to compare two (or several) symbol sequences is developed. The method is based on the comparison of the frequencies of the small fragments of the compared sequences; it requires neither string editing, nor other transformations of the compared objects. The comparison is executed through a calculation of the specific entropy of a frequency dictionary against the special dictionary called the hybrid one; this latter is the statistical ancestor of the group of sequences under comparison. Some applications of the developed method in the fields of genetics and bioinformatics are discussed.

13.
A statistical reference for RNA secondary structures with minimum free energies is computed by folding large ensembles of random RNA sequences. Four nucleotide alphabets are used: two binary alphabets, AU and GC, the biophysical AUGC and the synthetic GCXK alphabet. RNA secondary structures are made of structural elements, such as stacks, loops, joints, and free ends. Statistical properties of these elements are computed for small RNA molecules of chain lengths up to 100. The results of RNA structure statistics depend strongly on the particular alphabet chosen. The statistical reference is compared with the data derived from natural RNA molecules with similar base frequencies. Secondary structures are represented as trees. Tree editing provides a quantitative measure for the distance d_t between two structures. We compute a structure density surface as the conditional probability of two structures having distance t given that their sequences have distance h. This surface indicates that the vast majority of possible minimum free energy secondary structures occur within a fairly small neighborhood of any typical (random) sequence. Correlation lengths for secondary structures in their tree representations are computed from probability densities. They are appropriate measures for the complexity of the sequence-structure relation. The correlation length also provides a quantitative estimate for the mean sensitivity of structures to point mutations. © 1993 John Wiley & Sons, Inc.

14.
This article deals with the relationship between vocabulary (total number of distinct oligomers or “words”) and text-length (total number of oligomers or “words”) for a coding DNA sequence (CDS). For natural human languages, Heaps established a mathematical formula known as Heaps’ law, which relates vocabulary to text-length. Our analysis shows that Heaps’ law fails to model this relationship for CDSs. Here we develop a mathematical model to establish the relationship between the number of word types (vocabulary) and the number of words sampled (text-length) for CDSs, when non-overlapping nucleotide strings with the same length are treated as words. We use a hyperbolic tangent function, which captures the saturation property of vocabulary. Based on the parameters of the model, we formulate a mathematical equation, known as the “equation of word organization”, whose parameters essentially indicate that the nucleotide organization of coding sequences differs from one sequence to another. We also compare the word organization of CDSs with the random word distribution and conclude that a CDS is similar neither to a natural human language nor to a random one. Moreover, these sequences have their unique nucleotide organization and it is completely structured for specific biological functioning.

15.
Many noncoding RNAs (ncRNAs) function through both their sequences and secondary structures. Thus, secondary structure derivation is an important issue in today's RNA research. The state-of-the-art structure annotation tools are based on comparative analysis, which derives consensus structure of homologous ncRNAs. Despite promising results from existing ncRNA aligning and consensus structure derivation tools, there is a need for more efficient and accurate ncRNA secondary structure modeling and alignment methods. In this work, we introduce a consensus structure derivation approach based on grammar string, a novel ncRNA secondary structure representation that encodes an ncRNA's sequence and secondary structure in the parameter space of a context-free grammar (CFG) and a full RNA grammar including pseudoknots. Being a string defined on a special alphabet constructed from a grammar, grammar string converts ncRNA alignment into sequence alignment. We derive consensus secondary structures from hundreds of ncRNA families from BraliBase 2.1 and 25 families containing pseudoknots using grammar string alignment. Our experiments have shown that grammar string-based structure derivation competes favorably in consensus structure quality with Murlet and RNASampler. Source code and experimental data are available at http://www.cse.msu.edu/~yannisun/grammar-string.

16.
Optimal reconstruction of a sequence from its probes.   Total citations: 4, self-citations: 0, citations by others: 4
An important combinatorial problem, motivated by DNA sequencing in molecular biology, is the reconstruction of a sequence over a small finite alphabet from the collection of its probes (the sequence spectrum), obtained by sliding a fixed sampling pattern over the sequence. Such construction is required for Sequencing-by-Hybridization (SBH), a novel DNA sequencing technique based on an array (SBH chip) of short nucleotide sequences (probes). Once the sequence spectrum is biochemically obtained, a combinatorial method is used to reconstruct the DNA sequence from its spectrum. Since technology limits the number of probes on the SBH chip, a challenging combinatorial question is the design of a smallest set of probes that can sequence an arbitrary DNA string of a given length. We present in this work a novel probe design, crucially based on the use of universal bases [bases that bind to any nucleotide (Loakes and Brown, 1994)] that drastically improves the performance of the SBH process and asymptotically approaches the information-theoretic bound up to a constant factor. Furthermore, the sequencing algorithm we propose is substantially simpler than the Eulerian path method used in previous solutions of this problem.

17.
An oligopurine sequence bias occurs in eukaryotic viruses.   Total citations: 10, self-citations: 6, citations by others: 4
Twenty-four DNA and RNA viral nucleotide sequences, comprising over 346 kilobases, have been analyzed for the occurrence of strings of contiguous purine or pyrimidine residues. On average, strings of 10 or more contiguous purines or pyrimidines are found three and a half times more frequently than would be expected for a random distribution of bases. Detailed analysis of the 172 kilobase Epstein-Barr viral sequence shows that the bias in favor of contiguous purine residues increases with the length of the purine string. These findings are similar to those seen for genomic DNA from higher eukaryotes. In contrast, no overrepresentation of oligopurine or oligopyrimidine strings is observed in 52 kilobases from eight bacteriophage and E. coli DNA sequences.
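
The counting step behind this analysis, finding maximal strings of contiguous purines (A, G) or pyrimidines (C, T/U) of at least a given length, can be sketched with regular expressions. The abstract does not describe the authors' procedure or their random-expectation model, so this sketch covers only the observed counts; the function name `count_runs` and the default threshold of 10 (taken from the abstract) are the only parameters.

```python
import re


def count_runs(seq, min_len=10):
    """Count maximal purine and pyrimidine runs of at least min_len bases.

    Returns (purine_runs, pyrimidine_runs). RNA input is accepted by
    treating U as T. Regex matching is greedy and non-overlapping, so
    each maximal run is counted once.
    """
    seq = seq.upper().replace("U", "T")
    purine_runs = re.findall(r"[AG]{%d,}" % min_len, seq)
    pyrimidine_runs = re.findall(r"[CT]{%d,}" % min_len, seq)
    return len(purine_runs), len(pyrimidine_runs)
```

Comparing these counts with those from shuffled or composition-matched random sequences would give an empirical estimate of the kind of enrichment factor the abstract reports.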

18.
Chicken DNA has been digested with restriction enzymes and the size distribution of the DNA fragments containing ovalbumin specific sequences has been examined after separation of the fragments on agarose gels and transfer to nitrocellulose sheets. Hybridisation with terminally 32P-labelled ovalbumin mRNA fragments or with RNA populations transcribed from the DNA of a hybrid plasmid containing ovalbumin sequences was used to locate the DNA fragments coding for ovalbumin. Digestion with enzymes which do not cut within the portion of the ovalbumin gene synthesised from ovalbumin messenger RNA in vitro has shown the presence of several defined fragments carrying ovalbumin specific sequences. Possible explanations of these observations are discussed.

19.
Analysis of the nucleotide sequences at the 5' ends of RNA-primed nascent DNA chains (Okazaki fragments) and of their locations in replicating simian virus 40 (SV40) DNA revealed the precise nature of Okazaki fragment initiation sites in vivo. The primary initiation site for mammalian DNA primase was 3'-purine-dT-5' in the DNA template and the secondary site was 3'-purine-dC-5', with the 5' end of the RNA primer complementary to either the dT or dC. The third position of the initiation site was variable with a preference for dT or dA. About 81% of the available 3'-purine-dT-5' sites and 20% of the 3'-purine-dC-5' sites were used. Purine-rich sites, such as PuPuPu and PyPuPu, were excluded. The 5'-terminal ribonucleotide composition of Okazaki fragments corroborated these conclusions. Furthermore, the length of individual RNA primers was not unique, but varied in size from six to ten bases with some appearing as short as three bases and some as long as 12 bases, depending on the initiation site used. This result was consistent with the average size (9 to 11 bases) of RNA primers isolated from specific regions of the genome. Excision of RNA primers did not appear to stop at the RNA-DNA junction, but removed a variable number of deoxyribonucleotides from the 5' end of the nascent DNA chain. Finally, only one-fourth of the replication forks contained an Okazaki fragment, and the distribution of their initiation sites between the two arms revealed that Okazaki fragments were initiated exclusively (99%) on retrograde DNA templates. The data obtained at two genomic sites about 350 and 1780 bases from ori were essentially the same as that reported for the ori region (Hay & DePamphilis, 1982), suggesting that the mechanism used to synthesize the first DNA chain at ori is the same as that used to synthesize Okazaki fragments throughout the genome.

20.
The location of nucleosomes on genes for 5 S rRNA in rat liver was determined by the preparation of nucleosome core DNA fragments, hybridization with 5 S rRNA, RNase digestion, and gel electrophoresis. The resulting size distribution of RNA fragments was essentially the same as that found when the experiment was carried out with random DNA fragments.
