Similar literature
 20 similar documents found; search took 15 ms
1.
Analysis of genomic sequences by Chaos Game Representation   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Chaos Game Representation (CGR) is an iterative mapping technique that processes sequences of units, such as nucleotides in a DNA sequence or amino acids in a protein, in order to find the coordinates for their position in a continuous space. This distribution of positions has two properties: it is unique, and the source sequence can be recovered from the coordinates such that distance between positions measures similarity between the corresponding sequences. The possibility of using the latter property to identify succession schemes has been entirely overlooked in previous studies, which raises the possibility that CGR may be upgraded from a mere representation technique to a sequence modeling tool. RESULTS: The distribution of positions in the CGR plane was shown to be a generalization of Markov chain probability tables that accommodates non-integer orders. Therefore, Markov models are particular cases of CGR models rather than the reverse, as currently accepted. In addition, the CGR generalization has both practical (computational efficiency) and fundamental (scale independence) advantages. These results are illustrated by using Escherichia coli K-12 as a test data-set, in particular, the genes thrA, thrB and thrC of the threonine operon.
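The iterative mapping and the recoverability property described in this abstract can be illustrated with a minimal sketch. The corner assignment (A, C, G, T on the corners of the unit square), the starting point at the centre, and the function names `cgr_coordinates` and `recover_sequence` are assumptions made for illustration; the abstract does not specify them.

```python
# Assumed corner layout on the unit square (a commonly used convention,
# not stated in the abstract).
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_coordinates(sequence, start=(0.5, 0.5)):
    """Return the CGR position reached after each base of `sequence`."""
    x, y = start
    points = []
    for base in sequence.upper():
        cx, cy = CORNERS[base]
        # Move halfway from the current position towards the base's corner.
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


def recover_sequence(point, length):
    """Invert the map: read `length` bases back from a single CGR position."""
    x, y = point
    bases = []
    for _ in range(length):
        # The quadrant of the point identifies the most recent base.
        cx = 1.0 if x >= 0.5 else 0.0
        cy = 1.0 if y >= 0.5 else 0.0
        bases.append(next(b for b, c in CORNERS.items() if c == (cx, cy)))
        # Undo one halving step to reach the previous position.
        x, y = 2.0 * x - cx, 2.0 * y - cy
    return "".join(reversed(bases))


if __name__ == "__main__":
    pts = cgr_coordinates("GATTACA")
    print(pts[-1], recover_sequence(pts[-1], 7))
```

With double-precision floats the inversion is exact only for short sequences (roughly up to 50 bases); longer inputs would need exact rational arithmetic.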

2.
DNA sequencing with direct blotting electrophoresis.   Total citations: 10, self-citations: 0, citations by others: 10
S Beck, F M Pohl. The EMBO Journal, 1984, 3(12):2905-2909
A method for transferring the DNA molecules of sequencing reaction mixtures onto an immobilizing matrix during electrophoresis has been developed. A blotting membrane moves with constant speed across the end of a very short, denaturing gel and collects the molecules according to size. A constant distance between bands for molecules differing in length by one nucleotide is obtained over a large range (approximately 600 nucleotides with a 5% gel), simplifying the determination of DNA sequences considerably. Reliable sequences of 500 nucleotides can be read and sequence features up to greater than 1000 nucleotides are revealed in a single experiment. The sequencing of a potential Z-DNA-forming fragment from Escherichia coli DNA is given as an example and possible further developments are discussed.

3.
The data deluge in the post-genomic era demands the development of novel data mining tools. Existing molecular phylogeny analyses (MPAs) developed for individual gene/protein sequences are alignment-based. However, the size of genomic data and the uncertainties associated with alignments necessitate the development of alignment-free methods for MPA. Derivation of distances between sequences is an important step in both alignment-dependent and alignment-free methods. Various alignment-free distance measures based on oligo-nucleotide frequencies, information content, compression techniques, etc. have been proposed. However, these distance measures do not account for the relative order of components, viz. nucleotides or amino acids. A new distance measure, based on the concept of 'return time distribution' (RTD) of k-mers, is proposed, which accounts for the sequence composition and their relative orders. Statistical parameters of RTDs are used to derive a distance function. The resultant distance matrix is used for clustering and phylogeny using Neighbor-joining. Its performance for MPA and subtyping was evaluated using simulated data generated by block-bootstrap, receiver operating characteristics and leave-one-out cross validation methods. The proposed method was successfully applied for MPA of family Flaviviridae and subtyping of Dengue viruses. It is observed that the method retains resolution for classification and subtyping of viruses at varying levels of sequence similarity and taxonomic hierarchy.
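A rough sketch of the return-time idea follows. It assumes that the return time of a k-mer is the gap between the start positions of successive occurrences, that the "statistical parameters" are the mean and standard deviation, and that the distance is Euclidean; none of these choices is confirmed by the abstract, and all function names are hypothetical.

```python
from itertools import product
from math import dist  # Python 3.8+
from statistics import mean, pstdev


def return_times(sequence, kmer):
    """Gaps between start positions of successive occurrences of `kmer`."""
    k = len(kmer)
    positions = [i for i in range(len(sequence) - k + 1)
                 if sequence[i:i + k] == kmer]
    return [b - a for a, b in zip(positions, positions[1:])]


def rtd_summary(sequence, kmer):
    """Mean and standard deviation of the return times (assumed parameters)."""
    times = return_times(sequence, kmer)
    if not times:
        return (0.0, 0.0)  # k-mer occurs fewer than twice
    return (mean(times), pstdev(times))


def rtd_vector(sequence, k=1, alphabet="ACGT"):
    """Concatenate the summaries of every possible k-mer into one vector."""
    vector = []
    for kmer in map("".join, product(alphabet, repeat=k)):
        vector.extend(rtd_summary(sequence, kmer))
    return vector


def rtd_distance(seq_a, seq_b, k=1):
    """Euclidean distance between RTD vectors (an assumed distance function)."""
    return dist(rtd_vector(seq_a, k), rtd_vector(seq_b, k))


if __name__ == "__main__":
    print(return_times("ACGACGTTACG", "ACG"))
    print(rtd_distance("ACGACGTTACG", "AACCGGTTAACC"))
```

A matrix of such pairwise distances could then be passed to any neighbor-joining implementation.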

4.
Diversity of T cell receptor (TCR) genes is primarily generated by nucleotide insertions upon rearrangement from their germ line-encoded V, D and J segments. Nucleotide insertions at V-D and D-J junctions are random, but some small subsets of these insertions are exceptional, in that one to three base pairs inversely repeat the sequence of the germline DNA. These short complementary palindromic sequences are called P nucleotides. We apply the ImmunoSeq deep-sequencing assay to the third complementarity determining region (CDR3) of the β chain of T cell receptors, and use the resulting data to study P nucleotides in the repertoire of naïve and memory CD8+ and CD4+ T cells. We estimate P nucleotide distributions in a cross section of healthy adults and different T cell subtypes. We show that P nucleotide frequency in all T cell subtypes ranges from 1% to 2%, and that the distribution is highly biased with respect to the coding end of the gene segment. Classification of observed palindromic sequences into P nucleotides using a maximum conditional probability model shows that single base P nucleotides are very rare in VDJ recombination; P nucleotides are primarily two bases long. To explore the role of P nucleotides in thymic selection, we compare P nucleotides in productive and non-productive sequences of CD8+ naïve T cells. The naïve CD8+ T cell clones with P nucleotides are more highly expanded.

5.
The arrangements of inverted-repeated and repeated DNA sequences in the human genome have been investigated by an electron microscope method. The arrangement of the interspersed repeated DNA sequences is found to be similar to the corresponding arrangement found in Xenopus. This arrangement consists of 300-nucleotide-long repeated DNA sequences interspersed with roughly gene-size single-copy DNA sequences. The inverted-repeated sequences are also 300 nucleotides in length and are interspersed with the other DNA sequence classes. Most inverted-repeated sequences (64%) are spaced by another sequence which is recognized by electron microscopy as a single-stranded loop in a hairpin structure. The average length of this spacer loop is 1.6 kilobases. Although some pairs of inverted-repeated sequences are clustered, most seem to be randomly distributed throughout the genome. The average distance separating two pairs of inverted-repeated sequences is 10 to 20 kilobases. The interspersed repeated sequences and inverted-repeated sequences are arranged simultaneously in a portion of the human genome resulting in an interspersion of all three sequence classes.

6.
7.
We consider the problem of making allowance for superhelicity in the statistical-mechanical calculations of fluctuational violations of the DNA double helix. A simple model is discussed, making it possible in the calculations to use an approach based on the theory of helix–coil transition in DNA. The proposed algorithms allow calculating the effect of superhelicity on the base-pair fluctuational opening for any given sequence of nucleotides. An algorithm is also proposed allowing for the hairpin and cruciform structures in the palindromic regions of a sequence, as well as the open and helical states. The theory is used to calculate the melting curve for superhelical DNA at temperatures well below the melting point of the linear or nicked forms. The maps of opening probability are calculated for SV40 and φX174 DNA using their recently published complete nucleotide sequences. The data explain well the experimental results of probing the secondary structure of these DNAs by single strand-specific endonucleases.

8.
Graphical representation of DNA sequences is one of the most popular techniques for alignment-free sequence comparison. Here, we propose a new method for the feature extraction of DNA sequences represented by binary images, by estimating the similarity between DNA sequences using the frequency histograms of local bitmap patterns of images. Our method shows linear time complexity for the length of DNA sequences, which is practical even when long sequences, such as whole genome sequences, are compared. We tested five distance measures for the estimation of sequence similarities, and found that the histogram intersection and Manhattan distance are the most appropriate ones for phylogenetic analyses.
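The two measures singled out in this abstract are simple to state once the pattern histograms exist. The sketch below takes the histograms as given (how the local bitmap patterns are extracted is not described in the abstract) and normalises them to sum to one, which is an assumption; the function names are illustrative.

```python
def normalize(histogram):
    """Scale raw pattern counts so that they sum to 1."""
    total = sum(histogram)
    return [count / total for count in histogram]


def manhattan_distance(p, q):
    """Sum of absolute bin-wise differences (0 means identical histograms)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def histogram_intersection(p, q):
    """Sum of bin-wise minima (1 means identical normalised histograms)."""
    return sum(min(a, b) for a, b in zip(p, q))


if __name__ == "__main__":
    p = normalize([2, 1, 1])
    q = normalize([1, 1, 2])
    print(manhattan_distance(p, q), histogram_intersection(p, q))
```

For histograms normalised this way the two quantities are tied together by Manhattan distance = 2 × (1 − intersection), so they rank sequence pairs identically; they can differ only when the histograms are left unnormalised.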

9.
Duplex adeno-associated virus (AAV) DNA, produced by annealing plus and minus virion single strands, has been digested with several bacterial restriction endonucleases. These studies reveal the existence of alternate secondary structures at the termini of duplex AAV DNA. Analysis of the sites of endo R-Hpa II cleavage, the products of complete endo R-Hpa II digestion, and the multiple terminal secondary structures leads to the conclusion that there are two possible nucleotide sequences at each end of AAV DNA. A model that attributes the terminal nucleotide sequence heterogeneity to two possible orientations of the first 120 nucleotides at each end of the DNA is proposed; in one case the sequence is 1 to 120; in the other case the sequence is inverted. An origin of the inversion is suggested based on previously described intermediates in AAV DNA replication.

10.
The similarity of two nucleotide sequences is often expressed in terms of evolutionary distance, a measure of the amount of change needed to transform one sequence into the other. Given two sequences with a small distance between them, can their similarity be explained by their base composition alone? The nucleotide order of these sequences contributes to their similarity if the distance is much smaller than their average permutation distance, which is obtained by calculating the distances for many random permutations of these sequences. To determine whether their similarity can be explained by their dinucleotide and codon usage, random sequences must be chosen from the set of permuted sequences that preserve dinucleotide and codon usage. The problem of choosing random dinucleotide and codon-preserving permutations can be expressed in the language of graph theory as the problem of generating random Eulerian walks on a directed multigraph. An efficient algorithm for generating such walks is described. This algorithm can be used to choose random sequence permutations that preserve (1) dinucleotide usage, (2) dinucleotide and trinucleotide usage, or (3) dinucleotide and codon usage. For example, the similarity of two 60-nucleotide DNA segments from the human beta-1 interferon gene (nucleotides 196-255 and 499-558) is not just the result of their nonrandom dinucleotide and codon usage.
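The graph formulation can be made concrete for case (1), dinucleotide usage only. The sketch below treats each base as a vertex and each adjacent pair as a directed edge, picks a random "last exit" edge for every vertex, retries until those edges lead into the final base, and then walks the graph. It is an illustrative reading of the Eulerian-walk idea, not necessarily the algorithm described in the paper, and the names `dinucleotide_shuffle` and `_reaches` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def _reaches(vertex, target, last_edge):
    """Follow last-exit edges from `vertex`; True if they lead to `target`."""
    seen = set()
    while vertex != target:
        if vertex in seen or vertex not in last_edge:
            return False
        seen.add(vertex)
        vertex = last_edge[vertex]
    return True


def dinucleotide_shuffle(seq, rng=random):
    """Return a random permutation of `seq` with identical dinucleotide counts."""
    if len(seq) < 3:
        return seq
    successors = defaultdict(list)  # vertex -> list of outgoing edge targets
    for a, b in zip(seq, seq[1:]):
        successors[a].append(b)
    first, last = seq[0], seq[-1]

    # Choose the edge each vertex leaves by last; retry until these edges
    # form a tree directed into the final vertex.
    while True:
        last_edge = {v: rng.choice(out)
                     for v, out in successors.items() if v != last}
        if all(_reaches(v, last, last_edge) for v in last_edge):
            break

    # Shuffle the remaining edges of each vertex, keeping the last exit last.
    order = {}
    for v, out in successors.items():
        rest = list(out)
        if v in last_edge:
            rest.remove(last_edge[v])
        rng.shuffle(rest)
        if v in last_edge:
            rest.append(last_edge[v])
        order[v] = rest

    # Walk the multigraph, consuming each vertex's edges in the chosen order.
    walk, used = [first], Counter()
    for _ in range(len(seq) - 1):
        v = walk[-1]
        walk.append(order[v][used[v]])
        used[v] += 1
    return "".join(walk)


if __name__ == "__main__":
    print(dinucleotide_shuffle("ACGTACGGTCAATCGGAT"))
```

The shuffled sequence keeps the first and last base of the input, which follows from the walk starting and ending at the same vertices as the original sequence.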

11.
Comparison of left-end DNA sequences of bacteriophages Mu and D108   Total citations: 3, self-citations: 0, citations by others: 3
A I Bukhari, J R Lupski, P Svec, G N Godson. Gene, 1985, 33(2):235-239
The nucleotide sequences of the left ends of bacteriophage Mu DNA and that of its close relative D108 have been determined. The first 100 bp of phages Mu and D108 are substantially the same except for an octanucleotide change from bp 53 to 61 and other small interspersed base-pair changes from bp 61 to 200. The first five host nucleotides preceding the host-phage junction are generally, but not always, G + C-rich and these five nucleotides display no obvious consensus sequence. Both phages Mu and D108 share striking similarity in their end DNA sequences to the end sequences of the newly described Escherichia coli movable genetic element IS30.

12.
This paper proposes a graphical method for detecting interspecies recombination in multiple alignments of DNA sequences. A fixed-size window is moved along a given DNA sequence alignment. For every position, the marginal posterior probability over tree topologies is determined by means of a Markov chain Monte Carlo simulation. Two probabilistic divergence measures are plotted along the alignment, and are used to identify recombinant regions. The method is compared with established detection methods on a set of synthetic benchmark sequences and two real-world DNA sequence alignments.

13.
Computer programs for the assembly of DNA sequences.   Total citations: 26, self-citations: 20, citations by others: 6
A collection of user-interactive computer programs is described which aid in the assembly of DNA sequences. This is achieved by searching for the positions of overlapping common nucleotide sequences within the blocks of sequence obtained as primary data. Such overlapping segments are then melded into one continuous string of nucleotides. Strategies for determining the accuracy of the sequence being analyzed and reducing the error rate resulting from the manual manipulation of sequence data are discussed. Sequences mapping from 97.3 to 100% of the Ad2 virus genome were used to demonstrate the performance of these programs.
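The overlap search and melding step can be sketched for two fragments as follows. This assumes exact matches, a single strand, and a minimum overlap of three bases; the actual programs are not described in enough detail to reproduce, and the names `longest_overlap` and `meld` are illustrative.

```python
def longest_overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def meld(left, right, min_length=3):
    """Join two fragments on their overlap, or return None if none is found."""
    size = longest_overlap(left, right, min_length)
    if size == 0:
        return None
    return left + right[size:]


if __name__ == "__main__":
    print(meld("ACGTTGCA", "TGCAGGTC"))  # overlap is TGCA
```

Real sequencing reads contain errors and may come from either strand, so a practical assembler would also need approximate matching and reverse-complement handling.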

14.
DNA's genetic code can be represented as an alphabetic sequence composed of the four letters A, C, G, and T, which represent the four types of nucleotides--adenylic, cytidylic, guanylic, and thymidylic acid--of which DNA is composed. Now that these sequences have been identified for many genes and are available in computer-readable form, scientists can analyze these data and search for patterns in an attempt to learn more about the regulatory functions of the gene. One area of study is that of the frequency of occurrence of specific nucleotide subsequences (e.g., ACAC) within part or all of a nucleotide sequence. This paper derives the probability distribution of the frequency of occurrence of a subsequence within a nucleotide sequence, under the hypothesis that the four nucleotides occur at random and with equal probability. This distribution is nontrivial because different subsequences have different "overlap capability." For example, the subsequence AAAA can occur up to 17 times in a sequence of length 20 (which would happen if the sequence were composed solely of A's), but the subsequence ACGT cannot occur more than 5 times in a sequence of length 20. Thus, the frequency distributions are different for each type of overlap capability. It is of interest to assess and compare the degree of nonrandomness for different subsequences or among different portions of a sequence; the existence and degree of nonrandomness may be related to the type and degree of functionality of a nucleotide (sub)sequence. The frequency distributions provided here can be used to perform exact significance tests of the hypothesis of randomness. An approximate test is also described for use with long sequences; this can be used to test a more general null hypothesis of nucleotides occurring with unequal probabilities.
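The "overlap capability" example in this abstract can be checked directly. The sketch below counts overlapping occurrences and derives the maximum possible count from the shortest period of the subsequence; the function names are illustrative, and the period-based bound is a standard combinatorial argument rather than something taken from the paper.

```python
def count_overlapping(sequence, pattern):
    """Number of (possibly overlapping) occurrences of `pattern` in `sequence`."""
    return sum(1 for i in range(len(sequence) - len(pattern) + 1)
               if sequence.startswith(pattern, i))


def self_overlaps(pattern):
    """Lengths k < len(pattern) at which a prefix equals the suffix of length k."""
    return [k for k in range(1, len(pattern)) if pattern[:k] == pattern[-k:]]


def max_occurrences(pattern, length):
    """Largest possible count of `pattern` in any sequence of the given length."""
    if length < len(pattern):
        return 0
    overlaps = self_overlaps(pattern)
    # Two occurrences can start no closer together than the shortest period.
    period = len(pattern) - (max(overlaps) if overlaps else 0)
    return (length - len(pattern)) // period + 1


if __name__ == "__main__":
    for pattern in ("AAAA", "ACAC", "ACGT"):
        print(pattern, self_overlaps(pattern), max_occurrences(pattern, 20))
```

AAAA overlaps itself at every shift, ACAC only at a shift of two, and ACGT not at all, which is why their frequency distributions under the random model differ.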

15.
Structure of the rat prolactin gene   Total citations: 17, self-citations: 0, citations by others: 17
The organization and sequence of the rat preprolactin gene has been investigated. Analysis of two different plasmids containing pituitary cDNA inserts has provided the complete 681-nucleotide coding sequence of preprolactin as well as 17 nucleotides preceding the initiation codon and 90 nucleotides following the termination codon. Digestion of rat chromosomal DNA with the restriction endonuclease Eco RI followed by size fractionation and hybridization to a labeled prolactin cDNA probe has demonstrated that prolactin genomic sequences are located on 6.0-, 3.9-, and 2.9-kilobase fragments. The 6.0- and 3.9-kilobase fragments were isolated from a library of cloned rat DNA fragments. The sequence of more than 1800 nucleotides of the cloned DNA has been determined. The sequenced region contains coding regions of 180 and 189 nucleotides which specify the COOH-terminal 123 amino acids of the 227-amino-acid sequence of rat preprolactin. These coding regions are separated by an intervening sequence of 597 nucleotides. At least one other large intervening sequence separates this region from the region coding for the NH2-terminal portion of preprolactin. Hybridization experiments suggested that the intervening sequences of the rat prolactin gene contain DNA sequences which are repeated elsewhere in the rat genome.
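The lengths quoted in this abstract are mutually consistent under the three-nucleotides-per-codon rule, as a quick check shows. The helper name `codons` is illustrative, and the check assumes the 681-nucleotide figure excludes the termination codon, which the arithmetic suggests but the abstract does not state.

```python
def codons(nucleotides):
    """Number of whole codons in a coding stretch of the given length."""
    if nucleotides % 3 != 0:
        raise ValueError("length is not a multiple of three")
    return nucleotides // 3


if __name__ == "__main__":
    # Full coding sequence: 681 nucleotides for 227 amino acids.
    print(codons(681))
    # The two sequenced coding regions: 180 + 189 nucleotides for 123 amino acids.
    print(codons(180 + 189))
```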

16.
A sequence of 1019 nucleotides encompassing one of the 600 base inverted repeats and non-repeated flanking regions has been determined in the type A yeast 2-μm plasmid cloned in pMB9. Methods are described for applying the Maxam-Gilbert sequencing procedure to DNA fragments labelled at the 3'-end using a T4-polymerase exchange/repair reaction and for sequencing 5'-end labelled fragments using dideoxy-nucleotides as chain terminators in the presence of E. coli DNA polymerase (nach Klenow). A notable feature of the sequence is its unusual content of symmetry elements. In one region of 140 nucleotides, 137 are involved in a complex arrangement of direct and inverted repeats linked by palindromic sequences.

17.
DNA sequence predicted from polyacrylamide gel-based technologies is inaccurate because of variations in the quality of the primary data due to limitations of the technology, and to sequence-specific variations due to nucleotide interactions within the DNA molecule and with the gel. The ability to recognize the probability of error in the primary data will be useful in reconstructing the target sequence of a DNA sequencing project, and in estimating the accuracy of the final sequence. This paper describes the use of linear discriminant analysis to assign position-specific probabilities of incorrect, over- and under-prediction of nucleotides for each predicted nucleotide position in primary sequence data generated by a gel-based DNA sequencing technology. Using this method, most of the error potential in primary sequence data can be assigned to a limited number of discrete positions. The use of probability values in the sequence reconstruction process, and in estimating the accuracy of consensus sequence determination is described.

18.
19.
The motA and motB gene products of Escherichia coli are integral membrane proteins necessary for flagellar rotation. We determined the DNA sequence of the region containing the motA gene and its promoter. Within this sequence, there is an open reading frame of 885 nucleotides, which with high probability (98% confidence level) meets criteria for a coding sequence. The 295-residue amino acid translation product had a molecular weight of 31,974, in good agreement with the value determined experimentally by gel electrophoresis. The amino acid sequence, which was quite hydrophobic, was subjected to a theoretical analysis designed to predict membrane-spanning alpha-helical segments of integral membrane proteins; four such hydrophobic helices were predicted by this treatment. Additional amphipathic helices may also be present. A remarkable feature of the sequence is the existence of two segments of high uncompensated charge density, one positive and the other negative. Possible organization of the protein in the membrane is discussed. Asymmetry in the amino acid composition of translated DNA sequences was used to distinguish between two possible initiation codons. The use of this method as a criterion for authentication of coding regions is described briefly in an Appendix.

20.
Five independent clones containing the natural chicken ovomucoid gene have been isolated from a chicken gene library. One of these clones, CL21, contains the complete ovomucoid gene and includes more than 3 kb of DNA sequences flanking both termini of the gene. Restriction endonuclease mapping, electron microscopy and direct DNA sequencing analyses of this clone have revealed that the ovomucoid gene is 5.6 kb long and codes for a messenger RNA of 821 nucleotides. The structural gene sequence coding for the mature messenger RNA is split into at least eight segments by a minimum of seven intervening sequences of various sizes. The shortest structural gene segment is only 20 nucleotides long. All seven intervening sequences are located within the peptide coding region of the gene, and the sequences at the 5' and 3' untranslated regions of the mRNA are not interrupted by intervening sequences. The DNA sequences of the regions flanking the 5' and 3' termini of the gene have been determined. Thirty nucleotides before the start of the messenger RNA coding sequence is the heptanucleotide TATATAT, which is also present in a similar location relative to the chicken ovalbumin gene and other unique sequence eucaryotic genes. This sequence resembles that of the Pribnow box in procaryotic genes where a promoter function has been implicated. Seven nucleotides past the 3' end of the gene is the tetranucleotide TTGT, a sequence found to be present at identical locations as either TTTT or TTGT in other eucaryotic genes that have been sequenced. These conserved DNA sequences flanking eucaryotic genes may serve some regulator function in the expression of these genes.

