Similar articles
 20 similar articles found
1.
W Saurin, P Marlière. Biochimie 1985, 67(5): 517-521
A set of sequences can be defined by their common subsequences, and the length of these is a measure of the overall resemblance of the set. Each subsequence corresponds to a succession of symbols embedded in every sequence, following the same order but not necessarily contiguous. Determining the longest common subsequence (LCS) requires the exhaustive testing of all possible common subsequences, which sum up to about 2^L, if L is the length of the shortest sequence. We present a polynomial algorithm (O(n × L^4), where n is the number of sequences) for generating strings related to the LCS and constructed with the sequence alphabet and an indetermination symbol. Such strings are iteratively improved by deleting indetermination symbols and concomitantly introducing the greatest number of alphabet symbols. Processed accordingly, nucleic acid and protein sequences lead to key-words encompassing the salient positions of homologous chains, which can be used for aligning or classifying them, as well as for finding related sequences in data banks.
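
To make the notion of a longest common subsequence concrete, here is a minimal sketch of the textbook dynamic-programming LCS for two sequences. It is not the polynomial multi-sequence algorithm of the entry above, which handles n sequences and an indetermination symbol; the function name `lcs` is an assumption introduced only for this illustration.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (textbook DP, two sequences only)."""
    m, n = len(a), len(b)
    # table[i][j] holds the LCS length of the prefixes a[:i] and b[:j].
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back from the bottom-right corner to recover one optimal subsequence.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("AGGTAB", "GXTXAYB"))
```

The table costs O(len(a) × len(b)) time and memory, in contrast with the exhaustive enumeration of subsequences mentioned in the abstract.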

2.
Comparative analysis of related DNA sequences has been simplified by the transformation of data in the standard A, G, C, T format into a set of geometric symbols that promote pattern recognition. Previously, comparing more than 2 or 3 sequences simultaneously has been difficult because of the monotonous patterns established by letters. Here 33 sequences are simultaneously compared to demonstrate the ease with which nucleotide substitutions are accurately identified. This has been accomplished by writing a Word-Perfect macro program to facilitate this transformation. Since this word processing program is widely used, performing this kind of analysis is readily achievable in most laboratories involved in DNA sequence analysis.

3.
Pattee HH. Bio Systems 2001, 60(1-3): 5-21
Evolution requires the genotype–phenotype distinction, a primeval epistemic cut that separates energy-degenerate, rate-independent genetic symbols from the rate-dependent dynamics of construction that they control. This symbol–matter or subject–object distinction occurs at all higher levels where symbols are related to a referent by an arbitrary code. The converse of control is measurement in which a rate-dependent dynamical state is coded into quiescent symbols. Non-integrable constraints are one necessary condition for bridging the epistemic cut by measurement, control, and coding. Additional properties of heteropolymer constraints are necessary for biological evolution.

4.
This paper concerns sequences of letters in which certain “distinguished” words are of interest. Such sequences arise as data in numerous fields including genetics and neuroscience. A probability distribution is given for the number of occurrences of a chosen word in a randomized sequence of letters. Such words are considered “favored” if they occur more than expected at random. Favored words have been discovered in nerve impulse trains and may reflect a neural coding scheme. This article is dedicated to my mother, Margaret Oakley Dayhoff, whose enthusiasm encouraged me to pursue research in mathematical biology.
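
The idea of a word occurring "more than expected at random" can be sketched as follows. This is a simplified illustration, not the probability distribution derived in the entry above: it assumes letters are drawn independently with the frequencies observed in the sequence, and it computes only the expected count, not the full distribution. The function names are hypothetical.

```python
from collections import Counter


def count_occurrences(seq: str, word: str) -> int:
    """Count occurrences of word in seq, allowing overlaps."""
    k = len(word)
    if k == 0:
        return 0
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)


def expected_occurrences(seq: str, word: str) -> float:
    """Expected count of word if letters were independent with seq's letter frequencies.

    Assumption for illustration only: an i.i.d. letter model, so the expectation is
    (number of start positions) * (product of the letter probabilities in word).
    """
    n, k = len(seq), len(word)
    if k == 0 or k > n:
        return 0.0
    freq = Counter(seq)
    p = 1.0
    for ch in word:
        p *= freq[ch] / n
    return (n - k + 1) * p


def is_favored(seq: str, word: str) -> bool:
    """A word is called 'favored' here if it occurs more often than expected."""
    return count_occurrences(seq, word) > expected_occurrences(seq, word)
```

A real analysis would compare the observed count against the distribution's tail rather than its mean alone.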

5.
A Tremolieres. Biochimie 1980, 62(7): 493-496
In this article it is suggested that the first coding system started from specific interactions between the nucleotide part of nucleotidic cofactors and enzymes. These interactions generated a first primitive code of four words of one letter each for the four primitive amino acids (phenylalanine, lysine, glycine and proline); when the triplet code (which allowed the integration of 20 amino acids into proteins) progressively appeared, it must have been modulated by the existence of this first coding system.

6.
7.
S. Ohno. Animal Genetics 1988, 19(4): 305-316
Inasmuch as all events in this universe are governed by multitudes of periodicities, it is a mistake to regard any coding sequence as unique implying the descent from random assemblages of four bases. Instead, each coding sequence is comprised of primordial and derived repeating units. In the case of families of proteins with transmembrane alpha-helices, the primordial repeating units of their coding sequences were base heptamers, thus giving the heptapeptidic periodicity very conducive to alpha-helix formation to the original polypeptide chains. Even in modern coding sequences for these families of proteins, intact and base-substituted copies of these primordial heptamers are found in more or less even distribution along the entire coding sequence. In addition, there are now locally prominent tandemly recurring units that are only remotely related to primordial heptamers. In the case of Ca++ channel, local prominence of one such nonameric unit gave a unique tripeptidic periodicity to the fourth helix of each unit giving to it a girdle of positively charged residues. All these complex interplays between primordial and derived recurring units that characterize each coding sequence can best be appreciated by their musical transformation. The transformed musical score of a pertinent part of rabbit skeletal muscle Ca++ channel coding sequence is given.

8.
This report deals with the study of compositional properties of human gene sequences evaluating similarities and differences among functionally distinct sectors of the gene independently of the reading frame. To retrieve the compositional information of DNA, we present a neighbor base dependent coding system in which the alphabet of 64 letters (DNA triplets) is compressed to an alphabet of 14 letters here termed triplet composons. The triplets containing the same set of distinct bases in whatever order and number form a triplet composon. The reading of the DNA sequence is performed starting at any letter of the initial triplet and then moving, triplet-to-triplet, until the end of the sequence. The readings were made in an overlapping way along the length of the sequences. The analysis of the compositional content in terms of the composon usage frequencies of the gene sequences shows that: (i) the compositional content of the sequences is far from that of random sequences, even in the case of non-protein coding sequences; (ii) coding sequences can be classified as components of compositional clusters; and (iii) intron sequences in a cluster have the same composon usage frequencies, even as their base composition differs notably from that of their home coding sequences. A comparison of the composon usage frequencies between human and mouse homologous genes indicated that two clusters found in humans do not have their counterpart in mouse whereas the other clusters are stable in both species with respect to their composon usage frequencies in both coding and noncoding sequences.
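
The compression from 64 triplets to 14 composons follows directly from the definition above (triplets with the same set of distinct bases belong together): 4 one-base sets, 6 two-base sets and 4 three-base sets. The sketch below checks that count and computes composon usage frequencies. The overlapping reading is interpreted here as a window sliding one base at a time, which is an assumption about the abstract's wording; the function names and the sorted-letter labels are hypothetical.

```python
from collections import Counter
from itertools import product


def composon(triplet: str) -> str:
    """Label a triplet by its set of distinct bases, e.g. 'GAG' -> 'AG'."""
    return "".join(sorted(set(triplet)))


# Enumerate all 64 DNA triplets and collapse them into composons.
ALL_COMPOSONS = sorted({composon("".join(t)) for t in product("ACGT", repeat=3)})


def composon_usage(seq: str) -> dict:
    """Composon usage frequencies over overlapping triplets (window step of one base).

    Assumes seq contains only A, C, G and T; ambiguity codes are not handled.
    """
    seq = seq.upper()
    counts = Counter(composon(seq[i:i + 3]) for i in range(len(seq) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {c: counts[c] / total for c in ALL_COMPOSONS}


if __name__ == "__main__":
    print(len(ALL_COMPOSONS), ALL_COMPOSONS)
    print(composon_usage("ATGGCCATTG"))
```

Because a composon ignores base order and multiplicity, the frequencies do not depend on which strand position starts the reading, consistent with the frame-independent analysis described in the abstract.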

9.
Ubiquitin coding sequences were isolated from a human genomic library and two cDNA libraries. One human ubiquitin gene consists of 2055 nucleotides and codes for a polyprotein consisting of 685 amino acid residues. The polyprotein contains nine direct repeats of the ubiquitin amino acid sequence and the last ubiquitin sequence is extended with an additional valyl residue at the C-terminal end. No spacer sequences separate the ubiquitin repeats and the coding regions are not interrupted by intervening sequences. This particular gene is transcribed since cDNAs corresponding to the genomic sequence have been isolated. At least two more types of ubiquitin genes are encoded in the human genome, one coding for a ubiquitin monomer while another presumably codes for three or four direct repeats of the ubiquitin sequence. Human DNA contains many copies of the ubiquitin sequence. Ubiquitin is therefore encoded in the human genome as a multigene family.

10.
Plants of Nicotiana benthamiana were transformed with four constructs based on the coat protein gene of a poplar mosaic carlavirus (PMV) isolate from the UK. The four constructs were: the capsid protein coding sequence plus a portion of the adjacent sequence encoding a protein with a molecular mass of 14 kDa (CP14k); the capsid protein coding sequence in the positive sense (CPP); a mutated capsid protein coding sequence (CPM) and the capsid protein coding sequence in the negative sense (CPN). Forty-one regenerated plants, after selection for their kanamycin resistance, were confirmed by PCR to contain the appropriate sequences. Virus coat protein was detected in small amounts in 50% of the plants transformed with the CP14k or CPP constructs. Primary transformants showed a range of reactions to challenge with two isolates of PMV. These varied from apparently no infection in inoculated or in later-formed young leaves, as assessed by ELISA, to typical systemic symptoms associated with large amounts of serologically detected virus. There was no correlation between the level of protection against virus infection and the observed accumulation of transgene protein product. Plants were protected whether transformed with the coat protein coding sequence in the positive or negative sense.

11.
By sequence alignment of the extracellular Serratia marcescens nuclease with three related nucleases we have identified seven charged amino acid residues which are conserved in all four sequences. Six of these residues together with four other partially conserved His or Asp residues were changed to alanine by site-directed PCR-mediated mutagenesis using a variant of the nuclease gene in which the coding sequence of the signal peptide was replaced by the coding sequence for an N-terminal affinity tag [Met(His)6GlySer]. Four of the mutant proteins showed almost no reduction in nuclease activity but five displayed a 10- to 1000-fold reduction in activity and one (His110Ala) was inactive. Based upon these results it is suggested that the S. marcescens nuclease employs a mechanism in which His110 acts in concert with a Mg2+ ion and three carboxylates (Asp107, Glu148 and Glu232) as well as one or two basic amino acid residues (Arg108, Arg152).

12.
Traditional strategies for establishing shRNA expression constructs are inefficient, error-prone, or costly. We describe a simple approach that overcomes these drawbacks. Briefly, the sense and antisense strands of the short hairpin RNA coding sequence are segmented into two parts, respectively, at asymmetric sites. The four resulting short oligonucleotides are synthesized. Each oligonucleotide is annealed with its opposite, resulting in a double-stranded fragment with sticky termini at both ends. The two fragments so generated can be easily spliced by simple ligation to reconstitute the full-length short hairpin RNA coding sequence which can then be cloned into an appropriately restricted vector.

13.
D W Chung, E W Davie. Biochemistry 1984, 23(18): 4232-4236
cDNAs and the genomic DNA coding for the gamma and gamma' chains of human fibrinogen have been isolated and characterized by sequence analysis. The cDNAs coding for the gamma and gamma' chains share a common nucleotide sequence coding for the first 407 amino acid residues in each polypeptide chain. The predominant gamma chain contains an additional four amino acids on its carboxyl-terminal end (residues 408-411). These four amino acids, together with the 3' noncoding sequences, are encoded by the tenth exon. Removal of the ninth intervening sequence following the processing and polyadenylation reactions yields a mature mRNA coding for the predominant gamma chain. The less prevalent gamma' chain contains 20 amino acids at its carboxyl-terminal end (residues 408-427). These 20 amino acids are encoded by the immediate 5' end of the ninth intervening sequence. This results from an occasional processing and polyadenylation reaction that occurs within the region normally constituting the ninth intervening sequence. Accordingly, the gene for the gamma chain of human fibrinogen gives rise to two mRNAs that differ in sequence on their 3' ends. These mRNAs code for polypeptide chains with different carboxyl-terminal sequences. Both of these polypeptides are incorporated into the fibrinogen molecule present in plasma.

14.
Children who are retarded readers may present a complex problem involving physical impediments, emotional distress, or teaching methods. A child with specific reading disability has spatial confusion, an exaggeration or persistence of a normal childhood tendency to reversal of letters and symbols, ambidexterity, normal intelligence, and poor visual recall of words. Children with these characteristics fail to learn to read in a teaching system in which the main emphasis is on visual associations. Treatment of such reading difficulties, as well as prophylactic measures, is outlined.

15.
It is known that different codons may be unified into larger groups related to the hierarchical structure, approximate hidden symmetries, and evolutionary origin of the universal genetic code. Using a simplified, evolutionarily motivated two-letter version of the genetic code, the general principles of the most stable coding are discussed. By complete enumeration in such a reduced code it is strictly proved that the maximum stability with respect to point mutations and shifts in the reading frame requires the fixation of the middle letters within codons in groups with different physico-chemical properties, thus explaining a key feature of the universal genetic code. The translational stability of the genetic code is studied by mapping the code onto a de Bruijn graph, providing both a compact visual representation of the mutual relationships between different codons, as well as between codons and the protein coding DNA sequence, and a powerful tool for the investigation of the stability of protein coding. Then, the results are extended to four-letter codes. As is shown, the universal genetic code obeys mainly the principles of optimal coding. These results demonstrate the hierarchical character of the optimization of the universal genetic code, with strictly optimal coding having evolved at the earliest stages of molecular evolution. Finally, the universal genetic code is compared with the other natural variants of genetic codes.

16.
A model for topological coding of proteins is proposed. The model is based on the capacity of hydrogen bonds (property of connectivity) to fix conformations of protein molecules. The protein chain is modeled by an n-arc graph with the following elements: vertices (alpha-carbon atoms), structural edges (peptide bonds) and connectivity edges (virtual edges connecting non-adjacent atoms). It was shown that 64 conformations of the 4-arc graph can be described in the binary system by matrices of six variables which form a supermatrix containing four blocks. On the basis of correspondences between the pairs of variables in matrices and four letters of the genetic code, matrices and supermatrix are converted, respectively, into the triplets and the table of the genetic code. An algorithm admitting computer programming is proposed for coding the n-arc graph and protein chain. Connectivity operators (polar amino acids) are assigned to blocks of triplets coding for cyclic conformations (G or A in the second position), while anti-connectivity operators (non-polar amino acids) correspond to blocks of triplets coding for open conformations (C or U in the second position). Amino acids coded by triplets differing by the first base have different structures. The third base for C, U and G, A is degenerate. Properties of the real genetic code are in full agreement with the model. The model provides an insight into the topological nature of the genetic code and can be used for development of algorithms for the prediction of the protein structure.

17.
18.
In the process of analysing the four available complete archaeal genomes, we have noted that certain regions characterised as 'non-coding' exhibit significant sequence similarity to other protein sequences from Archaea and other species. Using established technology, we have identified a number of potential protein coding regions in these putative 'non-coding' regions. We have detected 524 such cases, of which 113 regions appear to code for proteins present in archaeal or other species, while the remaining 411 regions are mostly start/stop definition conflicts. Of the 113 protein coding regions, only 21 code for proteins with homologues of known function. The number of novel coding sequences identified herein amounts to 1.5% of the total genome entries, while the conflicting cases represent an additional 5%. The observed differences between the four complete archaeal genomes seem to reflect disparate approaches to genome annotation. Genome sequence collections should be regularly checked to improve gene prediction by sequence similarity and greater effort is required to make gene definitions consistent across related species.

19.
In the process of making full-length cDNA, predicting protein coding regions helps both in the preliminary analysis of genes and in any succeeding process. However, unfinished cDNA contains artifacts including many sequencing errors, which hinder the correct evaluation of coding sequences. In particular, predictions for short sequences are difficult because they provide little information for evaluating coding potential. In this paper, we describe ANGLE, a new program for predicting coding sequences in low quality cDNA. To achieve error-tolerant prediction, ANGLE uses a machine-learning approach, which represents coding sequences better by maximizing the use of the limited information in input sequences. Our method utilizes not only codon usage, but also protein structure information, which is difficult to use in stochastic model-based algorithms, and optimizes the limited information from a short segment when deciding coding potential, with the result that predictive accuracy does not depend on the length of an input sequence. The performance of ANGLE is compared with ESTSCAN on four datasets, each with a different error rate (one frame-shift error or one substitution error per 200-500 nucleotides), and on one dataset which has no errors. ANGLE outperforms ESTSCAN by 9.26% in average Matthews correlation coefficient on the short-sequence dataset (< 1000 bases). On the long-sequence dataset, ANGLE achieves comparable performance.
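
For readers unfamiliar with the metric used in the comparison above, the Matthews correlation coefficient can be computed from a binary confusion matrix as sketched below. This is the standard definition, not code from ANGLE or ESTSCAN; the function name `mcc` and the convention of returning 0.0 when the denominator vanishes are choices made for this illustration.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)),
    ranging from -1 (total disagreement) to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Convention assumed here: undefined cases are reported as 0.0.
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts, e.g. nucleotides labelled coding vs. non-coding.
    print(mcc(tp=90, tn=80, fp=20, fn=10))
```

Unlike plain accuracy, the coefficient stays informative when coding and non-coding positions are strongly imbalanced.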

20.
Cloned DNAs encoding four different proteins have been isolated from recombinant cDNA libraries constructed with Glycine max seed mRNAs. Two cloned DNAs code for the alpha and alpha'-subunits of the 7S seed storage protein (conglycinin). The other cloned cDNAs code for proteins which are synthesized in vitro as 68,000 d., 60,000 d. or 53,000 d. polypeptides. Hybrid selection experiments indicate that, under low stringency hybridization conditions, all four cDNAs hybridize with mRNAs for the alpha and alpha'-subunits and the 68,000 d., 60,000 d. and 53,000 d. in vitro translation products. Within three of the mRNAs, there is a conserved sequence of 155 nucleotides which is responsible for this hybridization. The conserved nucleotides in the alpha and alpha'-subunit cDNAs and the 68,000 d. polypeptide cDNAs span both coding and noncoding sequences. The differences in the coding nucleotides outside the conserved region are extensive. This suggests that selective pressure to maintain the 155 conserved nucleotides has been influenced by the structure of the seed mRNA. RNA blot hybridizations demonstrate that mRNA encoding the other major subunit (beta) of the 7S seed storage protein also shares sequence homology with the conserved 155 nucleotide sequence of the alpha and alpha'-subunit mRNAs, but not with other coding sequences.
