Similar literature
1.
A novel multivariate statistical approach is presented for extracting and exploiting intrinsic information present in our ever-growing sequence data banks. The information extraction from the sequences avoids the pitfalls of intersequence alignment by analyzing secondary invariant functions derived from the sequences in the data bank rather than the sequences themselves. A typical invariant function of this kind is a 20 x 20 histogram of occurrences of amino acid pairs in a given sequence or fragment thereof. To illustrate the potential of the approach, an analysis of 10,000 protein sequences from the National Biomedical Research Foundation Protein Identification Resource is presented; it already reveals great biological detail. For example, zeta-hemoglobin is found to lie close to amphibian and fish chi-hemoglobin which, in turn, is an important clue to the physiological function of this mammalian early embryonic hemoglobin. The multivariate statistical framework presented unifies such apparently unrelated issues as phylogenetic comparisons between a set of sequences and distance matrices between the constituents of the biological sequences. The Multivariate Statistical Sequence Analysis (MSSA) principles can be used for a wide spectrum of sequence analysis problems such as: assignment of family memberships to new sequences, validation of new incoming sequences to be entered into the database, prediction of structure from sequence, discrimination of coding from non-coding DNA regions, and automatic generation of an atlas of protein or DNA sequences. The MSSA techniques represent a self-contained approach to learning continuously and automatically from the growing stream of new sequences. The MSSA approach is particularly likely to play a significant role in major sequencing efforts such as the human genome project.
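
The 20 x 20 pair histogram described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the MSSA implementation: the function name `pair_histogram`, the restriction to adjacent residue pairs, and the alphabetical ordering of the residues are assumptions made here, since the abstract does not specify them.

```python
# Hypothetical sketch: count adjacent amino acid pairs in a protein sequence.
# The residue ordering and the "adjacent pairs only" choice are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes


def pair_histogram(sequence):
    """Return a 20 x 20 list-of-lists counting adjacent residue pairs."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * len(AMINO_ACIDS) for _ in AMINO_ACIDS]
    for first, second in zip(sequence, sequence[1:]):
        # Skip pairs that contain a non-standard symbol such as X or B.
        if first in index and second in index:
            matrix[index[first]][index[second]] += 1
    return matrix


if __name__ == "__main__":
    hist = pair_histogram("ACAC")
    print(hist[0][1], hist[1][0])  # counts for the pairs AC and CA
```

Because the histogram has a fixed size regardless of sequence length, sequences can be compared as vectors without first aligning them, which is the property the abstract emphasizes.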

2.
Goto N  Kurokawa K  Yasunaga T 《Gene》2007,401(1-2):172-180
To date, the complete genome sequences of more than 250 organisms have been determined. This information can now be used to determine whether there exist any invariant sequences that are conserved among all organisms, from bacteria to plants, animals, and humans. The existence of invariant sequences would strongly suggest that these sequences have been inherited unchanged from the last common ancestor of all life, and that they have essential functions. We have developed a new software program to identify invariant sequences conserved among the currently sequenced genomes and applied this analysis to the complete genome sequences of 266 organisms. We have identified 3 invariant DNA sequences longer than or equal to 11 bp and 6 invariant amino acid sequences longer than or equal to 6 aa. The longest invariant DNA sequence, AAGTCGTAACAAGGT (15 bp), was found in the 16S/18S rRNA gene. Two 8 aa sequences, GHVDHGKT in IF2 and EF-Tu and DTPGHVDF in EF-G, were the longest invariant amino acid sequences detected. These sequences could be essential elements from the genome of the last common ancestor and may have remained unchanged throughout evolution.
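
The notion of an invariant sequence, a substring present in every genome or proteome examined, can be illustrated with a set intersection over k-mers. This is a toy sketch under stated assumptions, not the authors' program: the names `kmers` and `invariant_kmers` are hypothetical, the inputs are short strings rather than whole genomes, and reverse complements of DNA are not considered.

```python
# Hypothetical sketch: find k-mers shared by every sequence in a collection.
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def invariant_kmers(sequences, k):
    """Return the k-mers that occur in every sequence (empty set if none)."""
    if not sequences:
        return set()
    common = kmers(sequences[0], k)
    for seq in sequences[1:]:
        common &= kmers(seq, k)  # keep only k-mers also present in seq
    return common


if __name__ == "__main__":
    # Made-up toy sequences that embed the 8 aa motif quoted in the abstract.
    toy = ["GGHVDHGKTAA", "TTGHVDHGKTC", "GHVDHGKT"]
    print(invariant_kmers(toy, 8))
```

For real genomes the k-mer sets would be far too large to hold naively in memory, so a production tool would need an indexed or streaming approach.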

3.
We consider the construction of a characteristic distribution of an L-tuple in a DNA sequence. A mathematical characteristic of this distribution is selected as an invariant that characterizes the L-tuple. With the invariant, we can perform sequence comparison. The utility of the approach is illustrated by graphs of the characteristic distributions of the dinucleotide GC for the coding sequences of the first exon of the beta-globin gene of eleven different species, and by the construction of a phylogenetic tree of twenty-four coronavirus genomes.

4.
To date, methanogens are the only group within the archaea where firing DNA replication origins have not been demonstrated in vivo. In the present study we show that a previously identified cluster of ORB (origin recognition box) sequences does indeed function as an origin of replication in vivo in the archaeon Methanothermobacter thermautotrophicus. Although the consensus sequence of ORBs in M. thermautotrophicus is somewhat conserved when compared with ORB sequences in other archaea, the Cdc6-1 protein from M. thermautotrophicus (termed MthCdc6-1) displays sequence-specific binding that is selective for the MthORB sequence and does not recognize ORBs from other archaeal species. Stabilization of in vitro MthORB DNA binding by MthCdc6-1 requires additional conserved sequences 3' to those originally described for M. thermautotrophicus. By testing synthetic sequences bearing mutations in the MthORB consensus sequence, we show that Cdc6/ORB binding is critically dependent on the presence of an invariant guanine found in all archaeal ORB sequences. Mutation of a universally conserved arginine residue in the recognition helix of the winged helix domain of archaeal Cdc6-1 shows that specific origin sequence recognition is dependent on the interaction of this arginine residue with the invariant guanine. Recognition of a mutated origin sequence can be achieved by mutation of the conserved arginine residue to a lysine or glutamine residue. Thus despite a number of differences in protein and DNA sequences between species, the mechanism of origin recognition and binding appears to be conserved throughout the archaea.

5.
While remarkably complex networks of connected DNA molecules can form from a relatively small number of distinct oligomer strands, a large computational space created by DNA reactions would ultimately require the use of many distinct DNA strands. The automatic synthesis of this many distinct strands is economically prohibitive. We present here a new approach to producing distinct DNA oligomers based on the polymerase chain reaction (PCR) amplification of a few random template sequences. As an example, we designed a DNA template sequence consisting of a 50-mer random DNA segment flanked by two 20-mer invariant primer sequences. Amplification of a dilute sample containing about 30 different template molecules allows us to obtain around 10^11 copies of these molecules and their complements. We demonstrate the use of these amplicons to implement some of the vector operations that will be required in a DNA implementation of an analog neural network.

6.
Predictive motifs derived from cytosine methyltransferases.
Thirteen bacterial DNA methyltransferases that catalyze the formation of 5-methylcytosine within specific DNA sequences possess related structures. Similar building blocks (motifs), containing invariant positions, can be found in the same order in all thirteen sequences. Five of these blocks are highly conserved while a further five contain weaker similarities. One block, which has the most invariant residues, contains the proline-cysteine dipeptide of the proposed catalytic site. A region in the second half of each sequence is unusually variable both in length and sequence composition. Those methyltransferases that exhibit significant homology in this region share common specificity in DNA recognition. The five highly conserved motifs can be used to discriminate the known 5-methylcytosine forming methyltransferases from all other methyltransferases of known sequence, and from all other identified proteins in the PIR, GenBank and EMBL databases. These five motifs occur in a mammalian methyltransferase responsible for the formation of 5-methylcytosine within CG dinucleotides. By searching the unidentified open reading frames present in the GenBank and EMBL databases, two potential 5-methylcytosine forming methyltransferases have been found.

7.
Cloning of the pigeon invariant chain gene by the RACE method
刘岗  仲大莲  刘雪兰  余为一 《遗传》2008,30(1):77-80
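To compare the structure and function of avian invariant chains, the pigeon invariant chain gene was cloned and characterized for the first time using the RACE (Rapid Amplification of cDNA Ends) technique. First, a pair of degenerate primers containing highly conserved DNA segments was used to amplify a partial invariant chain fragment from pigeon spleen cell RNA. The fragment was then sequenced, and new primers were designed to extend it by 5′ and 3′ RACE. Finally, upstream and downstream primers designed from the full gene sequence were used to obtain a full-length cDNA of 1,050 bp. Comparison of nucleotide sequences showed that the pigeon Ii chain shares 82.8% homology with that of chicken and more than 52.0% homology with those of human and other animals. The 633 bp open reading frame encodes a precursor protein of 211 amino acid residues. Deduction and analysis of the amino acid sequence indicated that the molecular structure is similar to that of the chicken invariant chain, with some amino acid residues showing relatively high conservation.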

8.
Although probabilistic models of genotype (e.g., DNA sequence) evolution have been greatly elaborated, less attention has been paid to the effect of phenotype on the evolution of the genotype. Here we propose an evolutionary model and a Bayesian inference procedure that are aimed at filling this gap. In the model, RNA secondary structure links genotype and phenotype by treating the approximate free energy of a sequence folded into a secondary structure as a surrogate for fitness. The underlying idea is that a nucleotide substitution resulting in a more stable secondary structure should have a higher rate than a substitution that yields a less stable secondary structure. This free energy approach incorporates evolutionary dependencies among sequence positions beyond those that are reflected simply by jointly modeling change at paired positions in an RNA helix. Although there is not a formal requirement with this approach that secondary structure be known and nearly invariant over evolutionary time, computational considerations make these assumptions attractive and they have been adopted in a software program that permits statistical analysis of multiple homologous sequences that are related via a known phylogenetic tree topology. Analyses of 5S ribosomal RNA sequences are presented to illustrate and quantify the strong impact that RNA secondary structure has on substitution rates. Analyses on simulated sequences show that the new inference procedure has reasonable statistical properties. Potential applications of this procedure, including improved ancestral sequence inference and location of functionally interesting sites, are discussed.

9.
A new class of lowly repetitive DNA sequences has been detected in the primate genome. The renaturation rate of this sequence class is practically indistinguishable from the renaturation rate of single-copy sequences. Consequently, this lowly repetitive sequence class has not been previously observed in DNA renaturation rate studies. This new sequence class is significant in that it might occupy a major fraction of the primate genome. Based on a study of the thermal stabilities of DNA heteroduplexes constructed from human DNA and either bonnet monkey or galago DNAs, we are able to compare the relative mutation rates of repetitive and single-copy sequences in the primate genome. We find that the mutation rate of short, interspersed repetitive sequences is either less than or approximately equal to the mutation rate of single-copy sequences. This implies that the base sequence of these repetitive sequences is important to their biological function. We also find that numerous mutations have accumulated in interspersed repeated sequences since the divergence of galago and human. These mutations are only recognizable because they occur at specific sites in the repeated sequence rather than at random sites in the sequence. Although interspersed repetitive sequences from human and galago can readily cross-hybridize, these site-specific mutations identify them as being two distinct classes. In contrast, far fewer site-specific mutations have occurred since the divergence of human and monkey.

10.
Chaos game representation of gene structure.
This paper presents a new method for representing DNA sequences. It permits the representation and investigation of patterns in sequences, visually revealing previously unknown structures. Based on a technique from chaotic dynamics, the method produces a picture of a gene sequence which displays both local and global patterns. The pictures have a complex structure which varies depending on the sequence. The method is termed Chaos Game Representation (CGR). CGR raises a new set of questions about the structure of DNA sequences, and is a new tool for investigating gene structure.
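
The abstract does not spell out the construction, so the following is a sketch of the chaos game as it is commonly described for DNA: each nucleotide is assigned a corner of the unit square, and every base moves the current point halfway toward its corner. The corner assignment, the starting point at the centre of the square, and the function name `cgr_points` are assumptions made for illustration.

```python
# Hypothetical sketch of a chaos game representation (CGR) for a DNA string.
# Corner assignment below is an assumed convention, not taken from the abstract.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence):
    """Return the list of (x, y) points visited by the chaos game."""
    x, y = 0.5, 0.5  # start at the centre of the unit square
    points = []
    for base in sequence.upper():
        if base not in CORNERS:
            continue  # ignore ambiguous symbols such as N
        corner_x, corner_y = CORNERS[base]
        # Move halfway from the current point toward the base's corner.
        x, y = (x + corner_x) / 2, (y + corner_y) / 2
        points.append((x, y))
    return points


if __name__ == "__main__":
    for point in cgr_points("AG"):
        print(point)
```

Plotting the returned points as a scatter plot gives the picture the abstract refers to. Each point's position encodes the recent history of bases, which is why both local and global patterns become visible.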

11.
DNA Strider is a new integrated DNA and protein sequence analysis program written in C for the Macintosh Plus, SE and II computers. It has been designed as an easy to learn and use program as well as a fast and efficient tool for day-to-day sequence analysis work. The program consists of a multi-window sequence editor and various DNA and protein analysis functions. The editor may use 4 different types of sequences (DNA, degenerate DNA, RNA and one-letter coded protein) and can handle simultaneously 6 sequences of any type up to 32.5 kB each. Negative numbering of the bases is allowed for DNA sequences. All classical restriction and translation analysis functions are present and can be performed in any order on any open sequence or part of a sequence. The main feature of the program is that the same analysis function can be repeated several times on different sequences, thus generating multiple windows on the screen. Many graphic capabilities have been incorporated, such as a graphic restriction map, a hydrophobicity profile and the CAI plot (codon adaptation index according to Sharp and Li). The restriction site search uses a newly designed fast hexamer look-ahead algorithm. Typical runtime for the search of all sites with a library of 130 restriction endonucleases is 1 second per 10,000 bases. The circular graphic restriction map of the pBR322 plasmid can therefore be computed from its sequence and displayed on the Macintosh Plus screen within 2 seconds, and its multiline restriction map obtained in a scrolling window within 5 seconds.

12.
13.
Currently, there is no effective therapy for cryptosporidiosis, and it is unclear why antifolate drugs, which are effective treatments for infections caused by closely related parasites, are not also effective against Cryptosporidium parvum. In protozoa, the target of these drugs, dihydrofolate reductase (DHFR), exists as a bifunctional enzyme also manifesting thymidylate synthase (TS) activity and is encoded by a fused DHFR-TS gene. In order to prepare a probe to isolate the C. parvum DHFR-TS gene, we have used degenerate oligonucleotides whose sequences are based on strongly conserved regions of TS protein sequence to prime the polymerase chain reaction (PCR) with C. parvum DNA. The PCR amplified a 375-bp DNA fragment which was cloned and sequenced; the deduced amino acid sequence had significant identity with known TS sequences, including strict conservation of all phylogenetically invariant TS amino acid residues. The cloned PCR fragment was used as a probe to isolate a number of overlapping clones from a C. parvum genomic library which were definitively shown to be of cryptosporidial origin by genomic Southern and molecular karyotype analyses. The deduced protein sequence of C. parvum TS was most similar to the bifunctional TS enzymes of Plasmodium chabaudi and Plasmodium falciparum.

14.
Nuclear DNA of metazoans is organized in supercoiled loops anchored to a proteinaceous substructure known as the nuclear matrix (NM). DNA is anchored to the NM by non-coding sequences known as matrix attachment regions (MARs). There are no consensus sequences for identification of MARs, and not all potential MARs are actually bound to the NM constituting loop attachment regions (LARs). Fundamental processes of nuclear physiology occur at macromolecular complexes organized on the NM; thus, the topological organization of DNA loops must be important. Here, we describe a general method for determining the structural DNA loop organization in any large genomic region with a known sequence. The method exploits the topological properties of loop DNA attached to the NM and elementary topological principles: points in a deformable string (DNA) can be positionally mapped relative to a position-reference invariant (NM), and from such a mapping the configuration of the string in the third dimension can be deduced. Therefore, it is possible to determine the specific DNA loop configuration without previous characterization of the LARs involved. In hepatocytes and B-lymphocytes of the rat, we determined the DNA loop organization of a genomic region that contains four members of the albumin gene family.

15.
The basic-helix-loop-helix-zipper (bHLH-Zip) motif is a conserved region of approximately 70 amino acids that mediates both sequence-specific DNA binding and protein dimerization. This motif is found in protein sequences from many eukaryotic organisms and is contained in the protein sequence of the oncogene myc and its partner max, and a shortened version of the motif (bHLH) is found in the muscle determination factor myoD and its partner E12. An evaluation of the conserved amino acids that define the motif coupled with the published mutagenic studies of this region has led to our formulation of a molecular model for the binding of this motif as a dimer to specific sequences of DNA. This model has the dimeric protein interacting with an abutted, dyad-symmetric DNA sequence. Helix 2 of each monomer is modeled as a coiled-coil extension of the C-terminal "leucine zipper." Helix 1 does not interact with helix 1 from its partner in the dimer but with the hydrophobic surface created when the helix 2 regions of the dimer interact with each other as a coiled-coil. Sequence-specific interactions are proposed between the basic region and the invariant cis elements that all bHLH-Zip proteins bind.

16.
Ten new wheat γ-gliadin gene sequences are reported and an analysis of γ-gliadin gene family structure is carried out using all known γ-gliadin sequences. The new sequences comprise four genomic clones with significantly more flanking DNA than previously reported, and six cDNA clones from a wheat endosperm EST project. Analysis of extended flanking DNA from the genomic clones indicates limits of conservation of γ-gliadin DNA sequence that are similar to those previously found with other gliadin and glutenin genes and that are theorized to define the DNA sequence necessary for gene control. Most of the flanking DNA is not homologous to any reported DNA sequence, and one flanking region contains the first MITE-like (miniature inverted transposable element) DNA sequence associated with gliadin genes. About a quarter of the encoded polypeptides would contain a free cysteine residue, an observation that may relate to reports that at least some gliadins can participate in wheat endosperm glutenin polymer formation. The new sequences represent both genes closely related to those previously reported and a new sub-class of γ-gliadins.

17.
A new measure of DNA sequence information
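Based on information theory, a new method for measuring the information in DNA sequences is given, yielding information measures at four levels: Ib, If(1), If(2) and If(3). These four measures can be used to quantify the information content of, respectively, the base sequence, the codon sequence, the encoded protein sequence, and the functional protein sequence of DNA. Results computed from two relatively short protein-coding DNA sequences in the mitochondrial genome of M. edulis, and from simulated DNA sequences composed of degenerate codon groups with different degrees of degeneracy, show that these information measures can indeed be used to reveal …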

18.
D M Peffley  M L Sogin 《Biochemistry》1981,20(14):4015-4021
Using a total tRNA population labeled with 32P, we have cloned a number of tRNA genes from Dictyostelium discoideum. A partial sequence of a cloned 1250-base-pair DNA insert, pDT-513, revealed the occurrence of a putative tRNA(Trp) gene. In addition to the cloverleaf secondary structure, the tRNA(Trp) gene contains all of the invariant and semi-invariant residues found in most tRNA sequences and has a 13-base-pair intron which is located one base removed from the 3' residue of the anticodon. The genomic distribution of the tRNA gene and its flanking sequences was examined via Southern annealing experiments. The structural gene is represented on at least six EcoRI fragments in the D. discoideum genome. Sequences flanking the 5' terminus of the cloned gene are repeated many times in the genome, while the sequence flanking the 3' terminus of the pDT-513 DNA insert structural tRNA gene is present only once in the genome.

19.
DNA sequence is an important determinant of the positioning, stability, and activity of nucleosomes, yet the molecular basis of these effects remains elusive. A "consensus DNA sequence" for nucleosome positioning has not been reported and, while certain DNA sequence preferences or motifs for nucleosome positioning have been discovered, how they function is not known. Here, we report that an unexpected observation concerning the reassembly of nucleosomes during salt gradient dialysis has allowed a breakthrough in our efforts to identify the nucleosomal locations of the DNA sequence motifs that dominate histone-DNA interactions and nucleosome positioning. We conclude that a previous selection experiment for high-affinity, nucleosome-forming DNA sequences exerted selective pressure chiefly on the central stretch of the nucleosomal DNA. This observation implies that algorithms for aligning the selected DNA sequences should seek to optimize the alignment over much less than the full 147 bp of nucleosomal DNA. A new alignment calculation implemented these ideas and successfully aligned 19 of the 41 sequences in a non-redundant database of selected high-affinity, nucleosome-positioning sequences. The resulting alignment reveals strong conservation of several stretches within a central 71 bp of the nucleosomal DNA. The alignment further reveals an inherent palindromic symmetry in the selected DNAs; it makes testable predictions of nucleosome positioning on the aligned sequences and for the creation of new positioning sequences, both of which are upheld experimentally; and it suggests new signals that may be important in translational nucleosome positioning.

20.
This paper presents a new approach for modeling DNA sequences for the purpose of exon detection. The proposed model adopts the sum-of-sinusoids concept for the representation of DNA sequences. The objective of the modeling process is to represent the DNA sequence with few coefficients. The modeling process can be performed on the DNA signal as a whole or on a segment-by-segment basis. The created models can be used instead of the original sequences in a further spectral estimation process for exon detection. The accuracy of modeling is evaluated by using the Root Mean Square Error (RMSE) and the R-square metrics. In addition, non-parametric spectral estimation methods are used for estimating the spectra of both original and modeled DNA sequences. The results of exon detection based on original and modeled DNA sequences coincide to a great extent, which supports the suitability of the proposed sum-of-sinusoids method for modeling DNA sequences.
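
The two accuracy metrics named in this abstract, RMSE and R-square, have standard definitions that can be sketched as follows. The function names and the example values are made up for illustration, and the abstract does not say how the DNA sequence is mapped to a numeric signal, so the inputs here are simply two equal-length lists of numbers.

```python
# Hypothetical sketch of the standard RMSE and R-square definitions.
import math


def rmse(actual, modeled):
    """Root mean square error between two equal-length numeric sequences."""
    squared = [(a - m) ** 2 for a, m in zip(actual, modeled)]
    return math.sqrt(sum(squared) / len(actual))


def r_squared(actual, modeled):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - m) ** 2 for a, m in zip(actual, modeled))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot  # undefined when actual is constant


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0]  # made-up numeric signal
    model = [1.0, 2.0, 3.0, 6.0]   # made-up model output
    print(rmse(signal, model), r_squared(signal, model))
```

A lower RMSE and an R-square closer to 1 both indicate that the model reproduces the original signal more closely.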
