Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Since the base composition of translational stop codons (TAG, TAA, and TGA) is biased toward a low G+C content, a differential density for these termination signals is expected in random DNA sequences of different base compositions. The expected length of reading frames (DNA segments of sense codons flanked by in-phase stop codons) in random sequences is thus a function of GC content. The analysis of DNA sequences from several genome databases stratified according to GC content reveals that the longest coding sequences (exons in vertebrates and genes in prokaryotes) are GC-rich, while the shortest ones are GC-poor. Exon lengthening in GC-rich vertebrate regions does not result, however, in longer vertebrate proteins, perhaps because of the lower number of exons in the genes located in these regions. The effects on coding-sequence lengths constitute a new evolutionary meaning for compositional variations in DNA GC content. Correspondence to: J. L. Oliver
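
The dependence of reading-frame length on GC content described in the abstract above can be illustrated with a small calculation. This is a minimal sketch, not the authors' method: it assumes independent bases with P(A) = P(T) and P(G) = P(C), and models the number of sense codons before the first in-phase stop as a geometric variable. The function names are hypothetical.

```python
def stop_codon_probability(gc: float) -> float:
    """Probability that a random codon is TAA, TAG or TGA at GC fraction `gc`.

    Assumes independent bases with P(A) = P(T) and P(G) = P(C).
    """
    a = t = (1.0 - gc) / 2.0  # probability of A and of T
    g = gc / 2.0              # probability of G (and of C)
    # P(TAA) + P(TAG) + P(TGA)
    return t * a * a + t * a * g + t * g * a


def expected_sense_codons(gc: float) -> float:
    """Mean number of sense codons before the first in-phase stop codon.

    Geometric model; undefined (division by zero) when gc == 1.0.
    """
    p = stop_codon_probability(gc)
    return (1.0 - p) / p


for gc in (0.3, 0.5, 0.7):
    print(f"GC={gc:.1f}  P(stop)={stop_codon_probability(gc):.4f}  "
          f"expected sense codons={expected_sense_codons(gc):.1f}")
```

Under these assumptions the stop-codon probability falls as GC content rises, so the expected open reading frame grows, which matches the direction of the trend reported in the abstract.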

2.
In a number of programs for gene structure prediction in higher eukaryotic genomic sequences, exon prediction is decoupled from gene assembly: a large pool of candidate exons is predicted and scored from features located in the query DNA sequence, and candidate genes are assembled from such a pool as sequences of nonoverlapping frame-compatible exons. Genes are scored as a function of the scores of the assembled exons, and the highest scoring candidate gene is assumed to be the most likely gene encoded by the query DNA sequence. Considering additive gene scoring functions, currently available algorithms to determine such a highest scoring candidate gene run in time proportional to the square of the number of predicted exons. Here, we present an algorithm whose running time grows only linearly with the size of the set of predicted exons. Polynomial algorithms rely on the fact that, while scanning the set of predicted exons, the highest scoring gene ending in a given exon can be obtained by appending the exon to the highest scoring among the highest scoring genes ending at each compatible preceding exon. The algorithm here relies on the simple fact that such a highest scoring gene can be stored and updated. This requires scanning the set of predicted exons simultaneously by increasing acceptor and donor position. On the other hand, the algorithm described here does not assume an underlying gene structure model. Indeed, the definition of valid gene structures is externally defined in the so-called Gene Model. The Gene Model simply specifies which gene features are allowed immediately upstream of which other gene features in valid gene structures. This allows for great flexibility in formulating the gene identification problem. In particular, it allows for multiple-gene two-strand predictions and for considering gene features other than coding exons (such as promoter elements) in valid gene structures.
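
The scanning idea in the abstract above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the published algorithm: exons are hypothetical `(acceptor, donor, score)` tuples with `acceptor <= donor`, the only compatibility rule is that exons must not overlap (reading frame and the Gene Model are ignored), and scoring is additive. The function name `best_gene` is made up for this example.

```python
from typing import List, Tuple

Exon = Tuple[int, int, float]  # (acceptor position, donor position, score)


def best_gene(exons: List[Exon]) -> Tuple[float, List[Exon]]:
    """Return the highest additive score and the exon chain that achieves it."""
    n = len(exons)
    if n == 0:
        return 0.0, []
    # Two views of the same pool: by increasing acceptor and by increasing donor.
    by_acceptor = sorted(range(n), key=lambda i: exons[i][0])
    by_donor = sorted(range(n), key=lambda i: exons[i][1])

    best = [0.0] * n   # best[i]: score of the best gene ending at exon i
    prev = [-1] * n    # prev[i]: preceding exon in that gene, or -1
    running_score, running_idx = 0.0, -1  # best gene seen so far that ends upstream
    j = 0
    for i in by_acceptor:
        acceptor = exons[i][0]
        # Advance the donor-sorted pointer; each exon is visited here only once,
        # so the scan itself is linear in the number of exons.
        while j < n and exons[by_donor[j]][1] < acceptor:
            k = by_donor[j]
            if best[k] > running_score:
                running_score, running_idx = best[k], k
            j += 1
        best[i] = exons[i][2] + running_score
        prev[i] = running_idx

    # Trace back from the best-scoring final exon.
    end = max(range(n), key=lambda i: best[i])
    chain = []
    k = end
    while k != -1:
        chain.append(exons[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain
```

The two sorts cost O(n log n) in this sketch; the linear bound applies to the scan once both orderings are available. Storing a single running best replaces the inner loop over all preceding exons that makes the naive dynamic program quadratic.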

3.
Structure of the human type I DNA topoisomerase gene   Total citations: 7 (self-citations: 0; citations by others: 7)
We describe the molecular organization of the human gene coding for type I DNA topoisomerase. The coding sequence is split into 21 exons distributed over at least 85 kilobase pairs (kb) of human genomic DNA. The sizes of the 20 introns vary widely between 0.2 and at least 30 kb and all contain the sequence elements known to be required for pre-mRNA splicing. Several of the intron sequences separate exons encoding parts of the enzyme that are highly conserved between human and yeast, suggesting that at least some of the exons may code for individual, structurally or functionally important domains of the enzyme. We also describe the promoter sequence of the human topoisomerase I gene and show that it is composed of distinct functional elements.

4.
Codon usage tables have been produced for E. coli, yeast, human, and mouse. The nonrandom employment of codons allows assignment of probability values to trinucleotides in any DNA sequence. These values represent the probability that a given trinucleotide is used as a codon in the organism from which the table is derived. For the graphical delineation of coding areas in DNA sequences, a probability is assigned to each trinucleotide equal to its frequency in the codon table. Averaging and smoothing procedures then greatly enhance the detectability of areas of high average codon probability and better represent the mean codon probability. These manipulations increase graphical clarity without altering the overall magnitude of probabilities. Averaging introduces an error of less than 0.5% between "raw" and smoothed data. This graphical delineation of coding sequences does not depend on the presence of punctuation, ribosomal binding sites, etc.; moreover, the delineation of introns and exons is also possible.
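
The per-trinucleotide lookup and smoothing described in the abstract above might look roughly like the following. This is a minimal sketch: the abstract does not specify the averaging procedure, so a centered moving average is assumed here, and the function names and the toy frequency table in the test are hypothetical.

```python
from typing import Dict, List


def codon_profile(seq: str, table: Dict[str, float], frame: int = 0) -> List[float]:
    """Look up each in-frame trinucleotide's frequency in a codon usage table.

    Trinucleotides missing from the table are given probability 0.0.
    """
    seq = seq.upper()
    return [table.get(seq[i:i + 3], 0.0) for i in range(frame, len(seq) - 2, 3)]


def smooth(values: List[float], window: int = 3) -> List[float]:
    """Centered moving average; the window is truncated at both ends."""
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

Plotting `smooth(codon_profile(seq, table, frame))` for each of the three frames would give one curve per frame; regions where one frame stays consistently high are candidates for coding sequence.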

5.
Computer methods for the complete and accurate detection of genes in vertebrate genomic sequences are still a long way from perfection. The intermediate task of identifying the coding moiety of genes (coding exons) is now reasonably well achieved using a combination of methods. After reviewing the intrinsic difficulties in interpreting vertebrate genomic sequences, this article presents the state of the art, with an emphasis on similarity search methods and the resources available through the Internet.

9.
Efficient selection of 3'-terminal exons from vertebrate DNA.   Total citations: 5 (self-citations: 2; citations by others: 3)
Identification of expressed sequences within genomic DNA is a hurdle in the characterization of complex genomes. We developed an exon trapping scheme that provides a positive selection for vertebrate 3'-terminal exons. A copy of the trapped exon sequence is obtained by RT/PCR amplification. The technique detects valid terminal exons without interference from partial exons or non-specific sequences, including simple human repeated sequences. Application to random human cosmids yielded one unique trapped terminal exon per cosmid on average. Because vertebrate terminal exons average 600-700 nucleotides in length, the technique provides transcribed sequences of sufficient length to assist further mapping efforts.

12.
T Ord, M Kolmer, R Villems, M Saarma. Gene 1990, 91(2):241-246
Two human genomic libraries were probed with bovine prochymosin (bPC) cDNA. Recombinant clones covering a genomic region homologous to the entire coding region and flanking sequences of the bPC gene were isolated. Human sequences homologous to exons of the bPC gene are distributed in a DNA fragment of 10 kb. Alignment of the human sequences and the exons of bPC reveals that the human 'exons' 1-3, 5 and 7-9 have sizes identical to the corresponding bovine exons, but a nucleotide (nt) has been deleted in the human exon 4 and two nt in the human exon 6. The aligned human sequence and the coding part of bPC gene share 82% nt homology, the value ranging, in separate exons, from 76 (exon 1) to 84% (exons 5 and 6). 150 bp of 5'-flanking sequence of the human gene has 75% homology to the corresponding region of bPC gene and contains a TATA-box in a similar position. A 1-nt deletion in the human exon 4 would shift the translational reading frame of a putative human PC mRNA relative to bPC mRNA, and result in an in-phase terminator spanning codons 163 and 164 in bPC mRNA. Another terminator in-phase with the amino-acid sequence encoded by the bPC gene occurs in the human exon 5 and the second frameshift mutation in exon 6. Thus, the nt sequence analysis of the human genomic region has revealed the presence of mutations that have rendered it unable to produce a full-length protein homologous to bPC and, therefore, we refer to this gene as a human prochymosin pseudogene (hPC psi). Blot-hybridization analysis of human genomic DNA indicates that hPC psi is a single gene in the human genome.

13.
Evaluation of Gene Structure Prediction Programs   Total citations: 2 (self-citations: 0; citations by others: 2)
We evaluate a number of computer programs designed to predict the structure of protein coding genes in genomic DNA sequences. Computational gene identification is set to play an increasingly important role in the development of the genome projects, as emphasis turns from mapping to large-scale sequencing. The evaluation presented here serves both to assess the current status of the problem and to identify the most promising approaches to ensure further progress. The programs analyzed were uniformly tested on a large set of vertebrate sequences with simple gene structure, and several measures of predictive accuracy were computed at the nucleotide, exon, and protein product levels. The results indicated that the predictive accuracy of the programs analyzed was lower than originally found. The accuracy was even lower when considering only those sequences that had recently been entered and that did not show any similarity to previously entered sequences. This indicates that the programs are overly dependent on the particularities of the examples they learn from. For most of the programs, accuracy in this test set ranged from 0.60 to 0.70 as measured by the Correlation Coefficient (where 1.0 corresponds to a perfect prediction and 0.0 is the value expected for a random prediction), and the average percentage of exons exactly identified was less than 50%. Only those programs including protein sequence database searches showed substantially greater accuracy. The accuracy of the programs was severely affected by relatively high rates of sequence errors. Since the set on which the programs were tested included only relatively short sequences with simple gene structure, the accuracy of the programs is likely to be even lower when used for large uncharacterized genomic sequences with complex structure. 
While in such cases, programs currently available may still be of great use in pinpointing the regions likely to contain exons, they are far from being powerful enough to elucidate its genomic structure completely.
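
The nucleotide-level measures mentioned in the abstract above can be computed from a per-base comparison of annotated and predicted coding status. The abstract does not give formulas, so this sketch assumes the usual gene-prediction conventions: sensitivity = TP/(TP+FN), specificity = TP/(TP+FP), and the correlation coefficient in its standard Matthews form. The function name is hypothetical.

```python
from math import sqrt
from typing import Sequence, Tuple


def nucleotide_accuracy(actual: Sequence[int],
                        predicted: Sequence[int]) -> Tuple[float, float, float]:
    """Return (sensitivity, specificity, correlation coefficient).

    Each sequence holds one flag per nucleotide: 1 = coding, 0 = non-coding.
    Undefined ratios are returned as NaN.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    tp = sum(1 for a, p in zip(actual, predicted) if a and p)
    tn = sum(1 for a, p in zip(actual, predicted) if not a and not p)
    fp = sum(1 for a, p in zip(actual, predicted) if not a and p)
    fn = sum(1 for a, p in zip(actual, predicted) if a and not p)

    nan = float("nan")
    sensitivity = tp / (tp + fn) if tp + fn else nan
    specificity = tp / (tp + fp) if tp + fp else nan  # assumed convention
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else nan
    return sensitivity, specificity, cc
```

A correlation coefficient of 1.0 requires every base to be classified correctly, while a value near 0.0 is what a random prediction would be expected to give, consistent with the scale described in the abstract.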

14.
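Cloning and functional study of FADD from the amphioxus Branchiostoma belcheri   Total citations: 1 (self-citations: 1; citations by others: 0)
Fas-associated death domain protein (FADD) is an adaptor protein in the death signal transduction pathway, and its structure and function are highly conserved in vertebrates. Here we report the first cloning of the cDNA and genomic DNA sequences of FADD (bbFADD) from the cephalochordate amphioxus Branchiostoma belcheri. The full-length bbFADD cDNA is 1239 bp and encodes 217 amino acids. Like vertebrate FADD, bbFADD contains an N-terminal death effector domain (DED) and a C-terminal death domain (DD). The phenylalanine at position 33 of the bbFADD amino acid sequence is relatively conserved in evolution; this phenylalanine plays an important role in FADD self-association. The coding region of mammalian FADD genes contains two exons, whereas the bbFADD gene contains three exons. Cephalochordates are generally regarded as an intermediate stage in the transition from invertebrates to vertebrates, but phylogenetic tree and homology analyses based on FADD amino acid sequences show that amphioxus is more closely related to sea urchin. Overexpression of bbFADD in HeLa cells induces apoptosis of these cells, suggesting that bbFADD may be able to function in the human apoptotic pathway and that the apoptotic system has been well conserved during evolution.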

15.
Interpolated Markov chains for eukaryotic promoter recognition.   Total citations: 9 (self-citations: 0; citations by others: 9)
MOTIVATION: We describe a new content-based approach for the detection of promoter regions of eukaryotic protein encoding genes. Our system is based on three interpolated Markov chains (IMCs) of different order which are trained on coding, non-coding and promoter sequences. It was recently shown that the interpolation of Markov chains leads to stable parameters and improves on the results in microbial gene finding (Salzberg et al., Nucleic Acids Res., 26, 544-548, 1998). Here, we present new methods for an automated estimation of optimal interpolation parameters and show how the IMCs can be applied to detect promoters in contiguous DNA sequences. Our interpolation approach can also be employed to obtain a reliable scoring function for human coding DNA regions, and the trained models can easily be incorporated in the general framework for gene recognition systems. RESULTS: A 5-fold cross-validation evaluation of our IMC approach on a representative sequence set yielded a mean correlation coefficient of 0.84 (promoter versus coding sequences) and 0.53 (promoter versus non-coding sequences). Applied to the task of eukaryotic promoter region identification in genomic DNA sequences, our classifier identifies 50% of the promoter regions in the sequences used in the most recent review and comparison by Fickett and Hatzigeorgiou (Genome Res., 7, 861-878, 1997), while having a false-positive rate of 1/849 bp.

16.
Simulation with indels was used to produce alignments where true site homologies in DNA sequences were known; the gaps from these datasets were removed and the sequences were then aligned to produce hypothesized alignments. Both alignments were then analyzed under three widely used methods of treating gaps during tree reconstruction under the maximum parsimony principle. With the true alignments, for many cases (82%), there was no difference in topological accuracy for the different methods of gap coding. However, in cases where a difference was present, coding gaps as a fifth state character or as separate presence/absence characters outperformed treating gaps as unknown/missing data nearly 90% of the time. For the hypothesized alignments, on average, all gap treatment approaches performed equally well. Data sets with higher sequence divergence and more pectinate tree shapes with variable branch lengths are more affected by gap coding than datasets associated with shallower non-pectinate tree shapes.

17.
Frenkel S, Kirzhner V, Korol A. PLoS ONE 2012, 7(2):e32076
Genomes of higher eukaryotes are mosaics of segments with various structural, functional, and evolutionary properties. The availability of whole-genome sequences allows the investigation of their structure as "texts" using different statistical and computational methods. One such method, referred to as Compositional Spectra (CS) analysis, is based on scoring the occurrences of fixed-length oligonucleotides (k-mers) in the target DNA sequence. CS analysis allows generating species- or region-specific characteristics of the genome, regardless of their length and the presence of coding DNA. In this study, we consider the heterogeneity of vertebrate genomes as a joint effect of regional variation in sequence organization superimposed on the differences in nucleotide composition. We estimated compositional and organizational heterogeneity of genome and chromosome sequences separately and found that both heterogeneity types vary widely among genomes as well as among chromosomes in all investigated taxonomic groups. The high correspondence of heterogeneity scores obtained on three genome fractions, coding, repetitive, and the remaining part of the noncoding DNA (the genome dark matter, GDM), allows the assumption that CS-heterogeneity may have functional relevance to genome regulation. Of special interest for such interpretation is the fact that natural GDM sequences display the highest deviation from the corresponding reshuffled sequences.
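
The k-mer scoring that underlies the Compositional Spectra analysis in the abstract above can be illustrated with a plain frequency count. This is a simplified sketch, not the CS method itself: it assumes exact matching of every overlapping k-mer and skips windows containing characters other than A, C, G, T. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict


def kmer_spectrum(seq: str, k: int) -> Dict[str, float]:
    """Relative frequency of each overlapping k-mer observed in `seq`.

    Windows containing anything other than A, C, G, T are skipped.
    Returns an empty dict if no valid k-mer is found.
    """
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid
    )
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()} if total else {}
```

Spectra computed this way for different segments of a genome could then be compared with any vector distance to obtain a heterogeneity score; the choice of distance is left open here because the abstract does not state it.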

18.
Although the first sponge genome project has already started releasing completed sequences, only a very small number of annotated sponge genomic sequences have so far been published. In addition, no gene-prediction software optimised for sponges is available yet. In the present paper, we present the performance of Arabidopsis-optimised Genscan as tested on sponge genes. All genes whose genomic and complete CDS sequences are deposited in the NCBI nucleotide database were retrieved and used as the test set. The 18 test genes are composed of 114 coding exons. Sensitivity and specificity, respectively, were 83.3% and 79.2% for all exons, 88.5% and 80.2% for internal exons, 93.8% and 85.7% for donor sites, 89.6% and 78.9% for acceptor sites, 94.4% and 85% for initiation sites, and 72.2% and 81.3% for termination sites. The results are compared with prediction results obtained with Genscan for vertebrates and GeneMark.hmm ES-3.0 for Arabidopsis. The surprising finding is that although the sequences come from animals, the best results (more than 80% accuracy in predicting complete exons) were obtained by Genscan optimised for the plant A. thaliana. Although the sample is small, the results lead to the conclusion that Genscan for Arabidopsis is a valuable tool for predicting coding sequences in sponges and could be of great help in annotating sponge genes. Dedicated to Prof. Dr.Sc. Vera Gamulin, a remarkable lady who passed away too soon and suddenly.

19.
A recombinant phage, SpC3, containing a 17 kb genomic DNA insert representing approximately 60% of the 3' portion of the sheep collagen alpha 2 gene, was evaluated by electron microscopic R loop analysis. A minimum of 17 intervening sequences (introns) and 18 alpha 2 coding sequences (exons) were mapped. With the exception of the 850 base pair exon located at the extreme 3' end of the insert, all exons contained 250 base pairs or less. The total length of all the exons in SpC3 was 3,014 base pairs. The length distribution of the 17 introns ranged from 300 to 1600 base pairs; together, all of the introns comprised 14,070 base pairs of SpC3 DNA. Thus, the DNA region required for coding the interspersed 3 kb of alpha 2 collagen genetic information was 5.6 fold longer than the corresponding alpha 2 mRNA coding sequences.

20.
Y Takayama, C Wada, H Kawauchi, M Ono. Gene 1989, 80(1):65-73
Two MCH genes coding for melanin-concentrating hormone (MCH) were isolated from a chum salmon liver DNA library and characterized. They were shown to be intronless genes with 0.63-kb exons, each of which consisted of an approximately 80-bp 5'-untranslated region, a region coding for a 132-amino-acid (aa) MCH precursor protein, and an approximately 160-bp 3'-untranslated region. About 20 bp upstream from the putative cap site, sequences were found corresponding to the TATA box. The two genes were 86% identical at the nucleotide sequence level. Sequences homologous to the chum salmon MCH genes were present in the genomes of other fish such as catfish, carp and Chinese grass carp, whereas no highly homologous sequence could be detected in other vertebrate genomes.
