Similar literature
 Found 20 similar articles; search time: 31 ms
1.
Zhang CT, Wang J 《Nucleic acids research》2000,28(14):2804-2814
The Z curve is a three-dimensional space curve constituting the unique representation of a given DNA sequence in the sense that each can be uniquely reconstructed from the other. Based on the Z curve, a new protein coding gene-finding algorithm specific for the yeast genome at better than 95% accuracy has been proposed. Six cross-validation tests were performed to confirm the above accuracy. Using the new algorithm, the number of protein coding genes in the yeast genome is re-estimated. The estimate is based on the assumption that the unknown genes have similar statistical properties to the known genes. It is found that the number of protein coding genes in the 16 yeast chromosomes is ≤5645, significantly smaller than the 5800–6000 which is widely accepted, and much larger than the 4800 estimated by another group recently. The mitochondrial genes were not included in the above estimate. A codingness index called the YZ score (YZ ∈ [0,1]) is proposed to recognize protein coding genes in the yeast genome. Among the ORFs annotated in the MIPS (Munich Information Centre for Protein Sequences) database, those recognized as non-coding by the present algorithm are listed in this paper in detail. The criterion for a coding or non-coding ORF is simply decided by YZ > 0.5 or YZ < 0.5, respectively. The YZ scores for all the ORFs annotated in the MIPS database have been calculated and are available on request by sending email to the corresponding author.
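The one-to-one mapping between a sequence and its Z curve can be illustrated with a minimal Python sketch. The abstract does not spell out the coordinates, so the definition used here is an assumption taken from the commonly cited form of the Z curve: x counts purines minus pyrimidines, y counts amino minus keto bases, and z counts weak minus strong hydrogen-bond bases, each accumulated along the sequence. The function names are hypothetical, and the sketch covers only the curve itself, not the gene-finding algorithm or the YZ score.

```python
# Per-base increments (dx, dy, dz); assumed standard Z-curve definition.
# x: purine (A, G) vs pyrimidine (C, T)
# y: amino (A, C) vs keto (G, T)
# z: weak H-bond (A, T) vs strong H-bond (G, C)
STEP = {
    "A": (1, 1, 1),
    "C": (-1, 1, -1),
    "G": (1, -1, -1),
    "T": (-1, -1, 1),
}
BASE_OF_STEP = {step: base for base, step in STEP.items()}


def z_curve(seq):
    """Return the cumulative (x, y, z) point after each base of seq."""
    points = []
    x = y = z = 0
    for base in seq.upper():
        dx, dy, dz = STEP[base]
        x, y, z = x + dx, y + dy, z + dz
        points.append((x, y, z))
    return points


def sequence_from_z_curve(points):
    """Invert z_curve: each step between consecutive points names one base."""
    previous = (0, 0, 0)
    bases = []
    for point in points:
        step = tuple(c - p for c, p in zip(point, previous))
        bases.append(BASE_OF_STEP[step])
        previous = point
    return "".join(bases)
```

Because the four bases map to four distinct steps, `sequence_from_z_curve(z_curve(s))` returns `s` for any sequence over A, C, G and T, which is the sense in which the representation is unique.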

2.
Long open reading frames (ORFs) in antisense DNA strands have been reported in the literature as rare events. However, an extensive analysis of the GenBank database revealed that a substantial number of genes from several species contain an in-phase ORF in the antisense strand that entirely overlaps the coding sequence of the sense strand, or even extends beyond it. The findings described in this paper show that this is a frequent, non-random phenomenon, which depends primarily on codon usage and, to a lesser extent, on gene size and GC content. Examination of the sequence database for several prokaryotic and eukaryotic organisms demonstrates that coding sequences with in-phase, 100% overlapping antisense ORFs are present in every genome studied so far.

3.
Bacterial DNA sequences in the GenBank database were divided into coding and noncoding regions and examined for the base-trimer distribution in every triplet frame on the sense and antisense strands. The results revealed that for the noncoding region, both strands have very similar base-trimer distributions and have no frame specificity; that is, DNA is symmetric in the noncoding region. For the coding region, on the other hand, the symmetry is broken only in the triplet framework, and we found a special triplet-frame-specific symmetry which appears when the two complementary strands of the coding region are read from their 5′ ends. In addition, the following frame specificity was also observed in the distribution of stop codons on the antisense strand of the coding region. When the antisense sequences of the open reading frames (ORFs) in the database are read in the three reading frames, the same reading frame as the corresponding ORF contains a significantly larger number of long open frames without stop codons (i.e., nonstop frames [NSFs]) than expected, while the number of NSFs in the other two reading frames is similar to the expected number. That is, NSFs as well as ORFs are maintained in a frame-specific manner, and in this sense, DNA becomes symmetrical even in the coding region. These two kinds of frame-specific symmetries indicate that only an ORF and its complementary triplets are specifically recognized and maintained in DNA. We suppose that the antisense strands as well as the sense strands in the coding region may be transcribed, thereby producing various kinds of proteins corresponding to NSFs, though their amount may not be large. The presence of these proteins should have some benefits for living organisms, and therefore we propose that these proteins are upcoming enzymes having novel functions. Correspondence to: I. Urabe

4.
After 50 years of analysing Neurospora crassa genes one by one, large-scale sequence analysis has increased the number of accessible genes tremendously in the last few years. Being the only filamentous fungus for which a comprehensive genomic sequence database is publicly accessible, N. crassa serves as the model for this important group of microorganisms. The MIPS N. crassa database currently holds more than 16 Mb of non-redundant data from chromosomes II and V analysed by the German Neurospora Genome Project. This represents more than one-third of the genome. Open reading frames (ORFs) have been extracted from the sequence and the deduced proteins have been annotated extensively. They are classified according to matches in sequence databases and attributed to functional categories according to their relatives. While 41% of analysed proteins are related to known proteins, 30% are hypothetical proteins with no match to a database entry. The entire genome is expected to comprise some 13000 protein coding genes, more than twice as many as found in yeasts, and reflects the high potential of filamentous fungi to cope with various environmental conditions.

5.
Peptide mass fingerprinting (PMF) has become one of the most widely used methods for rapid identification of proteins in proteomics research. Many peaks, however, remain unassigned after PMF analysis, partly because of post-translational modification and the limited scope of protein sequences. Almost all PMF tools employ only known or predicted protein sequences and do not include open reading frames (ORFs) in the genome, which eliminates the chance of finding novel functional peptides. Unlike most tools that search protein sequences from known coding sequences, the tool we developed uses a database for theoretical small ORFs (tsORFs) and a PMF application using a tsORFs database (tsORFdb). The tsORFdb is a database for ORFeome that encompasses all potential tsORFs derived from whole genome sequences as well as the predicted ones. The massProphet system tries to extend the search scope to include the ORFeome using the tsORFdb. The tsORFdb and massProphet should be useful for proteomics research to give information about unknown small ORFs as well as predicted and registered proteins.

6.
7.
《Gene》1997,194(1):143-155
In recent studies it has been suggested that long reading frames on the antisense strand of open reading frames (ORFs) are more frequent than expected. The vertebrate DNA database was searched for long (greater than 900 bp) antisense non-stop reading frames (aNRFs) that overlap known coding regions. The sequences obtained were predominantly positioned in DNA with a high usage of G or C in the third codon position of the sense ORF. The major class of sequences revealed by the search was that of the heat-shock protein 70 kDa (Hsp70) family. A long Hsp70 aNRF was found in many Hsp70 sequences and occurred in species as diverse as fish, flies, fungi and bacteria. The role of codon usage bias was analysed both in the specific case of the Hsp70 genes and in a general species-wide context. The data obtained showed that even the very long aNRFs present in the Hsp70 family could be explained by codon usage bias on the sense strand. Codon usage bias is determined by GC content at the third codon position of the sense ORF and, in some species, by a high expression level of the gene in question. Such an explanation for the occurrence of long aNRFs cannot exclude that some aNRFs are transcribed and translated.
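The link between third-position G/C usage and long antisense non-stop frames can be made concrete with a short Python sketch. The antisense stop codons TAA, TAG and TGA are the reverse complements of the sense codons TTA, CTA and TCA, all of which end in A, so a sense ORF whose codons all end in G or C cannot produce a stop in the codon-aligned antisense frame. The code below is an illustration under the assumptions of the standard genetic code and an ORF whose length is a multiple of three; the function names are hypothetical and this is not the search procedure used in the study.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code assumed


def reverse_complement(seq):
    """Return the antisense strand read 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_nonstop_run(seq, frame):
    """Length in bp of the longest stop-free codon run in frame 0, 1 or 2."""
    best = current = 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            current = 0
        else:
            current += 3
            best = max(best, current)
    return best


def antisense_nonstop_frames(orf):
    """Longest non-stop run in each antisense frame.

    Frame 0 shares codon boundaries with the sense ORF when the ORF
    length is a multiple of three.
    """
    antisense = reverse_complement(orf)
    return {frame: longest_nonstop_run(antisense, frame) for frame in range(3)}


def gc3(orf):
    """Fraction of sense codons with G or C in the third position."""
    third_positions = orf.upper()[2::3]
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

Comparing `gc3(orf)` with `antisense_nonstop_frames(orf)[0]` across a set of coding sequences is one simple way to look at the codon-usage explanation the abstract describes.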

8.
9.
10.
MOTIVATION: Overlapping gene coding sequences (CDSs) are particularly common in viruses but also occur in more complex genomes. Detecting such genes with conventional gene-finding algorithms can be difficult for several reasons. If an overlapping CDS is on the same read-strand as a known CDS, then there may not be a distinct promoter or mRNA. Furthermore, the constraints imposed by double-coding can result in atypical codon biases. However, these same constraints lead to particular mutation patterns that may be detectable in sequence alignments. RESULTS: In this paper, we investigate several statistics for detecting double-coding sequences with pairwise alignments, including a new maximum-likelihood method. We also develop a model for double-coding sequence evolution. Using simulated sequences generated with the model, we characterize the distribution of each statistic as a function of sequence composition, length, divergence time and double-coding frame. Using these results, we develop several algorithms for detecting overlapping CDSs. The algorithms were tested on known overlapping CDSs and other overlapping open reading frames (ORFs) in the hepatitis B virus (HBV), Escherichia coli and Salmonella typhimurium genomes. The algorithms should prove useful for detecting novel overlapping genes, especially short coding ORFs in viruses. AVAILABILITY: Programs may be obtained from the authors. SUPPLEMENTARY INFORMATION: http://biochem.otago.ac.nz/double.html.

11.
Insertional mutagenesis is a powerful tool for generating knockout mutations that facilitate associating biological functions with as yet uncharacterized open reading frames (ORFs) identified by genomic sequencing or represented in EST databases. We have generated a collection of Dissociation (Ds) transposon lines with insertions on all 5 Arabidopsis chromosomes. Here we report the insertion sites in 260 independent single-transposon lines, derived from four different Ds donor sites. We amplified and determined the genomic sequence flanking each transposon, then mapped its insertion site by identity of the flanking sequences to the corresponding sequence in the Arabidopsis genome database. This constitutes the largest collection of sequence-mapped Ds insertion sites unbiased by selection against the donor site. Insertion site clusters have been identified around three of the four donor sites on chromosomes 1 and 5, as well as near the nucleolus organizers on chromosomes 2 and 4. The distribution of insertions between ORFs and intergenic sequences is roughly proportional to the ratio of genic to intergenic sequence. Within ORFs, insertions cluster near the translational start codon, although we have not detected insertion site selectivity at the nucleotide sequence level. A searchable database of insertion site sequences for the 260 transposon insertion sites is available at http://sgio2.biotec.psu.edu/sr. This and other collections of Arabidopsis lines with sequence-identified transposon insertion sites are a valuable genetic resource for functional genomics studies because the transposon location is precisely known, the transposon can be remobilized to generate revertants, and the Ds insertion can be used to initiate further local mutagenesis.

12.
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in dsDNA viruses. Antisense overlaps in which one of the ORFs was encoded in the same frame on the opposite strand (−0) tend to be longer. Next, we develop a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.
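Counting overlaps and their lengths from annotated ORF coordinates can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the `ORF` record, the 1-based inclusive coordinates, and in particular the shift convention, which measures how far the downstream ORF's 5′ end is displaced from the upstream one, modulo three, for ORFs on the same strand. The abstract does not define its +2 or −0 labels precisely, so opposite-strand overlaps are only tagged "antisense" here rather than assigned a phase.

```python
from itertools import combinations
from typing import NamedTuple


class ORF(NamedTuple):
    name: str
    start: int   # 1-based, inclusive, on the reference (plus) strand
    end: int     # inclusive
    strand: int  # +1 or -1


def overlap_length(a, b):
    """Number of nucleotides shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start) + 1)


def _strand_position(orf):
    # Coordinate of the 5' end, measured along the ORF's own reading direction.
    return orf.start if orf.strand == 1 else -orf.end


def same_strand_shift(a, b):
    """Frame offset (0, 1 or 2) of the downstream ORF relative to the upstream one."""
    upstream, downstream = sorted((a, b), key=_strand_position)
    return (_strand_position(downstream) - _strand_position(upstream)) % 3


def overlapping_pairs(orfs):
    """List (name_a, name_b, overlap_bp, label) for every overlapping pair."""
    pairs = []
    for a, b in combinations(orfs, 2):
        shared = overlap_length(a, b)
        if shared == 0:
            continue
        if a.strand == b.strand:
            label = f"+{same_strand_shift(a, b)}"
        else:
            label = "antisense"  # phase left undefined in this sketch
        pairs.append((a.name, b.name, shared, label))
    return pairs
```

A tally of the labels returned by `overlapping_pairs` per genome gives the kind of number, length and frameshift summary the abstract describes, under the stated convention.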

13.
14.
Summary: The size distribution of 411 randomly selected mammalian exons was investigated. This distribution was found to be unimodal with a frequency maximum of 120 bp. Detailed analysis of the distribution demonstrated that larger exons (>150 bp) have a high goodness of fit to the size distribution of open reading frames (ORFs) in a random sequence, i.e., $(61/64)^t$, in which $t$ is the number of triplets. Based on this observation, the general character of the total exon size distribution suggested that this could be defined by a theoretical distribution by superimposing a sigmoid function on the ORF generating function, i.e., $(61/64)^t \times f_s(t) \times E$, in which $f_s(t)$ is a sigmoid function and $E$ is a constant. We tested this distribution for fitness to the exon distribution using two sigmoid functions, $f_s(t)=(t)$ and $f_s(t)=Be^{kt}/(1+Be^{kt})$. In both cases a very high goodness of fit was attained. It is concluded that exons have been generated from ORFs in random sequences, that ORFs larger than 150 bp have been selected, irrespective of size, as exons, and that a lower size limit exists below which the probability of an ORF being selected as an exon is very low. These results provide evidence at the molecular level to support the ideas that (1) larger exons have been selected from random ORFs without primary correlation to structural or functional properties at the protein level, (2) there exists a restriction on smaller ORFs to be selected as exons, and (3) the interrupted coding sequences found in eukaryotes represent the ancient form of gene organization that existed prior to the divergence of prokaryotes and eukaryotes.
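The random-sequence term in this model follows from a simple count: 61 of the 64 codons are not stop codons, so with equal, independent base frequencies (an assumption of this sketch) a run of t triplets stays open with probability (61/64) raised to the power t. The Python sketch below evaluates that term together with the logistic sigmoid named in the abstract. The parameter values for B, k and E are not given in the abstract, so any values passed in are hypothetical, as are the function names.

```python
import math

P_SENSE = 61 / 64  # 61 of the 64 codons are not stop codons


def p_open(t):
    """Probability that t consecutive random triplets contain no stop codon."""
    return P_SENSE ** t


def logistic(t, B, k):
    """Sigmoid f_s(t) = B*exp(k*t) / (1 + B*exp(k*t))."""
    e = B * math.exp(k * t)
    return e / (1 + e)


def exon_size_model(t, B, k, E):
    """Model value E * (61/64)**t * f_s(t) for an exon of t triplets."""
    return E * p_open(t) * logistic(t, B, k)


def half_open_length():
    """Number of triplets at which p_open falls to one half."""
    return math.log(0.5) / math.log(P_SENSE)
```

Under the uniform-base assumption, `half_open_length()` comes out between 14 and 15 triplets, which shows how quickly long open frames become rare in random sequence and why the sigmoid factor only matters at the short end of the distribution.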

15.
Expressed sequence tag (EST) databases contain a significant number (5-20%) of reversed, antisense, cDNA sequences that can be recognized by the label "reversed clone: similarity on wrong strand" in the annotations to the sequence. Despite this high number of altered sequences, no attempt has been made to explain the alteration in molecular terms, or to evaluate their effect on the quality of the information curated in EST databases. In this paper we try to explain how these altered sequences originate, and propose a plausible mechanism: a "double priming" of the first strand oligo-dT primer at both ends of nascent cDNAs. In this way, a symmetrical cDNA intermediate is generated, an intermediate that can be cloned after partial digestion with the restriction enzyme used for the directional cloning. Furthermore, when "secondary" priming takes place inside the cDNA, the chain synthesized is prone to be truncated prematurely, with the subsequent loss of upstream information. One of the most subtle effects of this cloning alteration is the generation of virtual open reading frames (ORFs) in sequences with no homologues available for comparison. Nevertheless, according to our model and our data, the "double priming mechanism" does not shift the affected ORF, so antisense sequences should be treated as normal ones after a simple transformation into their inverse-complementary forms.

16.
Complete DNA sequence of yeast chromosome II.   Total citations: 20, self-citations: 2, citations by others: 18
In the framework of the EU genome-sequencing programmes, the complete DNA sequence of the yeast Saccharomyces cerevisiae chromosome II (807 188 bp) has been determined. At present, this is the largest eukaryotic chromosome entirely sequenced. A total of 410 open reading frames (ORFs) were identified, covering 72% of the sequence. Similarity searches revealed that 124 ORFs (30%) correspond to genes of known function, 51 ORFs (12.5%) appear to be homologues of genes whose functions are known, 52 others (12.5%) have homologues the functions of which are not well defined and another 33 of the novel putative genes (8%) exhibit a degree of similarity which is insufficient to confidently assign function. Of the genes on chromosome II, 37-45% are thus of unpredicted function. Among the novel putative genes, we found several that are related to genes that perform differentiated functions in multicellular organisms or are involved in malignancy. In addition to a compact arrangement of potential protein coding sequences, the analysis of this chromosome confirmed general chromosome patterns but also revealed particular novel features of chromosomal organization. Alternating regional variations in average base composition correlate with variations in local gene density along chromosome II, as observed in chromosomes XI and III. We propose that functional ARS elements are preferably located in the AT-rich regions that have a spacing of approximately 110 kb. Similarly, the 13 tRNA genes and the three Ty elements of chromosome II are found in AT-rich regions. In chromosome II, the distribution of coding sequences between the two strands is biased, with a ratio of 1.3:1. An interesting aspect regarding the evolution of the eukaryotic genome is the finding that chromosome II has a high degree of internal genetic redundancy, amounting to 16% of the coding capacity.

17.
18.
Insertional mutagenesis is a powerful tool for generating knockout mutations that facilitate associating biological functions with as yet uncharacterized open reading frames (ORFs) identified by genomic sequencing or represented in EST databases. We have generated a collection of Dissociation (Ds) transposon lines with insertions on all 5 Arabidopsis chromosomes. Here we report the insertion sites in 260 independent single-transposon lines, derived from four different Ds donor sites. We amplified and determined the genomic sequence flanking each transposon, then mapped its insertion site by identity of the flanking sequences to the corresponding sequence in the Arabidopsis genome database. This constitutes the largest collection of sequence-mapped Ds insertion sites unbiased by selection against the donor site. Insertion site clusters have been identified around three of the four donor sites on chromosomes 1 and 5, as well as near the nucleolus organizers on chromosomes 2 and 4. The distribution of insertions between ORFs and intergenic sequences is roughly proportional to the ratio of genic to intergenic sequence. Within ORFs, insertions cluster near the translational start codon, although we have not detected insertion site selectivity at the nucleotide sequence level. A searchable database of insertion site sequences for the 260 transposon insertion sites is available at http://sgio2.biotec.psu.edu/sr. This and other collections of Arabidopsis lines with sequence-identified transposon insertion sites are a valuable genetic resource for functional genomics studies because the transposon location is precisely known, the transposon can be remobilized to generate revertants, and the Ds insertion can be used to initiate further local mutagenesis.

19.
Summary: The strategy and implementation of a unique system for engineering bacteriophage-resistant starter cultures of Lactococcus lactis employing antisense RNA is reviewed. As a necessary prerequisite for developing this system, we have cloned and sequenced a number of bacteriophage genes coding for minor and major structural proteins. In addition, we have also identified a series of genes whose function(s) is not known but whose sequences appear to be conserved in a vast number of isolates. One of these latter sequences, designated gp51C, codes for a 51-kDa protein which is extremely charged and shares some homology with yeast translation initiation factor. Resistance to a broad class of isometric bacteriophages has been achieved by expression of an antisense RNA targeted against, for example, gp51C. In the best case, expression of the antisense gp51C RNA results in a greater than 99% reduction in the total number of plaque forming units. Additional antisense RNA constructs directed against other bacteriophage genes, including the major capsid protein, also appear effective, inhibiting infection by 40–55%, suggesting that this approach may prove useful for engineering a set of truly isogenic strains to be used in a starter culture rotation plan.

20.