Similar articles
20 similar articles found.
1.
Origin and properties of non-coding ORFs in the yeast genome.   Total citations: 4; self-citations: 0; citations by others: 4
In a recent paper we estimated the total number of protein-coding open reading frames (ORFs) in the Saccharomyces cerevisiae genome, based on their properties, at about 4800. This number is much smaller than the widely accepted 5800-6000. In this paper we analyse differences between the set of ORFs with known phenotypes annotated in the Munich Information Centre for Protein Sequences (MIPS) database and ORFs for which the probability of coding, as calculated by us, is very low. We have found that many of the latter ORFs have properties of antisense sequences of coding ORFs, which suggests that they could have been generated by duplication of coding sequences. Since coding sequences generate ORFs inside themselves, with especially high frequency in the antisense sequences, we have looked for homology between known proteins and hypothetical polypeptides generated by the ORFs under consideration in all six phases. For many ORFs we have found paralogues and orthologues in phases different from the phase that had been assumed to be coding in the MIPS database.
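
The "six phases" mentioned in this abstract are the three reading frames on each strand. As a rough illustration only, the sketch below enumerates ATG-to-stop stretches in all six phases of a DNA string. The function names, the `min_codons` parameter, and the use of the standard stop codons are assumptions made for this example; this is not the authors' procedure.

```python
from typing import Iterator, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}          # standard genetic code (assumed)
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the antisense strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_six_phases(seq: str, min_codons: int = 2) -> Iterator[Tuple[str, int, str]]:
    """Yield (strand, phase, orf) for each ATG...stop stretch in all six phases.

    `min_codons` is a hypothetical length filter that counts the stop codon.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for phase in range(3):
            start = None
            # Walk codon by codon within this phase.
            for i in range(phase, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = s[start:i + 3]
                    if len(orf) // 3 >= min_codons:
                        yield strand, phase, orf
                    start = None


if __name__ == "__main__":
    # An ORF that only appears when the antisense strand is read.
    for hit in orfs_in_six_phases("TCAGGGCAT"):
        print(hit)
```

Translating each hit and searching it against a protein database would be the next step for a homology comparison like the one the abstract describes; that step is left out here.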

2.
In this paper, we re-annotated the genome of Pyrobaculum aerophilum str. IM2, particularly its hypothetical ORFs. The annotation process has three parts. Firstly and most importantly, 23 new genes that were missed in the original annotation were found by combining similarity search with ab initio gene-finding approaches. Among these new genes, five have significant similarity to genes of known function and the rest have significant similarity to hypothetical ORFs in other genomes. Secondly, the coding potential of the 1645 hypothetical ORFs was re-predicted using 33 Z curve variables combined with the Fisher linear discriminant method. With an accuracy of 99.68%, 25 originally annotated hypothetical ORFs were recognized as non-coding by our method. Thirdly, 80 hypothetical ORFs were assigned potential functions using similarity searches with the BLAST program. Re-annotation of the genome will benefit related research on this hyperthermophilic crenarchaeon. The re-annotation procedure could also be taken as a reference for other archaeal genomes. Details of the revised annotation are freely available at http://cobi.uestc.edu.cn/resource/paero/

4.
The 2694 ORFs originally annotated as potential genes in the genome of Aeropyrum pernix can be categorized into three clusters (A, B, C), according to their nucleotide composition at three codon positions. Coding potential was found to be responsible for the phenomenon of three clusters in a 9-dimensional space derived from the nucleotide composition of ORFs: ORFs assigned to cluster A are coding ones, while those assigned to clusters B and C are non-coding ORFs. A "codingness" index called the AZ score is defined based on a clustering method used to recognize protein-coding genes in the A. pernix genome. The criterion for a coding or non-coding ORF is based on the AZ score. ORFs with AZ > 0 or AZ < 0 are coding or non-coding, respectively. Consequently, 620 out of 632 ORFs with putative functions based on the original annotation are contained in cluster A, which have positive AZ scores. In addition, all 29 ORFs encoding putative or conserved proteins newly added in RefSeq annotation also have positive AZ scores. Accordingly, the number of re-recognized protein-coding genes in the A. pernix genome is 1610, which is significantly less than 2694 in the original annotation and also much less than 1841 in the RefSeq annotation curated by NCBI staff. Annotation information of re-recognized genes and their AZ scores are available at: http://tubic.tju.edu.cn/Aper/.
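
One common way to turn the nucleotide composition at the three codon positions into nine numbers is the Z curve transform named in other abstracts in this list. The sketch below computes those nine components and applies a sign-based decision to a linear score. It is an assumption that the abstract's 9-dimensional space is built this way, and `linear_score`, its weights and its bias are purely hypothetical placeholders rather than the published AZ score.

```python
from collections import Counter
from typing import List, Sequence


def z_curve_9(orf: str) -> List[float]:
    """Nine Z curve components: (x, y, z) for each of the three codon positions.

    x = purine vs pyrimidine, y = amino vs keto, z = weak vs strong H-bond,
    each computed from base frequencies at one codon position.
    Assumes the ORF holds at least one full codon.
    """
    orf = orf.upper()
    components: List[float] = []
    for pos in range(3):
        bases = orf[pos::3]                      # every third base, starting at pos
        n = len(bases)
        counts = Counter(bases)
        a, c, g, t = (counts[b] / n for b in "ACGT")
        components += [
            (a + g) - (c + t),
            (a + c) - (g + t),
            (a + t) - (g + c),
        ]
    return components


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Hypothetical linear score; a positive value would be read as 'coding'."""
    return sum(w * f for w, f in zip(weights, features)) + bias


if __name__ == "__main__":
    feats = z_curve_9("ATGAAATAG")
    # Weights and bias here are made up purely to show the sign-based decision.
    score = linear_score(feats, [1.0] * 9, 0.0)
    print(feats, "coding" if score > 0 else "non-coding")
```

In practice the weights would be fitted on ORFs of known status, for example with a linear discriminant, before the sign of the score is used as a criterion.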

5.
We discuss the problems in defining the extent of the proteomes for completely sequenced eukaryotic organisms (i.e. the total number of protein-coding sequences), focusing on yeast, worm, fly and human. (i) Six years after completion of its genome sequence, the true size of the yeast proteome is still not defined. New small genes are still being discovered, and a large number of existing annotations are being called into question, with these questionable ORFs (qORFs) comprising up to one-fifth of the ‘current’ proteome. We discuss these in the context of an ideal genome-annotation strategy that considers the proteome as a rigorously defined subset of all possible coding sequences (‘the orfome’). (ii) Despite the greater apparent complexity of the fly (more cells, more complex physiology, longer lifespan), the nematode worm appears to have more genes. To explain this, we compare the annotated proteomes of worm and fly, relating to both genome-annotation and genome evolution issues. (iii) The unexpectedly small size of the gene complement estimated for the complete human genome provoked much public debate about the nature of biological complexity. However, in the first instance, for the human genome, the relationship between gene number and proteome size is far from simple. We survey the current estimates for the numbers of human genes and, from this, we estimate a range for the size of the human proteome. The determination of this is substantially hampered by the unknown extent of the cohort of pseudogenes (‘dead’ genes), in combination with the prevalence of alternative splicing. (Further information relating to yeast is available at http://genecensus.org/yeast/orfome)

6.
After 50 years of analysing Neurospora crassa genes one by one, large-scale sequence analysis has increased the number of accessible genes tremendously in the last few years. Being the only filamentous fungus for which a comprehensive genomic sequence database is publicly accessible, N. crassa serves as the model for this important group of microorganisms. The MIPS N. crassa database currently holds more than 16 Mb of non-redundant data from chromosomes II and V analysed by the German Neurospora Genome Project. This represents more than one-third of the genome. Open reading frames (ORFs) have been extracted from the sequence and the deduced proteins have been annotated extensively. They are classified according to matches in sequence databases and attributed to functional categories according to their relatives. While 41% of analysed proteins are related to known proteins, 30% are hypothetical proteins with no match to a database entry. The entire genome is expected to comprise some 13000 protein-coding genes, more than twice as many as found in yeasts, which reflects the high potential of filamentous fungi to cope with various environmental conditions.

7.
Gene recognition from questionable ORFs in bacterial and archaeal genomes   Total citations: 1; self-citations: 0; citations by others: 1
The ORFs of microbial genomes in annotation files are usually classified into two groups: the first corresponds to known genes, whereas the second includes 'putative', 'probable', 'conserved hypothetical', 'hypothetical', 'unknown' and 'predicted' ORFs, etc. Since the annotation is not 100% accurate, it is essential to confirm which ORFs of the latter group are coding and which are not. Starting from known genes in the former, this paper describes an improved Z curve method to recognize genes in the latter. Ten-fold cross-validation tests show that the average accuracy of the algorithm is greater than 99% for recognizing the known genes in 57 bacterial and archaeal genomes. The method is then applied to recognize genes of the latter group. The likely non-coding ORFs in each of the 57 bacterial or archaeal genomes studied here are recognized and listed at the website http://tubic.tju.edu.cn/ZCURVE_C_html/noncoding.html. The working mechanism of the algorithm is discussed in detail. A computer program, called ZCURVE_C, was written to calculate a coding score called the Z-curve score for ORFs in the above 57 bacterial and archaeal genomes. Coding/non-coding is simply determined by the criterion of Z-curve score > 0/Z-curve score < 0. A website has been set up to provide a service for calculating the Z-curve score. A user may submit the DNA sequence of an ORF to the server at http://tubic.tju.edu.cn/ZCURVE_C/Default.cgi, and the Z-curve score of the ORF is calculated and returned to the user immediately.

8.
Identifying all essential genomic components is critical for the assembly of minimal artificial life. In the genome-reduced bacterium Mycoplasma pneumoniae, we found that small ORFs (smORFs; < 100 residues), accounting for 10% of all ORFs, are the most frequently essential genomic components (53%), followed by conventional ORFs (49%). Essentiality of smORFs may be explained by their function as members of protein and/or DNA/RNA complexes. In larger proteins, essentiality applied to individual domains and not entire proteins, a notion we could confirm by expression of truncated domains. The fraction of essential non-coding RNAs (ncRNAs) non-overlapping with essential genes is 5% higher than that of non-transcribed regions (0.9%), pointing to the important functions of the former. We found that the minimal essential genome comprises 33% (269,410 bp) of the M. pneumoniae genome. Our data highlight an unexpected hidden layer of smORFs with essential functions, as well as non-coding regions, thus changing the focus when aiming to define the minimal essential genome.

9.
Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in dsDNA viruses. Antisense overlaps in which one of the ORFs was encoded in the same frame on the opposite strand (−0) tend to be longer. Next, we develop a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.
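
The overlap length and frame offset of two annotated ORFs can be derived from their coordinates alone. The sketch below shows one way to do so for ORFs on the same strand, under an assumed convention of 0-based, half-open coordinates. The `Orf` type and both function names are hypothetical, and how the resulting offset maps onto the abstract's "+2" or "−0" notation is not specified in the abstract, so that mapping is left open.

```python
from typing import NamedTuple, Optional


class Orf(NamedTuple):
    start: int   # 0-based, inclusive (assumed convention)
    end: int     # 0-based, exclusive
    strand: str  # "+" or "-"


def overlap_length(a: Orf, b: Orf) -> int:
    """Number of nucleotides shared by the two coordinate ranges."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def frame_offset(a: Orf, b: Orf) -> Optional[int]:
    """Offset (0, 1 or 2) of b's reading frame relative to a's.

    Only defined here for ORFs on the same strand; antisense pairs
    return None because they need a separate convention.
    """
    if a.strand != b.strand:
        return None
    if a.strand == "+":
        return (b.start - a.start) % 3
    # On the minus strand translation starts at the high coordinate.
    return (a.end - b.end) % 3


if __name__ == "__main__":
    first = Orf(0, 30, "+")
    second = Orf(20, 50, "+")
    print(overlap_length(first, second), frame_offset(first, second))
```

Counting these values over every pair of annotated ORFs in a genome record would give per-genome tallies of the kind the abstract summarises.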

10.
The published sequence of the Vibrio cholerae genome indicates that, in addition to the genes that encode proteins of known and unknown function, there are 1577 ORFs identified as conserved hypothetical or hypothetical gene candidates. Because the annotation is not 100% accurate, it is not known which of the 1577 ORFs are true protein-coding genes. In this paper, an algorithm based on the Z curve method, with sensitivity, specificity and accuracy greater than 98%, is used to solve this problem. Twenty-fold cross-validation tests show that the accuracy of the algorithm is 98.8%. A detailed discussion of the mechanism of the algorithm is also presented. It was found that 172 of the 1577 ORFs are unlikely to be protein-coding genes. The number of protein-coding genes in the V. cholerae genome was re-estimated and found to be approximately 3716. This result should be of use in microarray analysis of gene expression in the genome, because the cost of preparing chips may be somewhat decreased. A computer program was written to calculate a coding score called VCZ for gene identification in the genome. Coding/noncoding is simply determined by VCZ > 0/VCZ < 0. The program is freely available on request for academic use.

11.
12.
The complete nucleotide sequence of Saccharomyces cerevisiae chromosome X (745 442 bp) reveals a total of 379 open reading frames (ORFs), the coding region covering approximately 75% of the entire sequence. One hundred and eighteen ORFs (31%) correspond to genes previously identified in S. cerevisiae. All other ORFs represent novel putative yeast genes, whose function will have to be determined experimentally. However, 57 of the latter subset (another 15% of the total) encode proteins that show significant analogy to proteins of known function from yeast or other organisms. The remaining ORFs, exhibiting no significant similarity to any known sequence, amount to 54% of the total. General features of chromosome X are also reported, with emphasis on the nucleotide frequency distribution in the environment of the ATG and stop codons, the possible coding capacity of at least some of the small ORFs (<100 codons) and the significance of 46 non-canonical or unpaired nucleotides in the stems of some of the 24 tRNA genes recognized on this chromosome.

13.
A report on the EMBO/EMBL Symposium on The Non-Coding Genome, held in Heidelberg, Germany, 9-12 October, 2013. We share 98% coding genome similarity with mouse and have about the same number of protein-coding genes as worms, yet the differences in complexity are obvious. Where is this complexity encoded? A huge change in our understanding of genome evolution and regulation of gene expression arrived with the development of high-throughput sequencing technologies. It turns out that most of our genome is transcribed, but only a small percentage has coding information embedded. The rest of the genome, the non-coding genome, mistakenly labeled as ‘junk DNA’, is where evolutionary complexity resides. At The Non-Coding Genome meeting, several research studies delved deeper into the importance of the non-coding genome, identifying novel classes of non-coding RNAs (ncRNAs) and novel regulatory functions, expanding our knowledge about this new world, and opening more exciting questions to study and answer.

14.
The nucleotide sequences of three independent fragments (designated no. 3, 4, and 9; each 15–20 kb in size) of the genome of alkaliphilic Bacillus sp. C-125 cloned in a λ phage vector have been determined. Thirteen putative open reading frames (ORFs) were identified in sequenced fragment no. 3 and 11 ORFs were identified in no. 4. Twenty ORFs were also identified in fragment no. 9. All putative ORFs were analyzed in comparison with the BSORF database and non-redundant protein databases. The functions of 5 ORFs in fragment no. 3 and 3 ORFs in fragment no. 4 were suggested by their significant similarities to known proteins in the database. Among the 20 ORFs in fragment no. 9, the functions of 11 ORFs were similarly suggested. Most of the annotated ORFs in the DNA fragments of the genome of alkaliphilic Bacillus sp. C-125 were conserved in the Bacillus subtilis genome. The organization of ORFs in the genome of strain C-125 was found to differ from the order of genes in the chromosome of B. subtilis, although some gene clusters (ydh, yqi, yer, and yts) were conserved as operon units the same as in B. subtilis. Received: April 17, 1998 / Accepted: June 23, 1998

15.
The unannotated regions of the Escherichia coli genome DNA sequence from the EcoSeq6 database, totaling 1,278 'intergenic' sequences of the combined length of 359,279 basepairs, were analyzed using computer-assisted methods with the aim of identifying putative unknown genes. The proposed strategy for finding new genes includes two key elements: i) prediction of expressed open reading frames (ORFs) using the GeneMark method based on Markov chain models for coding and non-coding regions of Escherichia coli DNA, and ii) search for protein sequence similarities using programs based on the BLAST algorithm and programs for motif identification. A total of 354 putative expressed ORFs were predicted by GeneMark. Using the BLASTX and TBLASTN programs, it was shown that 208 ORFs located in the unannotated regions of the E. coli chromosome are significantly similar to other protein sequences. Identification of 182 ORFs as probable genes was supported by GeneMark and BLAST, comprising 51.4% of the GeneMark 'hits' and 87.5% of the BLAST 'hits'. 73 putative new genes, comprising 20.6% of the GeneMark predictions, belong to ancient conserved protein families that include both eubacterial and eukaryotic members. This value is close to the overall proportion of highly conserved sequences among eubacterial proteins, indicating that the majority of the putative expressed ORFs that are predicted by GeneMark, but have no significant BLAST hits, nevertheless are likely to be real genes. The majority of the putative genes identified by BLAST search have been described since the release of the EcoSeq6 database, but about 70 genes have not been detected so far. Among these new identifications are genes encoding proteins with a variety of predicted functions including dehydrogenases, kinases, several other metabolic enzymes, ATPases, rRNA methyltransferases, membrane proteins, and different types of regulatory proteins.
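
The percentages quoted in this abstract follow directly from the three counts it gives. The short check below recomputes them; the variable names and the `pct` helper are chosen for this example only.

```python
# Counts taken from the abstract above.
genemark_orfs = 354      # putative expressed ORFs predicted by GeneMark
blast_orfs = 208         # ORFs with significant BLASTX/TBLASTN similarity
supported_by_both = 182  # ORFs supported by GeneMark and BLAST
ancient_families = 73    # putative genes in ancient conserved families


def pct(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


print(pct(supported_by_both, genemark_orfs))  # share of GeneMark hits
print(pct(supported_by_both, blast_orfs))     # share of BLAST hits
print(pct(ancient_families, genemark_orfs))   # share of GeneMark predictions
```

The three printed values match the 51.4%, 87.5% and 20.6% stated in the abstract.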

16.
Gene structure conservation aids similarity based gene prediction   Total citations: 4; self-citations: 1; citations by others: 3
One of the primary tasks in deciphering the functional contents of a newly sequenced genome is the identification of its protein coding genes. Existing computational methods for gene prediction include ab initio methods which use the DNA sequence itself as the only source of information, comparative methods using multiple genomic sequences, and similarity based methods which employ the cDNA or protein sequences of related genes to aid the gene prediction. We present here an algorithm implemented in a computer program called Projector which combines comparative and similarity approaches. Projector employs similarity information at the genomic DNA level by directly using known genes annotated on one DNA sequence to predict the corresponding related genes on another DNA sequence. It therefore makes explicit use of the conservation of the exon–intron structure between two related genes in addition to the similarity of their encoded amino acid sequences. We evaluate the performance of Projector by comparing it with the program Genewise on a test set of 491 pairs of independently confirmed mouse and human genes. It is more accurate than Genewise for genes whose proteins are <80% identical, and is suitable for use in a combined gene prediction system where other methods identify well conserved and non-conserved genes, and pseudogenes.

17.
Biologists routinely use molecular markers to identify conservation units, to quantify genetic connectivity, to estimate population sizes, and to identify targets of selection. Many imperiled eagle populations require such efforts and would benefit from enhanced genomic resources. We sequenced, assembled, and annotated the first eagle genome using DNA from a male golden eagle (Aquila chrysaetos) captured in western North America. We constructed genomic libraries that were sequenced using Illumina technology and assembled the high-quality data to a depth of ∼40x coverage. The genome assembly includes 2,552 scaffolds >10 Kb and 415 scaffolds >1.2 Mb. We annotated 16,571 genes that are involved in myriad biological processes, including such disparate traits as beak formation and color vision. We also identified repetitive regions spanning 92 Mb (∼6% of the assembly), including LINEs, SINEs, LTR-RTs and DNA transposons. The mitochondrial genome encompasses 17,332 bp and is ∼91% identical to that of the Mountain Hawk-Eagle (Nisaetus nipalensis). Finally, the data reveal that several anonymous microsatellites commonly used for population studies are embedded within protein-coding genes and thus may not have evolved in a neutral fashion. Because the genome sequence includes ∼800,000 novel polymorphisms, markers can now be chosen based on their proximity to functional genes involved in migration, carnivory, and other biological processes.

18.
The nitrogen-fixing, symbiotic bacterium Sinorhizobium meliloti reduces molecular dinitrogen to ammonia in a specific symbiotic context, supporting the nitrogen requirements of various forage legumes, including alfalfa. Determining the DNA sequence of the S. meliloti genome was an important step in plant-microbe interaction research, adding to the considerable information already available about this bacterium by suggesting possible functions for many of the >6,200 annotated open reading frames (ORFs). However, the predictive power of bioinformatic analysis is limited, and putting the role of these genes into a biological context will require more definitive functional approaches. We present here a strategy for genetic analysis of S. meliloti on a genomic scale and report the successful implementation of the first step of this strategy by constructing a set of plasmids representing 100% of the 6,317 annotated ORFs cloned into a mobilizable plasmid by using efficient PCR and recombination protocols. By using integrase recombination to insert these ORFs into other plasmids in vitro or in vivo (B. L. House et al., Appl. Environ. Microbiol. 70:2806-2815, 2004), this ORFeome can be used to generate various specialized genetic materials for functional analysis of S. meliloti, such as operon fusions, mutants, and protein expression plasmids. The strategy can be generalized to many other genome projects, and the S. meliloti clones should be useful for investigators wanting an accessible source of cloned genes encoding specific enzymes.

19.
The 50 non-coding bases immediately internal to the telomeric repeats in the two 5′ ends of macronuclear DNA molecules of a group of hypotrichous ciliates are anomalous in composition, consisting of 61% purines and 39% pyrimidines, A>T (ratio of 44:32), and G>C (ratio of 17:7). These ratio imbalances violate parity rule 2, according to which A should equal T and G should equal C within a DNA strand and therefore pyrimidines should equal purines. The purine-rich and base ratio imbalances are in marked contrast to the rest of the non-coding parts of the molecules, which have the theoretically expected purine content of 50%, with A = T and G = C. The ORFs contain an average of 52% purines as a result of bias in codon usage. The 50 bases that flank the 5′ ends of macronuclear sequences in micronuclear DNA (12 cases) consist of ~50% purines. Thus, the 50 bases in the 5′ ends of macronuclear sequences in micronuclear DNA are islands of purine richness in which A>T and G>C. These islands may serve as signals for the excision of macronuclear molecules during macronuclear development. We have found no published reports of coding or non-coding native DNA with such anomalous base composition.
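
The composition figures in this abstract can be reproduced with a simple base count. The sketch below builds a toy 100-base string with the quoted A:T:G:C proportions (44:32:17:7), then computes its purine fraction and tests it against parity rule 2. The toy sequence, the function names and the `tolerance` threshold are assumptions for illustration; the abstract does not state a numerical cut-off.

```python
from collections import Counter
from typing import Dict


def base_composition(seq: str) -> Dict[str, float]:
    """Fraction of A, C, G and T in a single DNA strand."""
    counts = Counter(seq.upper())
    n = sum(counts[b] for b in "ACGT")
    return {b: counts[b] / n for b in "ACGT"}


def purine_fraction(comp: Dict[str, float]) -> float:
    """Purines are A and G."""
    return comp["A"] + comp["G"]


def violates_parity_rule_2(comp: Dict[str, float], tolerance: float = 0.05) -> bool:
    """True if A differs from T, or G from C, by more than `tolerance`.

    The tolerance is a hypothetical threshold chosen for this example.
    """
    return (abs(comp["A"] - comp["T"]) > tolerance
            or abs(comp["G"] - comp["C"]) > tolerance)


if __name__ == "__main__":
    # Toy strand with the proportions quoted in the abstract, not real data.
    toy = "A" * 44 + "T" * 32 + "G" * 17 + "C" * 7
    comp = base_composition(toy)
    print(purine_fraction(comp), violates_parity_rule_2(comp))
```

Sliding the same calculation along a real sequence in 50-base windows would be one way to look for purine-rich islands like those described above.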

20.
We have created a federated database for genome studies of Magnaporthe grisea, the causal agent of rice blast disease, by integrating end sequence data from BAC clones, genetic marker data and BAC contig assembly data. A library of 9216 BAC clones providing >25-fold coverage of the entire genome was end sequenced and fingerprinted by HindIII digestion. The Image/FPC software package was then used to generate an assembly of 188 contigs covering >95% of the genome. The database contains the results of this assembly integrated with hybridization data of genetic markers to the BAC library. AceDB was used for the core database engine and a MySQL relational database, populated with numerical representations of BAC clones within FPC contigs, was used to create appropriately scaled images. The database is being used to facilitate sequencing efforts. It also gives researchers mapping known genes or other sequences of interest rapid and easy access to the fundamental organization of the M. grisea genome. This database, MagnaportheDB, can be accessed on the web at http://www.cals.ncsu.edu/fungal_genomics/mgdatabase/int.htm.
