Similar articles
 20 similar articles found (search time: 31 ms)
1.
Ab initio gene identification in metagenomic sequences   Total citations: 1, self-citations: 0, citations by others: 1
We describe an algorithm for gene identification in DNA sequences derived from shotgun sequencing of microbial communities. Accurate ab initio gene prediction in a short nucleotide sequence of anonymous origin is hampered by uncertainty in model parameters. While several machine learning approaches could be proposed to bypass this difficulty, one effective method is to estimate parameters from dependencies, formed in evolution, between frequencies of oligonucleotides in protein-coding regions and genome nucleotide composition. The original version of the method was proposed in 1999 and has been used since for (i) reconstructing the codon frequency vector needed for gene finding in viral genomes and (ii) initializing parameters of self-training gene finding algorithms. With the advent of new prokaryotic genomes en masse, it became possible to enhance the original approach by using direct polynomial and logistic approximations of oligonucleotide frequencies, as well as by separating models for bacteria and archaea. These advances have increased the accuracy of model reconstruction and, subsequently, gene prediction. We describe the refined method and assess its accuracy on known prokaryotic genomes split into short sequences. We also show that, as a result of applying the new method, several thousand new genes could be added to existing annotations of several human and mouse gut metagenomes.

2.
3.
It is known that while the programs used to find genes in prokaryotic genomes reliably map protein-coding regions, they often fail in the exact determination of gene starts. This problem is further aggravated by sequencing errors, most notably insertions and deletions leading to frame-shifts. Therefore, the exact mapping of gene starts and identification of frame-shifts are important problems of the computer-assisted functional analysis of newly sequenced genomes. Here we review methods of gene recognition and describe a new algorithm for correction of gene starts and identification of frame-shifts in prokaryotic genomes. The algorithm is based on the comparison of nucleotide and protein sequences of homologous genes from related organisms, using the assumption that the rate of evolutionary changes in protein-coding regions is lower than that in non-coding regions. A dynamic programming algorithm is used to align protein sequences obtained by formal translation of genomic nucleotide sequences. The possibility of frame-shifts is taken into account. The algorithm was tested on several groups of related organisms: gamma-proteobacteria, the Bacillus/Clostridium group, and three Pyrococcus genomes. The testing demonstrated that, depending on the genome, 1-10 per cent of genes have incorrect starts or contain frame-shifts. The algorithm is implemented in the program package Orthologator-GeneCorrector.

4.
To analyze the mitogenome of the amphipod Onisimus nanseni, we amplified the complete mitogenome of O. nanseni using long-PCR and genome walking techniques. The mitogenome of O. nanseni is circular and contains all the typical mt genes (2 rRNAs, 22 tRNAs, and 13 protein-coding genes). It has two peculiar non-coding regions of 148 bp and 194 bp. The latter can be involved in replication and termination processes. The total length of the pooled protein-coding, rRNA, and tRNA genes is shorter than those of other crustaceans. In addition, the intergenic spacers of the O. nanseni mitogenome are considerably shorter in length than those of other crustaceans. Fourteen adjacent genes overlap, resulting in a compact mitogenomic structure. In the O. nanseni mitogenome, the AT composition is elevated, particularly in the control regions (78.9% AT), as has been demonstrated for two other amphipods. The tRNA order is highly rearranged compared to other arthropod mitogenomes, but the order of protein-coding genes and rRNAs is largely conserved. The gene cluster between the CO1 and CO3 genes is completely conserved among all amphipods compared. This provides insights into the evolution and gene structures of crustacean mitochondrial genomes, particularly in amphipods.  相似文献   

5.
A large number of complete microorganism genomes has been sequenced and submitted to the public database and then incorporated into our complete genome database, Genome Information Broker (GIB, http://gib.genes.nig.ac.jp/). However, when comparative genomics is carried out, researchers must be aware that there are protein-coding genes not confirmed by homology or motif search and that reliable protein-coding genes are missing. Therefore, we developed a protocol (Gene Trek in Prokaryote Space, GTPS) for finding possible protein-coding genes in bacterial genomes. GTPS assigns a degree of reliability to predicted protein-coding genes. We first systematically applied the protocol to the complete genomes of all 123 bacterial species and strains that were publicly available as of July 2003, and then to those of 183 species and strains available as of September 2004. We found a number of incorrect genes and several new ones in the genome data in question. We also found a way to estimate the total number of orthologous genes in the bacterial world.  相似文献   

6.
Coding information is the main source of heterogeneity (non-randomness) in the sequences of microbial genomes. The heterogeneity corresponds to a cluster structure in triplet distributions of relatively short genomic fragments (200-400 bp). We found a universal 7-cluster structure in microbial genomic sequences and explained its properties. We show that codon usage of bacterial genomes is a multi-linear function of their genomic G+C-content with high accuracy. Based on the analysis of 143 completely sequenced bacterial genomes available in Genbank in August 2004, we show that there are four "pure" types of the 7-cluster structure observed. All 143 cluster animated 3D-scatters are collected in a database which is made available on our web-site (http://www.ihes.fr/~zinovyev/7clusters). The findings can be readily introduced into software for gene prediction, sequence alignment or microbial genomes classification.  相似文献   
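As a rough illustration of the "triplet distributions of relatively short genomic fragments" mentioned in this abstract, the sketch below computes a 64-dimensional triplet frequency vector for one fragment read in a chosen frame. The function name `triplet_frequencies`, the `frame` parameter, and the choice to skip triplets containing non-ACGT characters are assumptions made for illustration; the abstract does not specify how the authors computed their vectors.

```python
from itertools import product

# All 64 nucleotide triplets in a fixed (lexicographic) order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]


def triplet_frequencies(fragment: str, frame: int = 0) -> list[float]:
    """Return the frequencies of non-overlapping triplets read from `frame`.

    Hypothetical helper: triplets containing characters other than A, C, G, T
    are skipped (an assumption, not something stated in the abstract).
    """
    fragment = fragment.upper()
    counts = dict.fromkeys(TRIPLETS, 0)
    total = 0
    # Step through the fragment three bases at a time, starting at `frame`.
    for i in range(frame, len(fragment) - 2, 3):
        triplet = fragment[i:i + 3]
        if triplet in counts:
            counts[triplet] += 1
            total += 1
    if total == 0:
        return [0.0] * len(TRIPLETS)
    return [counts[t] / total for t in TRIPLETS]
```

Vectors like these, computed for many 200-400 bp fragments, are the kind of input one could pass to a clustering or projection method to look for the cluster structure the abstract describes.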

7.
Xing XB  Li QR  Sun H  Fu X  Zhan F  Huang X  Li J  Chen CL  Shyr Y  Zeng R  Li YX  Xie L 《Genomics》2011,98(5):343-351
Identifying protein-coding genes in eukaryotic genomes remains a challenge in the post-genome era because of complex gene models. We applied a proteogenomics strategy to detect unannotated protein-coding regions in the mouse genome. High-accuracy tandem mass spectrometry (MS/MS) data from diverse mouse samples were generated in house on an LTQ-Orbitrap mass spectrometer. Two searchable diagnostic proteomic datasets were constructed, one with all possible encoding exon junctions and the other with all putative encoding exons, for the discovery of novel exon splicing events and novel uninterrupted protein-coding regions. Altogether 29,586 unique peptides were identified. When these were aligned back to the mouse genome, the translation of 4471 annotated genes was validated by the known peptides, and 172 genic events were defined in the mouse genome by the novel peptides. The approach in the current work can provide substantial evidence for the annotation of protein-coding genes in eukaryote genomes.

8.
Plasmodium parasites, the causal agents of malaria, result in more than 1 million deaths annually. Plasmodium are unicellular eukaryotes with small ∼23 Mb genomes encoding ∼5200 protein-coding genes. The protein-coding genes comprise about half of these genomes. Although evolutionary processes have a significant impact on malaria control, the selective pressures within Plasmodium genomes are poorly understood, particularly in the non-protein-coding portion of the genome. We use evolutionary methods to describe selective processes in both the coding and non-coding regions of these genomes. Based on genome alignments of seven Plasmodium species, we show that protein-coding, intergenic and intronic regions are all subject to purifying selection and we identify 670 conserved non-genic elements. We then use genome-wide polymorphism data from P. falciparum to describe short-term selective processes in this species and identify some candidate genes for balancing (diversifying) selection. Our analyses suggest that there are many functional elements in the non-genic regions of these genomes and that adaptive evolution has occurred more frequently in the protein-coding regions of the genome.  相似文献   

9.
In this work, the mitochondrial genomes of spotted halibut (Verasper variegatus) and barfin flounder (Verasper moseri) were completely sequenced. The entire mitochondrial genome sequences of the spotted halibut and barfin flounder were 17,273 and 17,588 bp in length, respectively. The organization of the two mitochondrial genomes was similar to that reported for other fish mitochondrial genomes, containing 37 genes (2 rRNAs, 22 tRNAs and 13 protein-coding genes) and two non-coding regions (the control region (CR) and the WANCY region). In the CR, the extended termination-associated sequence (ETAS), six central conserved blocks (CSB-A, B, C, D, E, F), three conserved sequence blocks (CSB1-3) and a 61-bp tandem repeat cluster at the end of CSB-3 were identified by similarity comparison with fishes and other vertebrates. The tandem repeat sequences show polymorphism among different individuals of the two species. The complete mitochondrial genomes of spotted halibut and barfin flounder should be useful for evolutionary studies of flatfishes and other vertebrate species.

10.
Some bacterial genomes are known to have low CpG dinucleotide frequencies, although the causes are not clearly understood. The frequency of CpG is suppressed significantly in the genome of Mycoplasma genitalium, but not in that of Mycoplasma pneumoniae. We compared orthologous gene pairs of the two closely related species to analyze CpG substitution patterns between these two genomes. We also divided genome sequences into three regions (protein-coding, noncoding, and RNA-coding) and obtained the CpG frequencies for each region for each organism. It was found that the observed/expected ratio of CpG dinucleotides is low in both the protein-coding and noncoding regions, while that ratio is in the normal range in the RNA-coding region. Our results indicate that CpG suppression of the Mycoplasma genome is not caused by (1) biased usage of amino acids; (2) biased usage of synonymous codons; or (3) methylation effects by the CpG methyltransferase in the genomes of their hosts. Instead, we consider it likely that a certain global pressure, such as genome-wide pressure for the advantages of DNA stability or replication, has the effect of decreasing CpG over the entire genome, which, in turn, resulted in the biased codon usage.
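The observed/expected CpG ratio used in this abstract is commonly computed as the CpG dinucleotide frequency divided by the product of the C and G mononucleotide frequencies. The sketch below follows that common definition; the abstract does not state the exact formula the authors used, so the normalisation here (dinucleotide positions counted as length minus one) and the function name `cpg_obs_exp` are assumptions.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: f(CG) / (f(C) * f(G)).

    Assumed definition, not taken from the abstract: f(CG) is the count of
    "CG" divided by the number of dinucleotide positions (len - 1), and
    f(C), f(G) are base counts divided by the sequence length.
    """
    seq = seq.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two bases")
    c = seq.count("C")
    g = seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # no CpG is possible, so report zero rather than divide by zero
    observed = seq.count("CG") / (n - 1)  # "CG" cannot overlap itself
    expected = (c / n) * (g / n)
    return observed / expected
```

A ratio well below 1 for a region would correspond to the CpG suppression the abstract reports for protein-coding and noncoding regions, while a ratio near 1 would correspond to the "normal range" reported for RNA-coding regions.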

11.
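To reconstruct phylogenetic relationships within the genus Comastoma (喉毛花属) and to clarify the species relationships between Comastoma polycladum (皱边喉毛花) and its close relatives, this study used the Illumina high-throughput sequencing platform to perform paired-end sequencing of 12 chloroplast genomes, obtaining a large number of high-quality clean reads for subsequent bioinformatic analysis. The results show that: (1) Genome differences among Comastoma species are small; all genomes are about 150 kb, with 131 genes in total, of which 81 are protein-coding. Nucleotide diversity is lower in the IR regions than in the SC regions, and coding regions are more conserved than non-coding regions. (2) Evolutionary analysis shows that almost all protein-coding genes are under purifying selection. (3) Codon usage bias analysis shows that 35 codons have RSCU values greater than 1, indicating that these codons are used more frequently; the various measures of codon usage bias indicate that the bias in Comastoma species is weak. (4) Phylogenetic analysis shows that the trees built from the CDS, codon-position, and intergenic-spacer datasets have highly consistent topologies, with high support for most branches. These results indicate that the chloroplast genomes of C. polycladum and its close relatives show no marked differences and that the samples cannot be clustered by species on the phylogenetic tree; they also provide a scientific basis for subsequent population genetic studies within Comastoma.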

12.
CpG islands are discrete regions of DNA with significantly greater frequencies of CpG doublets than bulk genomic DNA. They are most frequently associated with the 5'-ends of housekeeping genes and are involved in the regulation of their expression. In this study, the structure and evolution of CpG islands within genes of the myc family were evaluated with the protein-coding sequences of animals and their transducing viruses. These evaluations relied on a gene tree for the entire myc family to test the origins of CpG islands within their two protein-coding exons. Overall, CG-very rich and CG-rich islands are associated with exon 2 of the different myc genes of warm-blooded vertebrates and with exon 3 of the N-myc and s-myc sequences of mammals, but not birds. These overall distributions of well-developed islands can be related to the major transitions of the CG-rich genomes of warm-blooded vertebrates from the CG-poor ones of other animals. In turn, the greater variability of well-developed islands within exon 3 of the N-myc gene and among the different retrogenes of the myc family can be attributed to their reduced functional constraints, as evidenced by their limited and very restricted patterns of expression, respectively.  相似文献   

13.
Genes underlying important phenotypic differences between Plasmodium species, the causative agents of malaria, are frequently found in only a subset of species and cluster at dynamically evolving subtelomeric regions of chromosomes. We hypothesized that chromosome-internal regions of Plasmodium genomes harbour additional species subset-specific genes that underlie differences in human pathogenicity, human-to-human transmissibility, and human virulence. We combined sequence similarity searches with synteny block analyses to identify species subset-specific genes in chromosome-internal regions of six published Plasmodium genomes, including Plasmodium falciparum, Plasmodium vivax, Plasmodium knowlesi, Plasmodium yoelii, Plasmodium berghei, and Plasmodium chabaudi. To improve comparative analysis, we first revised incorrectly annotated gene models using homology-based gene finders and examined putative subset-specific genes within syntenic contexts. Confirmed subset-specific genes were then analyzed for their role in biological pathways and examined for molecular functions using publicly available databases. We identified 16 genes that are well conserved in the three primate parasites but not found in rodent parasites, including three key enzymes of the thiamine (vitamin B1) biosynthesis pathway. Thirteen genes were found to be present in both human parasites but absent in the monkey parasite P. knowlesi, including genes specifically upregulated in sporozoites or gametocytes that could be linked to parasite transmission success between humans. Furthermore, we propose 15 chromosome-internal P. falciparum-specific genes as new candidate genes underlying increased human virulence and detected a currently uncharacterized cluster of P. vivax-specific genes on chromosome 6 likely involved in erythrocyte invasion. In conclusion, Plasmodium species harbour many chromosome-internal differences in the form of protein-coding genes, some of which are potentially linked to human disease and thus promising leads for future laboratory research.

14.
MOTIVATION: As the number of fully sequenced prokaryotic genomes continues to grow rapidly, computational methods for reliably detecting protein-coding regions become even more important. Audic and Claverie (1998) Proc. Natl Acad. Sci. USA, 95, 10026-10031, have proposed a clustering algorithm for protein-coding regions in microbial genomes. The algorithm is based on three Markov models of order k associated with subsequences extracted from a given genome. The parameters of the three Markov models are recursively updated by the algorithm, which, in simulations, always appears to converge to a unique stable partition of the genome. The partition corresponds to three kinds of regions: (1) coding on the direct strand, (2) coding on the complementary strand, (3) non-coding. RESULTS: Here we provide an explanation for the convergence of the algorithm by observing that it is essentially a form of the expectation maximization (EM) algorithm applied to the corresponding mixture model. We also provide a partial justification for the uniqueness of the partition based on identifiability. Other possible variations and improvements are briefly discussed.
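The iterative scheme this abstract describes (train order-k Markov models, reassign subsequences, repeat until the partition is stable) can be sketched roughly as follows. This is a hard-assignment simplification written for illustration, not the published Audic and Claverie procedure or the full EM formulation discussed in the abstract: the names `train_markov`, `log_likelihood`, and `recluster`, the add-one pseudocount, and the restriction to windows containing only A, C, G, T are all assumptions.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"


def train_markov(windows, k, pseudocount=1.0):
    """Estimate log transition probabilities of an order-k Markov model."""
    counts = defaultdict(lambda: defaultdict(float))
    for w in windows:
        for i in range(k, len(w)):
            counts[w[i - k:i]][w[i]] += 1.0
    model = {}
    for ctx in map("".join, product(ALPHABET, repeat=k)):
        total = sum(counts[ctx][b] for b in ALPHABET) + pseudocount * len(ALPHABET)
        model[ctx] = {
            b: math.log((counts[ctx][b] + pseudocount) / total) for b in ALPHABET
        }
    return model


def log_likelihood(window, model, k):
    """Log-likelihood of a window, ignoring its first k bases."""
    return sum(model[window[i - k:i]][window[i]] for i in range(k, len(window)))


def recluster(windows, labels, k, n_classes=3, max_iter=50):
    """Alternate model training and hard reassignment until labels stop changing."""
    labels = list(labels)
    for _ in range(max_iter):
        models = [
            train_markov([w for w, lab in zip(windows, labels) if lab == c], k)
            for c in range(n_classes)
        ]
        new_labels = [
            max(range(n_classes), key=lambda c: log_likelihood(w, models[c], k))
            for w in windows
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels
```

In this sketch each window is assigned to exactly one model per round; the EM interpretation in the abstract would instead weight each window by its posterior probability under each mixture component.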

15.
The natural cycles of night and day, and their length, remain stable in near-equatorial African regions but vary with latitude and season in Eurasia. This new environmental factor might have shaped the adaptation of circadian rhythms of Eurasians after the out-of-Africa dispersal of their African ancestors. To identify the genetic signatures of this adaptation, geographic variation in allele frequencies of more than 2300 genetic variants was analyzed using data from 5 African and 11 Eurasian populations of the 1000 Genomes Project. The genetic signatures of latitude-dependent polygenic selection were found more frequently within non-coding DNA regions associated with morningness–eveningness in genome-wide association studies (GWASs) than among polymorphisms implicated by GWASs of other traits/diseases and among polymorphisms sampled from pseudogenes and from protein-coding regions in either circadian clock genes or reference genes. Some of these variants were located within Neanderthal introgressions in the genomes of Eurasians.

16.
Genome sequences are annotated by computational prediction of coding sequences, followed by similarity searches such as BLAST, which provide a layer of possible functional information. While the existence of processes such as alternative splicing complicates matters for eukaryote genomes, the view of bacterial genomes as a linear series of closely spaced genes leads to the assumption that computational annotations that predict such arrangements completely describe the coding capacity of bacterial genomes. We undertook a proteomic study to identify proteins expressed by Pseudomonas fluorescens Pf0-1 from genes that were not predicted during the genome annotation. Mapping peptides to the Pf0-1 genome sequence identified sixteen non-annotated protein-coding regions, of which nine were antisense to predicted genes, six were intergenic, and one read in the same direction as an annotated gene but in a different frame. The expression of all but one of the newly discovered genes was verified by RT-PCR. Few clues as to the function of the new genes were gleaned from informatic analyses, but potential orthologs in other Pseudomonas genomes were identified for eight of the new genes. The 16 newly identified genes improve the quality of the Pf0-1 genome annotation, and the detection of antisense protein-coding genes indicates the under-appreciated complexity of bacterial genome organization.  相似文献   

17.
18.
The sinipercids are a group of 12 species of freshwater percoid fish endemic to East Asia, and their phylogenetic placements have perplexed generations of taxonomists. We cloned and sequenced the complete mitochondrial DNA (mtDNA) of three sinipercid fishes (Siniperca chuatsi, S. kneri, and S. scherzeri) to characterize and compare their mitochondrial genomes. The mitochondrial genomes of S. chuatsi, S. kneri, and S. scherzeri were 16,496, 17,002, and 16,585 bp in length, respectively. The organization of the three mitochondrial genomes is similar to that reported for other fish mitochondrial genomes, comprising 37 genes (13 protein-coding genes, 2 ribosomal RNAs, and 22 transfer RNAs) and a major non-coding control region. Among the 13 protein-coding genes of all three sinipercid fishes, three reading-frame overlaps were found on the same strand. There is an 81-bp tandem repeat cluster at the end of CSB-3 in the S. scherzeri control region. The complete mitochondrial genomes of the three sinipercids should be useful for evolutionary studies of sinipercids and other vertebrate species.

19.
Ogoh K  Ohmiya Y 《Gene》2004,327(1):131-139
The primary structure of the mitochondrial genome of the bioluminescent crustacean Vargula hilgendorfii, the sea-firefly (Arthropoda, Crustacea, Ostracoda), was sequenced using the transposon Tn5. The genome (15,923 bp) contains the same 37 genes (two ribosomal RNAs, 22 transfer RNAs, and 13 protein-coding genes) found in other Arthropoda. Interestingly, duplicate control regions (fragments of 778 and 855 bp) and triplicate short repeat sequences (fragments of 49 bp) occur. The AT composition of the protein-coding genes is lower than that of the published complete mitochondrial genomes within the Arthropoda. In terms of gene arrangement, 13 transfer RNA genes and two protein-coding genes have moved and been inserted directly or inversely relative to the typical Arthropoda order.

20.
Copepoda is the most diverse and abundant group of crustaceans, but its phylogenetic relationships are ambiguous. Mitochondrial (mt) genomes are useful for studying evolutionary history, but only six complete Copepoda mt genomes have been made available and these have extremely rearranged genome structures. This study determined the mt genome of Calanus hyperboreus, making it the first reported Arctic copepod mt genome and the first complete mt genome of a calanoid copepod. The mt genome of C. hyperboreus is 17,910 bp in length and it contains the entire set of 37 mt genes, including 13 protein-coding genes, 2 rRNAs, and 22 tRNAs. It has a very unusual gene structure, including the longest control region reported for a crustacean, a large tRNA gene cluster, and reversed GC skews in 11 out of 13 protein-coding genes (84.6%). Despite the unusual features, comparing this genome to published copepod genomes revealed retained pan-crustacean features, as well as a conserved calanoid-specific pattern. Our data provide a foundation for exploring the calanoid pattern and the mechanisms of mt gene rearrangement in the evolutionary history of the copepod mt genome.  相似文献   
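GC skew, mentioned in this abstract, is conventionally defined as (G - C) / (G + C) for a given strand. The sketch below computes it per sequence and counts how many genes in a collection have negative skew; the function names and the gene-name-to-sequence dictionary are hypothetical, and whether a negative or positive value counts as "reversed" depends on the strand convention of the study, which the abstract does not state.

```python
def gc_skew(seq: str) -> float:
    """GC skew of one sequence: (G - C) / (G + C); 0.0 if it has no G or C."""
    seq = seq.upper()
    g = seq.count("G")
    c = seq.count("C")
    if g + c == 0:
        return 0.0
    return (g - c) / (g + c)


def count_negative_skew(genes: dict[str, str]) -> int:
    """Count genes (hypothetical name -> sequence mapping) with negative GC skew."""
    return sum(1 for seq in genes.values() if gc_skew(seq) < 0)
```

For reference, the proportion quoted in the abstract follows directly from the counts: 11 of 13 genes is 11/13, about 84.6%.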
