Similar articles
Found 20 similar articles; search took 15 ms.
1.
Gene identification in novel eukaryotic genomes by self-training algorithm   (total citations: 8; self-citations: 0; citations by others: 8)
Finding new protein-coding genes is one of the most important goals of eukaryotic genome sequencing projects. However, genomic organization of novel eukaryotic genomes is diverse and ab initio gene finding tools tuned up for previously studied species are rarely suitable for efficacious gene hunting in DNA sequences of a new genome. Gene identification methods based on cDNA and expressed sequence tag (EST) mapping to genomic DNA or those using alignments to closely related genomes rely either on existence of abundant cDNA and EST data and/or availability on reference genomes. Conventional statistical ab initio methods require large training sets of validated genes for estimating gene model parameters. In practice, neither one of these types of data may be available in sufficient amount until rather late stages of the novel genome sequencing. Nevertheless, we have shown that gene finding in eukaryotic genomes could be carried out in parallel with statistical models estimation directly from yet anonymous genomic DNA. The suggested method of parallelization of gene prediction with the model parameters estimation follows the path of the iterative Viterbi training. Rounds of genomic sequence labeling into coding and non-coding regions are followed by the rounds of model parameters estimation. Several dynamically changing restrictions on the possible range of model parameters are added to filter out fluctuations in the initial steps of the algorithm that could redirect the iteration process away from the biologically relevant point in parameter space. Tests on well-studied eukaryotic genomes have shown that the new method performs comparably or better than conventional methods where the supervised model training precedes the gene prediction step. Several novel genomes have been analyzed and biologically interesting findings are discussed. 
Thus, a self-training algorithm that had been assumed feasible only for prokaryotic genomes has now been developed for ab initio eukaryotic gene identification.
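The alternation described in this abstract (label the sequence, re-estimate the model, repeat) can be illustrated with a deliberately small sketch. It assumes a toy two-state, per-nucleotide HMM rather than the gene model the authors use, it leaves out their dynamically changing restrictions on parameter ranges, and all function names are hypothetical.

```python
import math

ALPHABET = "ACGT"  # toy alphabet; state 0 = non-coding, state 1 = coding (assumption)

def viterbi(seq, start, trans, emit):
    """Return the most probable state path (log space, first-order HMM)."""
    n = len(start)
    score = [math.log(start[s]) + math.log(emit[s][seq[0]]) for s in range(n)]
    back = []
    for ch in seq[1:]:
        prev_score, score, ptr = score, [], []
        for s in range(n):
            best = max(range(n), key=lambda r: prev_score[r] + math.log(trans[r][s]))
            ptr.append(best)
            score.append(prev_score[best] + math.log(trans[best][s]) + math.log(emit[s][ch]))
        back.append(ptr)
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):  # follow back-pointers from the last position
        state = ptr[state]
        path.append(state)
    return path[::-1]

def reestimate(seq, path, n_states, pseudo=1.0):
    """Count emissions and transitions along the labelled path, then normalise."""
    emit_counts = [{ch: pseudo for ch in ALPHABET} for _ in range(n_states)]
    trans_counts = [[pseudo] * n_states for _ in range(n_states)]
    for i, (ch, s) in enumerate(zip(seq, path)):
        emit_counts[s][ch] += 1
        if i > 0:
            trans_counts[path[i - 1]][s] += 1
    emit = [{ch: c / sum(row.values()) for ch, c in row.items()} for row in emit_counts]
    trans = [[c / sum(row) for c in row] for row in trans_counts]
    return trans, emit

def viterbi_training(seq, start, trans, emit, max_rounds=20):
    """Alternate labelling and re-estimation until the labelling stops changing."""
    path = None
    for _ in range(max_rounds):
        new_path = viterbi(seq, start, trans, emit)
        if new_path == path:
            break
        path = new_path
        trans, emit = reestimate(seq, path, len(start))
    return path, trans, emit

if __name__ == "__main__":
    toy = "AT" * 20 + "GC" * 20 + "AT" * 20
    start = [0.5, 0.5]
    trans = [[0.9, 0.1], [0.1, 0.9]]
    emit = [{"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
            {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2}]
    labels, _, _ = viterbi_training(toy, start, trans, emit)
    print("".join(str(s) for s in labels))
```

The pseudocounts keep every probability nonzero so the logarithms stay defined; the start probabilities are held fixed for simplicity, and the initial parameters must themselves be nonzero.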

2.
We describe FrameD, a program that predicts coding regions in prokaryotic and matured eukaryotic sequences. Initially targeted at gene prediction in GC-rich bacterial genomes, the gene model used in FrameD also allows genes to be predicted in the presence of frameshifts and partially undetermined sequences, which makes it well suited to gene prediction and frameshift correction in unfinished sequences such as EST and EST cluster sequences. Like recent eukaryotic gene prediction programs, FrameD can also take protein similarity information into account both in its prediction and in its graphical output. Its performance is evaluated on different bacterial genomes. The web site (http://genopole.toulouse.inra.fr/bioinfo/FrameD/FD) allows direct prediction, sequence correction and translation, and the ability to learn new models for new organisms.

3.

Background

The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.

Results

As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.

Conclusions

In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be better able to explore promising extensions to the GHMM framework, thereby improving the state of the art in gene prediction techniques.

4.
Recent advances in gene structure prediction   (total citations: 9; self-citations: 0; citations by others: 9)
De novo gene predictors are programs that predict the exon-intron structures of genes using the sequences of one or more genomes as their only input. In the past two years, dual-genome de novo predictors, which exploit local rates and patterns of mutation inferred from alignments between two genomes, have led to significant improvements in accuracy. Systems that exploit more than two genomes simultaneously have only recently begun to appear and are not yet competitive on practical tasks, but offer the greatest hope for near-term improvements. Dual-genome de novo prediction for compact eukaryotic genomes such as those of Arabidopsis thaliana and Caenorhabditis elegans is already quite accurate. Although mammalian gene prediction lags behind in accuracy, it is yielding ever more useful results. Coupled with significant improvements in pseudogene detection methods, which have eliminated many false positives, we have reached the point where de novo gene predictions are being used as hypotheses to drive experimental annotation via systematic RT-PCR and sequencing.

5.
6.
Codon usages in different gene classes of the Escherichia coli genome   (total citations: 3; self-citations: 0; citations by others: 3)
A new measure for assessing codon bias of one group of genes with respect to a second group of genes is introduced. In this formulation, codon bias correlations for Escherichia coli genes are evaluated for level of expression, for contrasts along genes, for genes in different 200 kb (or longer) contigs around the genome, for effects of gene size, for variation over different function classes, for codon bias in relation to possible lateral transfer and for dicodon bias for some gene classes. Among the function classes, codon biases of ribosomal proteins are the most deviant from the codon frequencies of the average E. coli gene. Other classes of ‘highly expressed genes’ (e.g. amino acyl tRNA synthetases, chaperonins, modification genes essential to translation activities) show less extreme codon biases. Consistently for genes with experimentally determined expression rates in the exponential growth phase, those of highest molar abundances are more deviant from the average gene codon frequencies and are more similar in codon frequencies to the average ribosomal protein gene. Independent of gene size, the codon biases in the 5′ third of genes deviate by more than a factor of two from those in the middle and 3′ thirds. In this context, there appear to be conflicting selection pressures imposed by the constraints of ribosomal binding, or more generally the early phase of protein synthesis (about the first 50 codons) may be more biased than the complete nascent polypeptide. In partitioning the E. coli genome into 10 equal lengths, pronounced differences in codon site 3 G+C frequencies accumulate. Genes near to oriC have 5% greater codon site 3 G+C frequencies than do genes from the ter region. This difference also is observed between small (100–300 codons) and large (>800 codons) genes. This result contrasts with that for eukaryotic genomes (including human, Caenorhabditis elegans and yeast) where long genes tend to have site 3 more AT rich than short genes. 
Many of the above results are special for E. coli genes and do not apply to genes of most bacterial genomes. A gene is defined as alien (possibly horizontally transferred) if its codon bias relative to the average gene exceeds a high threshold and the codon bias relative to ribosomal proteins is also appropriately high. These are identified, including four clusters (operons). The bulk of these genes have no known function.
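The abstract does not state the formula behind its group-versus-group codon bias measure. As an illustration only, the sketch below assumes one plausible form: for each amino acid, codon frequencies are normalised within the synonymous family, the absolute differences between the two gene groups are summed, and the result is weighted by the amino acid's frequency in the first group. The function names and this exact form are assumptions, not taken from the source.

```python
from collections import Counter
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(genes):
    """Count sense codons over a group of in-frame coding sequences."""
    counts = Counter()
    for gene in genes:
        gene = gene.upper()
        for i in range(0, len(gene) - len(gene) % 3, 3):
            codon = gene[i:i + 3]
            # Skip stop codons and anything containing ambiguous bases.
            if CODON_TO_AA.get(codon, "*") != "*":
                counts[codon] += 1
    return counts

def relative_codon_bias(group_f, group_c):
    """Assumed measure: bias of gene group F relative to gene group C."""
    f, c = codon_counts(group_f), codon_counts(group_c)
    total_f = sum(f.values())
    bias = 0.0
    for aa in set(CODON_TO_AA.values()) - {"*"}:
        family = [cd for cd, a in CODON_TO_AA.items() if a == aa]
        f_aa = sum(f[cd] for cd in family)
        c_aa = sum(c[cd] for cd in family)
        if f_aa == 0 or c_aa == 0:
            continue  # amino acid absent from one group: no comparison possible
        weight = f_aa / total_f  # amino acid frequency in F
        bias += weight * sum(abs(f[cd] / f_aa - c[cd] / c_aa) for cd in family)
    return bias

if __name__ == "__main__":
    print(relative_codon_bias(["AAAAAA"], ["AAGAAG"]))
```

Under this assumed definition the value is 0 when both groups use synonymous codons identically and grows as their usage diverges, so a threshold on it could play the role of the "high threshold" mentioned in the abstract.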

7.
We present evidence supporting the notion that codon usage (CU) compatibility between foreign genes and recipient genomes is an important prerequisite to assess the selective advantage of imported functions, and therefore to increase the fixation probability of horizontal gene transfer (HGT) events. This contrasts with the current tendency in research to predict recent HGTs in prokaryotes by assuming that acquired genes generally display poor CU. By looking at the CU level (poor, typical, or rich) exhibited by putative xenologs still resembling their original CU, we found that most alien genes predominantly present typical CU immediately upon introgression, thereby suggesting that the role of CU amelioration in HGT has been overemphasized. In our strategy, we first scanned a representative set of 103 complete prokaryotic genomes for all pairs of candidate xenologs (exported/imported genes) displaying similar CU. We applied additional filtering criteria, including phylogenetic validations, to enhance the reliability of our predictions. Our approach makes no assumptions about the CU of foreign genes being typical or atypical within the recipient genome, thus providing a novel unbiased framework to study the evolutionary dynamics of HGT.

8.
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs that commonly use machine learning methods have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because an abundant training set of validated genes is absent. Here we present a new gene-finding program, SCGPred, to improve the accuracy of prediction by combining multiple sources of evidence. SCGPred can run as a supervised method in previously well-studied genomes and as an unsupervised one in novel genomes. In tests on datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred shows a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, which is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.

9.
Biophysical Journal, 2022, 121(22): 4311-4324
The genetic code gives precise instructions on how to translate codons into amino acids. Due to the degeneracy of the genetic code (18 out of 20 amino acids are encoded by more than one codon), more information can be stored in a base-pair sequence. Indeed, various types of additional information have been discussed in the literature, e.g., the positioning of nucleosomes along eukaryotic genomes and the modulation of translation efficiency in ribosomes to influence cotranslational protein folding. The purpose of this study is to show that it is indeed possible to carry more than one additional layer of information on top of a gene. In particular, we show how much translation efficiency and nucleosome positioning can be adjusted simultaneously without changing the encoded protein. We achieve this by mapping genes on weighted graphs that contain all synonymous genes, and then finding shortest paths through these graphs. This enables us, for example, to readjust the disrupted translational efficiency profile after a gene has been introduced from one organism (e.g., human) into another (e.g., yeast) without greatly changing the nucleosome landscape intrinsically encoded by the DNA molecule.
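The graph idea in this abstract can be sketched as a layered graph: layer i holds the synonymous codons of amino acid i, every codon links to every codon in the next layer, and a shortest path picks one codon per position. The sketch below is an assumption about how such a search might look, not the authors' implementation; `node_cost` and `edge_cost` are hypothetical stand-ins for whatever translation-efficiency and nucleosome-positioning terms the study uses.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYNONYMS = {}
for _codon, _aa in CODON_TO_AA.items():
    SYNONYMS.setdefault(_aa, []).append(_codon)

def best_synonymous_gene(protein, node_cost, edge_cost):
    """Shortest path through the layered synonymous-codon graph.

    node_cost(i, codon)    -> cost of using `codon` at position i (hypothetical)
    edge_cost(prev, codon) -> cost of placing `codon` after `prev` (hypothetical)
    Returns (DNA sequence, total cost).
    """
    # Each layer maps codon -> (best cost so far, predecessor codon).
    layers = [{c: (node_cost(0, c), None) for c in SYNONYMS[protein[0]]}]
    for i, aa in enumerate(protein[1:], start=1):
        layer = {}
        for c in SYNONYMS[aa]:
            prev, cost = min(
                ((p, pc[0] + edge_cost(p, c)) for p, pc in layers[-1].items()),
                key=lambda t: t[1],
            )
            layer[c] = (cost + node_cost(i, c), prev)
        layers.append(layer)
    codon = min(layers[-1], key=lambda c: layers[-1][c][0])
    total = layers[-1][codon][0]
    path = [codon]
    for layer in reversed(layers[1:]):  # walk predecessors back to layer 0
        codon = layer[codon][1]
        path.append(codon)
    return "".join(reversed(path)), total

if __name__ == "__main__":
    # Toy costs: prefer AT-rich codons, penalise a repeated base at codon junctions.
    gc_penalty = lambda i, c: sum(b in "GC" for b in c)
    junction = lambda p, c: 1.0 if p[2] == c[0] else 0.0
    print(best_synonymous_gene("MKW", gc_penalty, junction))
```

Because edges only join adjacent layers, a single left-to-right pass suffices and the running time grows linearly with protein length; costs that depend on longer sequence windows would need a larger state per node.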

10.
GeneMark.hmm: new solutions for gene finding.   (total citations: 35; self-citations: 0; citations by others: 35)
The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.

11.
Some of the principal transitions in the evolution of eukaryotes are characterized by engulfment of prokaryotes by primitive eukaryotic cells. In particular, approximately 1.6 billion years ago, engulfment of a cyanobacterium that became the ancestor of chloroplasts and other plastids gave rise to Plantae, the major branch of eukaryotes comprised of glaucophytes, red algae, green algae, and green plants. After endosymbiosis, there was large-scale migration of genes from the endosymbiont to the nuclear genome of the host such that approximately 18% of the nuclear genes in Arabidopsis appear to be of chloroplast origin. To gain insights into the process of evolution of gene structure in these, originally, intronless genes, we compared the properties and the evolutionary dynamics of introns in genes of plastid origin and ancestral eukaryotic genes in Arabidopsis, poplar, and rice genomes. We found that intron densities in plastid-derived genes were slightly but significantly lower than those in ancestral eukaryotic genes. Although most of the introns in both categories of genes were conserved between monocots (rice) and dicots (Arabidopsis and poplar), lineage-specific intron gain was more pronounced in plastid-derived genes than in ancestral genes, whereas there was no significant difference in the intron loss rates between the 2 classes of genes. Thus, after the transfer to the nuclear genome, the plastid-derived genes have undergone a massive intron invasion that, by the time of the divergence of dicots and monocots (150-200 MYA), yielded intron densities only slightly lower than those in ancestral genes. Nevertheless, the accumulation of introns in plastid-derived genes appears not to have reached saturation and continues to this time, albeit at a low rate. The overall pattern of intron gain and loss in the plastid-derived genes is shaped by this continuing gain and the more general tendency for loss that is characteristic of the recent evolution of plant genes.  

12.
13.
Chatterji S, Pachter L. Genomics, 2007, 90(1): 44-48
The exon-intron structure of eukaryotic genes allows for phenomena such as alternative splicing, nonsense-mediated decay, and regulation through untranslated regions. However, the evolution of the exon structure of genes is not well elucidated because of limited and phylogenetically sparse data sets. In this study, we use the phylogenetically diverse sequencing of the ENCODE regions to study gene structure evolution in mammalian genomes. This first phylogenetically diverse study of gene structure changes offers insights into the mode and tempo of mammalian gene structure evolution. The genes undergoing structure changes appear to be moderately to highly expressed in germline cells and show levels of selection similar to those of other ENCODE genes. Patterns of gene duplication of the affected genes are more complex than expected. The number of sampled genomes is sufficiently dense to infer that certain gene duplications happened after intron loss. Thus, although gene duplication is highly correlated with intron loss, we conclude that structural changes in genes are not necessarily due to a loss of constraint following gene duplication as previously suggested.

14.
Members of the Deinococcaceae (e.g., Thermus, Meiothermus, Deinococcus) contain A/V-ATPases typically found in Archaea or Eukaryotes which were probably acquired by horizontal gene transfer. Two methods were used to quantify the extent to which archaeal or eukaryotic genes have been acquired by this lineage. Screening of a Meiothermus ruber library with probes made against Thermoplasma acidophilum DNA yielded a number of clones which hybridized more strongly than background. One of these contained the prolyl tRNA synthetase (RS) gene. Phylogenetic analysis shows the M. ruber and D. radiodurans prolyl RS to be more closely related to archaeal and eukaryal forms of this gene than to the typical bacterial type. Using a bioinformatics approach, putative open reading frames (ORFs) from the prerelease version of the D. radiodurans genome were screened for genes more closely related to archaeal or eukaryotic genes. Putative ORFs were searched against representative genomes from each of the three domains using automated BLAST. ORFs showing the highest matches against archaeal and eukaryotic genes were collected and ranked. Among the top-ranked hits were the A/V-ATPase catalytic and noncatalytic subunits and the prolyl RS genes. Using phylogenetic methods, ORFs were analyzed and trees assessed for evidence of horizontal gene transfer. Of the 45 genes examined, 20 showed topologies in which D. radiodurans homologues clearly group with eukaryotic or archaeal homologues, and 17 additional trees were found to show probable evidence of horizontal gene transfer. Compared to the total number of ORFs in the genome, those that can be identified as having been acquired from Archaea or Eukaryotes are relatively few (approximately 1%), suggesting that interdomain transfer is rare.

15.

Background

The influence of lateral gene transfer on gene origins and biology in eukaryotes is poorly understood compared with those of prokaryotes. A number of independent investigations focusing on specific genes, individual genomes, or specific functional categories from various eukaryotes have indicated that lateral gene transfer does indeed affect eukaryotic genomes. However, the lack of common methodology and criteria in these studies makes it difficult to assess the general importance and influence of lateral gene transfer on eukaryotic genome evolution.

Results

We used a phylogenomic approach to systematically investigate lateral gene transfer affecting the proteomes of thirteen, mainly parasitic, microbial eukaryotes, representing four of the six eukaryotic super-groups. All of the genomes investigated have been significantly affected by prokaryote-to-eukaryote lateral gene transfers, dramatically affecting the enzymes of core pathways, particularly amino acid and sugar metabolism, but also providing new genes of potential adaptive significance in the life of parasites. A broad range of prokaryotic donors is involved in such transfers, but there is clear and significant enrichment for bacterial groups that share the same habitats, including the human microbiota, as the parasites investigated.

Conclusions

Our data show that ecology and lifestyle strongly influence gene origins and opportunities for gene transfer and reveal that, although the outlines of the core eukaryotic metabolism are conserved among lineages, the genes making up those pathways can have very different origins in different eukaryotes. Thus, from the perspective of the effects of lateral gene transfer on individual gene ancestries in different lineages, eukaryotic metabolism appears to be chimeric.

16.
Resolving the structure of the eukaryotic tree of life remains one of the most important and challenging tasks facing biologists. The notion of six eukaryotic 'supergroups' has recently gained some acceptance, and several papers in 2007 suggest that resolution of higher taxonomic levels is possible. However, in organisms that acquired photosynthesis via secondary (i.e. eukaryote-eukaryote) endosymbiosis, the host nuclear genome is a mosaic of genes derived from two (or more) nuclei, a fact that is often overlooked in studies attempting to reconstruct the deep evolutionary history of eukaryotes. Accurate identification of gene transfers and replacements involving eukaryotic donor and recipient genomes represents a potentially formidable challenge for the phylogenomics community as more protist genomes are sequenced and concatenated data sets grow.

17.
Wada and colleagues have shown that, whether prokaryotic or eukaryotic, each gene has a "homostabilising propensity" to adopt a relatively uniform GC percentage (GC%). Accordingly, each gene can be viewed as a "microisochore" occupying a discrete GC% niche of relatively uniform base composition amongst its fellow genes. Although first, second and third codon positions usually differ in GC%, each position tends to maintain a uniform, gene-specific GC% value. Thus, within a genome, genic GC% values can cover a wide range. This is most evident at third codon positions, which are least constrained by amino acid encoding needs. In 1991, Wada and colleagues further noted that, within a phylogenetic group, genomic GC% values can also cover a wide range. This is again most evident at third codon positions. Thus, the dispersion of GC% values among genes within a genome matches the dispersion of GC% values among genomes within a phylogenetic group. Wada described the context-independence of plots of different codon position GC% values against total GC% as a "universal" characteristic. Several studies relate this to recombination. We have confirmed that third codon positions usually relate more to the genes that contain them than to the species. However, in genomes with extreme GC% values (low or high), third codon positions tend to maintain a constant GC%, thus relating more to the species than to the genes that contain them. Genes in an extreme-GC% genome collectively span a smaller GC% range, and mainly rely on first and second codon positions for differentiation as "microisochores". Our results are consistent with the view that differences in GC% serve to recombinationally isolate both genome sectors (facilitating gene duplication) and genomes (facilitating genome duplication, e.g. speciation). In intermediate-GC% genomes, conflict between the needs of the species and the needs of individual genes within that species is minimal. 
However, in extreme-GC% genomes there is a conflict, which is settled in favour of the species (i.e. group selection) rather than in favour of the gene (genic selection).
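The quantity this abstract keeps returning to, GC% at the first, second and third codon positions of a gene, is simple to compute. A minimal sketch follows, assuming the input is an in-frame coding sequence; the function name is hypothetical and trailing bases that do not complete a codon are ignored.

```python
def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) as percentages for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any incomplete trailing codon
    result = []
    for pos in range(3):
        bases = cds[pos:usable:3]  # every third base starting at this codon position
        gc = sum(b in "GC" for b in bases)
        result.append(100.0 * gc / len(bases) if bases else 0.0)
    return tuple(result)

if __name__ == "__main__":
    # Codons ATG, GCC, TAA: one G/C at position 1, one at position 2, two at position 3.
    print(gc_by_codon_position("ATGGCCTAA"))
```

Applying the function to every gene in a genome and plotting each position's GC% against the gene's total GC% would give the kind of per-position comparison the abstract describes.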

18.
Recent years have witnessed a breathtaking increase in the availability of genome sequence data, providing evidence of the highly duplicate nature of eukaryotic genomes. Plants are exceptional among eukaryotic organisms in that duplicate loci compose a large fraction of their genomes, partly because of the frequent occurrence of polyploidy (or whole-genome duplication) events. Tandem gene duplication and transposition have also contributed to the large number of duplicated genes in plant genomes. Evolutionary analyses allowed the dynamics of duplicate gene evolution to be studied and several models were proposed. It seems that, over time, many duplicated genes were lost and some of those that were retained gained new functions and/or expression patterns (neofunctionalization) or subdivided their functions and/or expression patterns between them (subfunctionalization). Recent studies have provided examples of genes that originated by duplication with successive diversification within plants. In this review, we focused on the TEL (TERMINAL EAR1-like) genes to illustrate such mechanisms. Having emerged from the mei2 gene family, these TEL genes are likely to be land plant-specific. Phylogenetic analyses revealed one or two TEL copies per diploid genome. TEL gene degeneration and loss seem to have occurred in several Angiosperm species such as poplar and maize. In Arabidopsis thaliana, whose genome experienced at least three polyploidy events followed by massive gene loss and genomic reorganization, two TEL genes were retained and two new shorter TEL-like (MCT) genes emerged. Molecular and expression analyses suggest sub- and neofunctionalization events for these genes, but confirmation will come from their functional characterization.

19.
20.
Fungi comprise a large monophyletic group of uni- and multicellular eukaryotic organisms in which many species are of economic or medical importance. Fungal genomes are variable in size (13–42 Mb), and multicellular species support true spatial and temporal cell-type-specific regulation of gene expression. In a 38.8-kb Aspergillus nidulans contiguous genomic DNA region, a transposable element and 12 potential genes were identified, 7 similar to genes in other organisms. This observation is consistent with the prediction that multicellular ascomycetous fungi harbor 8000–9000 genes in a 36-Mb average genome. Thus, the genomic DNA sequence of filamentous fungi will provide substantial amounts of genetic and functional information that is not available in yeast, for the human and other metazoan minimal gene complement.
