Similar articles
 Found 20 similar articles; search took 15 ms
1.

Background

The Generalized Hidden Markov Model (GHMM) has proven a useful framework for the task of computational gene prediction in eukaryotic genomes, due to its flexibility and probabilistic underpinnings. As the focus of the gene finding community shifts toward the use of homology information to improve prediction accuracy, extensions to the basic GHMM model are being explored as possible ways to integrate this homology information into the prediction process. Particularly prominent among these extensions are those techniques which call for the simultaneous prediction of genes in two or more genomes at once, thereby increasing significantly the computational cost of prediction and highlighting the importance of speed and memory efficiency in the implementation of the underlying GHMM algorithms. Unfortunately, the task of implementing an efficient GHMM-based gene finder is already a nontrivial one, and it can be expected that this task will only grow more onerous as our models increase in complexity.

Results

As a first step toward addressing the implementation challenges of these next-generation systems, we describe in detail two software architectures for GHMM-based gene finders, one comprising the common array-based approach, and the other a highly optimized algorithm which requires significantly less memory while achieving virtually identical speed. We then show how both of these architectures can be accelerated by a factor of two by optimizing their content sensors. We finish with a brief illustration of the impact these optimizations have had on the feasibility of our new homology-based gene finder, TWAIN.

Conclusions

In describing a number of optimizations for GHMM-based gene finders and making available two complete open-source software systems embodying these methods, it is our hope that others will be better equipped to explore promising extensions to the GHMM framework, thereby improving the state of the art in gene prediction techniques.

2.
Development of joint application strategies for two microbial gene finders   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: As a starting point in the annotation of bacterial genomes, gene finding programs are used to predict functional elements in the DNA sequence. Given the faster pace and increasing number of genome projects currently underway, it is becoming especially important to have high-performing methods for this task. RESULTS: This study describes the development of joint application strategies that combine the strengths of two microbial gene finders to improve overall gene finding performance. Critica is very specific in the detection of similarity-supported genes, as it uses an approach based on comparative sequence analysis. Glimmer employs a very sophisticated model of genomic sequence properties and is also sensitive in the detection of organism-specific genes. Based on a data set of 113 microbial genome sequences, we optimized a combined application approach using different parameters relevant to the gene finding problem. This results in a significant improvement in specificity, while sensitivity remains similar to that of Glimmer. The improvement is especially pronounced for GC-rich genomes. The method is currently being applied to the annotation of several microbial genomes. AVAILABILITY: The methods described have been implemented within the gene prediction component of the GenDB genome annotation system.

3.

Background  

Generalized hidden Markov models (GHMMs) appear to be approaching acceptance as a de facto standard for state-of-the-art ab initio gene finding, as evidenced by the recent proliferation of GHMM implementations. While prevailing methods for modeling and parsing genes using GHMMs have been described in the literature, little attention has been paid as of yet to their proper training. The few hints available in the literature together with anecdotal observations suggest that most practitioners perform maximum likelihood parameter estimation only at the local submodel level, and then attend to the optimization of global parameter structure using some form of ad hoc manual tuning of individual parameters.

4.
MOTIVATION: Cellular pathways exhibit coordinated regulatory activity, and previous work has confirmed that genes in the same pathway have similar expression patterns. However, the complexity of regulation in biological systems causes expression relationships between genes to display multiple patterns, such as linear, non-linear, local, global, linear with time delay, non-linear with time delay, monotonic and non-monotonic, which should be the explicit representation of the cell's internal regulatory mechanisms at the mRNA level. To investigate the relationships between these patterns, our work aims to systematically reveal gene-expression relationship patterns in cellular pathways and to check for the existence of a dominant gene-expression pattern. Through a large-scale analysis of gene expression in three eukaryotic species, Saccharomyces cerevisiae, Caenorhabditis elegans and human, we constructed a gene coexpression pattern tree to illustrate the different patterns and their interrelations systematically and hierarchically. RESULTS: The results show that the linear pattern is the dominant expression pattern within a pathway. The time-shifted pattern is another important relationship pattern. Many genes from different pathways also show coexpression patterns. The non-linear, non-monotonic and time-delayed relationship patterns reflect remote interactions between genes in cellular processes. Gene coexpression within pathways differs among species. Genes in S. cerevisiae and C. elegans show strong coexpression relationships; in C. elegans especially, coexpression is more universal and stronger due to its special array of genes. In human, however, gene coexpression is not apparent, and the human genome involves more complicated functional relationships. In conclusion, different patterns corresponding to different coordinating behaviors coexist. The pattern trees of the different species give a comprehensive insight into gene expression activity in the cellular society.

5.
Computational gene prediction and the identification of alternatively spliced isoforms have always been challenging tasks. In this paper, we describe the performance of three gene/exon finding programmes, namely Fex, Gen view2 and Gene builder, which are capable of predicting open reading frames or exons for a given set of sequences from the C. elegans genome. The predicted exons were compared with the exons identified by the 'sequencing consortium', and the degree of consensus among them is discussed. We found that exon prediction by Fex was closer to the consortium prediction than the Gen view2 and Gene builder results. Interestingly, some exons (six exons in five genes) predicted only by Fex and not by the 'sequencing consortium' were found in the C. elegans EST database. These data are critical for further debate and discussion on gene finding in C. elegans.

6.
7.
8.
Computer programs for eukaryotic gene prediction   Total citations: 3, self-citations: 0, citations by others: 3
Seven popular programs for gene prediction in eukaryotic organisms are described and evaluated on the basis of availability for in-house and on-line use and prediction accuracy. This report outlines generally applicable approaches to computational gene prediction and known limitations in this field.

9.
MOTIVATION: Many entity taggers and information extraction systems make use of lists of terms for entities such as people, places, genes or chemicals. These lists have traditionally been constructed manually. We show that distributional clustering methods, which group words based on the contexts that they appear in, including neighboring words and syntactic relations extracted using a shallow parser, can be used to aid in the construction of term lists. RESULTS: Experiments on learning lists of terms and using them as part of a gene tagger on a corpus of abstracts from the scientific literature show that our automatically generated term lists significantly boost the precision of a state-of-the-art CRF-based gene tagger to a degree that is competitive with using hand-curated lists, and boost recall to a degree that surpasses that of the hand-curated lists. Our results also show that these distributional clustering methods do not generate lists as helpful as those generated by supervised techniques, but that they can be used to complement supervised techniques so as to obtain better performance. AVAILABILITY: The code used in this paper is available from http://www.cis.upenn.edu/datamining/software_dist/autoterm/

10.
11.
We develop a method to predict and validate gene models using PacBio single-molecule, real-time (SMRT) cDNA reads. Ninety-eight percent of full-insert SMRT reads span complete open reading frames. Gene model validation using SMRT reads is developed as an automated process. Optimized training and prediction settings and mRNA-seq noise reduction of assisting Illumina reads result in increased gene prediction sensitivity and precision. Additionally, we present an improved gene set for sugar beet (Beta vulgaris) and the first genome-wide gene set for spinach (Spinacia oleracea). The workflow and guidelines are a valuable resource to obtain comprehensive gene sets for newly sequenced genomes of non-model eukaryotes.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0729-7) contains supplementary material, which is available to authorized users.

12.
The review describes several modules of the GeneExpress integrated computer system concerning the regulation of gene expression in eukaryotes. Approaches to the presentation of experimental data in databases are considered. The employment of GeneExpress in computer analysis and modeling of the organization and function of genetic systems is illustrated with examples. GeneExpress is available at http://wwwmgs.bionet.nsc.ru/mgs/gnw/.

13.
Interpolated Markov models for eukaryotic gene finding.   Total citations: 21, self-citations: 0, citations by others: 21
Computational gene finding research has emphasized the development of gene finders for bacterial and human DNA. This has left genome projects for some small eukaryotes without a system that addresses their needs. This paper reports on a new system, GlimmerM, that was developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene density in P. falciparum is relatively high, the system design was based on a successful bacterial gene finder, Glimmer. The system was augmented with specially trained modules to find splice sites and was trained on all available data from the P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time, laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed all of those predictions. With the rapid progress in sequencing the genome of P. falciparum, the availability of this new gene finder will greatly facilitate the annotation process.

14.
Hosono K, Sasaki T, Minoshima S, Shimizu N. Gene, 2004, 340(1): 31-43
During comprehensive sequence analysis of human chromosome 22, we identified a novel gene family consisting of five members (YPEL1 through YPEL5) which has high homology with the Drosophila yippee gene. We cloned and sequenced cDNAs for all five genes and determined their exon/intron organization. These YPEL genes showed high homology (43.8-96.6%) with one another at the amino acid sequence level. Mouse counterparts (Ypel1 through Ypel5) were also identified in the syntenic regions of mouse chromosomes, and their cDNAs were cloned and sequenced. Each of the five pairs of human/mouse orthologs revealed extremely high homology. Thus, we named these genes as members of the YPEL gene family. We searched the public databases for YPEL family genes and found 100 genes from 68 species including animals, plants and fungi. The amino acid sequences of these 100 YPEL proteins were extremely similar, and a consensus sequence of C-X(2)-C-X(19)-G-X(3)-L-X(5)-N-X(13)-G-X(8)-C-X(2)-C-X(4)-GWXY-X(10)-K-X(6)-E was established for all the YPEL family proteins without exception. Interestingly, indirect immunofluorescent staining indicated that YPEL1-4 proteins are localized to the centrosome and nucleolus during interphase and at several dot-like structures around the mitotic apparatus during the mitotic phase of COS-7 cells. YPEL5 protein is localized to the centrosome and nucleus during interphase and at the mitotic spindle during mitosis of the same cell line. Thus, the YPEL family proteins were found in essentially all the eukaryotes and hence they must play important roles in the maintenance of life. The subcellular localization of YPEL proteins in association with the centrosome or mitotic spindle suggests a novel function involved in cell division.
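
The consensus sequence quoted above is written in a PROSITE-like notation, where X(n) stands for n arbitrary residues. As a minimal sketch (not taken from the paper), it can be translated into a regular expression and used to scan a protein sequence; the function names `consensus_to_regex` and `matches_consensus` are hypothetical and introduced here for illustration only.

```python
import re

# Consensus reported in the abstract; X(n) means "any n residues".
CONSENSUS = (
    "C-X(2)-C-X(19)-G-X(3)-L-X(5)-N-X(13)-G-X(8)-"
    "C-X(2)-C-X(4)-GWXY-X(10)-K-X(6)-E"
)


def consensus_to_regex(pattern: str) -> str:
    """Translate the dash-separated consensus into a regex string (illustrative helper)."""
    parts = []
    for element in pattern.split("-"):
        gap = re.fullmatch(r"X\((\d+)\)", element)
        if gap:
            # X(n): exactly n residues of any kind
            parts.append(".{%s}" % gap.group(1))
        else:
            # Literal residues; a bare X inside a run (as in GWXY) is a single wildcard
            parts.append(element.replace("X", "."))
    return "".join(parts)


def matches_consensus(protein: str) -> bool:
    """Return True if the consensus occurs anywhere in the one-letter protein sequence."""
    return re.search(consensus_to_regex(CONSENSUS), protein) is not None
```

Calling `matches_consensus` on a one-letter amino acid string reports whether the spacing of the conserved cysteine, glycine, leucine, asparagine, lysine and glutamate residues is present; it says nothing about overall sequence similarity.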

15.
A bacterium (strain A1) isolated from a ditch synthesized three types of intracellular alginate lyases: A1-I (molecular weight [M.W.] 60,000), A1-II-2 (M.W. 25,000) and A1-III (M.W. 38,000). The nucleotide sequence of the gene for A1-I lyase, which has been cloned in Escherichia coli DH1, was determined. The open reading frame of the gene encoded 622 amino acids with a calculated M.W. of 69,153. The N-terminal amino acid sequence of A1-I lyase purified from strain A1 or E. coli DH1 cells transformed with the A1-I lyase gene was consistent with the deduced sequence from 55His to 74Ala, indicating that the A1-I lyase was synthesized as a precursor with a M.W. of 69,153 and then processed to a mature form with a M.W. of 63,681. The N-terminal sequence of the first twenty amino acids of A1-III lyase was found to match that of A1-I lyase. The N-terminal sequence of the first twenty amino acids of A1-II-2 lyase was consistent with the deduced amino acid sequence from 414Ala to 433Val in the nucleotide sequence of the A1-I lyase gene. These results indicated that the A1-I lyase was further processed to generate A1-II-2 and A1-III lyase species.

16.
17.
18.
The origins of eukaryotic gene structure   Total citations: 17, self-citations: 0, citations by others: 17
Most of the phenotypic diversity that we perceive in the natural world is directly attributable to the peculiar structure of the eukaryotic gene, which harbors numerous embellishments relative to the situation in prokaryotes. The most profound changes include introns that must be spliced out of precursor mRNAs, transcribed but untranslated leader and trailer sequences (untranslated regions), modular regulatory elements that drive patterns of gene expression, and expansive intergenic regions that harbor additional diffuse control mechanisms. Explaining the origins of these features is difficult because they each impose an intrinsic disadvantage by increasing the genic mutation rate to defective alleles. To address these issues, a general hypothesis for the emergence of eukaryotic gene structure is provided here. Extensive information on absolute population sizes, recombination rates, and mutation rates strongly supports the view that eukaryotes have reduced genetic effective population sizes relative to prokaryotes, with especially extreme reductions being the rule in multicellular lineages. The resultant increase in the power of random genetic drift appears to be sufficient to overwhelm the weak mutational disadvantages associated with most novel aspects of the eukaryotic gene, supporting the idea that most such changes are simple outcomes of semi-neutral processes rather than direct products of natural selection. However, by establishing an essentially permanent change in the population-genetic environment permissive to the genome-wide repatterning of gene structure, the eukaryotic condition also promoted a reliable resource from which natural selection could secondarily build novel forms of organismal complexity. Under this hypothesis, arguments based on molecular, cellular, and/or physiological constraints are insufficient to explain the disparities in gene, genomic, and phenotypic complexity between prokaryotes and eukaryotes.

19.
BB Hanberry, HS He, BJ Palik. PloS one, 2012, 7(8): e44486

Background

Species distribution models require selection of species, study extent and spatial unit, statistical methods, variables, and assessment metrics. If absence data are not available, another important consideration is pseudoabsence generation. Different strategies for pseudoabsence generation can produce varying spatial representation of species.

Methodology

We considered model outcomes from four different strategies for generating pseudoabsences. We generated pseudoabsences randomly by 1) selection from the entire study extent, 2) a two-step process of selection first from the entire study extent, followed by selection for pseudoabsences from areas with predicted probability <25%, 3) selection from plots surveyed without detection of species presence, 4) a two-step process of selection first for pseudoabsences from plots surveyed without detection of species presence, followed by selection for pseudoabsences from the areas with predicted probability <25%. We used Random Forests as our statistical method and sixteen predictor variables to model tree species with at least 150 records from Forest Inventory and Analysis surveys in the Laurentian Mixed Forest province of Minnesota.
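
The sampling logic behind the four strategies can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: `cells` stands for whatever candidate pool a strategy uses (the entire study extent, or plots surveyed without detection), and `predict` is a hypothetical stand-in for the first-step model that returns a predicted probability of presence for a cell.

```python
import random


def random_pseudoabsences(cells, n, rng):
    """One-step strategies (1 and 3): draw n pseudoabsences at random from the pool."""
    return rng.sample(cells, n)


def two_step_pseudoabsences(cells, n, predict, rng, threshold=0.25):
    """Two-step strategies (2 and 4): resample only where the first-step
    model predicts a probability of presence below the threshold (<25%)."""
    # `predict` is assumed to come from a model fitted on first-step pseudoabsences
    low_probability = [c for c in cells if predict(c) < threshold]
    # Never request more cells than the filtered pool contains
    return rng.sample(low_probability, min(n, len(low_probability)))
```

Strategies 1 and 2 would pass the whole study extent as `cells`, while strategies 3 and 4 would pass only the surveyed plots without detections; the two-step variants differ only in the extra probability filter.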

Conclusions

Pseudoabsence generation strategy completely affected the area predicted as present for species distribution models and may be one of the most influential determinants of models. All the pseudoabsence strategies produced mean AUC values of at least 0.87. More importantly than accuracy metrics, the two-step strategies over-predicted species presence, due to too much environmental distance between the pseudoabsences and recorded presences, whereas models based on random pseudoabsences under-predicted species presence, due to too little environmental distance between the pseudoabsences and recorded presences. Models using pseudoabsences from surveyed plots produced a balance between areas with high and low predicted probabilities and the strongest relationship between density and area with predicted probabilities ≥75%. Because of imperfect accuracy assessment, the best assessment currently may be evaluation of whether the species has been sufficiently but not excessively predicted to occur.

20.
Structural basis of eukaryotic gene transcription   Total citations: 7, self-citations: 0, citations by others: 7