Similar literature
1.
GeneMark.hmm: new solutions for gene finding.   Total citations: 35 (self-citations: 0; citations by others: 35)
The number of completely sequenced bacterial genomes has been growing fast. Computer methods for finding genes are available, yet there is still a need for more accurate algorithms. The GeneMark.hmm algorithm presented here was designed to improve the gene prediction quality in terms of finding exact gene boundaries. The idea was to embed the GeneMark models into a naturally derived hidden Markov model framework with gene boundaries modeled as transitions between hidden states. We also used the specially derived ribosome binding site pattern to refine predictions of translation initiation codons. The algorithm was evaluated on several test sets including 10 complete bacterial genomes. It was shown that the new algorithm is significantly more accurate than GeneMark in exact gene prediction. Interestingly, the high gene finding accuracy was observed even in the case when Markov models of order zero, one and two were used. We present the analysis of false positive and false negative predictions with the caution that these categories are not precisely defined if the public database annotation is used as a control.
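The idea of modeling gene boundaries as transitions between hidden states can be illustrated with a toy two-state hidden Markov model decoded by the Viterbi algorithm. This is a minimal sketch and not the GeneMark.hmm model itself: the states `N` (non-coding) and `C` (coding), every probability below, and the helper names `viterbi` and `boundaries` are assumptions made up for illustration.

```python
import math

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path that ends in state s at position t
    best = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = best[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t][s] = prev
    # Trace the best path backwards from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def boundaries(path):
    """Positions where the decoded state changes, i.e. predicted boundaries."""
    return [i for i in range(1, len(path)) if path[i] != path[i - 1]]

# Hypothetical toy parameters: non-coding is AT-rich, coding is GC-rich.
STATES = ["N", "C"]
LOG_START = {s: math.log(0.5) for s in STATES}
LOG_TRANS = {
    "N": {"N": math.log(0.9), "C": math.log(0.1)},
    "C": {"N": math.log(0.1), "C": math.log(0.9)},
}
LOG_EMIT = {
    "N": {b: math.log(p) for b, p in {"A": 0.4, "T": 0.4, "G": 0.1, "C": 0.1}.items()},
    "C": {b: math.log(p) for b, p in {"A": 0.1, "T": 0.1, "G": 0.4, "C": 0.4}.items()},
}

toy_path = viterbi("ATATATGCGCGCGC", STATES, LOG_START, LOG_TRANS, LOG_EMIT)
toy_boundaries = boundaries(toy_path)
```

In this sketch a boundary is simply an index at which the decoded state switches. A real gene finder would use higher-order, frame-aware emission models and explicit start and stop codon states.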

2.
3.
4.
Parametric methods for identifying laterally transferred genes exploit the directional mutational biases unique to each genome. Yet the development of new, more robust methods (as well as the evaluation and proper implementation of existing methods) relies on an arbitrary assessment of performance using real genomes, where the evolutionary histories of genes are not known. We have used the framework of a generalized hidden Markov model to create artificial genomes modeled after genuine genomes. To model a genome, “core” genes (those displaying patterns of mutational biases shared among large numbers of genes) are identified by a novel gene clustering approach based on the Akaike information criterion. Gene models derived from multiple “core” gene clusters are used to generate an artificial genome that models the properties of a genuine genome. Chimeric artificial genomes, representing those having experienced lateral gene transfer, were created by combining genes from multiple artificial genomes, and the performance of the parametric methods for identifying “atypical” genes was assessed directly. We found that a hidden Markov model that included multiple gene models, each trained on sets of genes representing the range of genotypic variability within a genome, could produce artificial genomes that mimicked the properties of genuine genomes. Moreover, different methods for detecting foreign genes performed differently (i.e., they had different sets of strengths and weaknesses) when identifying atypical genes within chimeric artificial genomes.

5.
6.
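Detection of deletions and insertions in coding sequences based on a hidden Markov model (in English)   Total citations: 2 (self-citations: 0; citations by others: 2)
Since the completion of genome sequencing projects, the use of computational tools for gene identification and gene structure prediction has attracted growing attention. A large number of programs, such as GenScan, Genemark and GRAIL, have been developed, and they provide important clues for finding new genes. However, the problem of gene identification and prediction has not been fully solved: when the coding sequence of a target gene contains deletions or insertions, the predicted structure differs greatly from the actual structure of the gene. To remove the effect of sequencing errors on prediction, it is desirable to locate sequencing errors within coding regions. With this aim, the authors use statistical properties of DNA sequences and a hidden Markov model (HMM) extended with deletion and insertion states, and then apply the Viterbi algorithm to find exon segments that contain deletions and insertions. Tested on the commonly used Burset/Guigo test set, the method achieves a sensitivity (Sn) and specificity (Sp) above 84% at the exon level.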
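The exon-level Sn and Sp figures quoted above can be made concrete with a small calculation. The sketch below assumes the usual exon-level convention from gene-finder benchmarks, in which a predicted exon counts as correct only if both of its boundaries match an annotated exon, Sn is the fraction of annotated exons predicted exactly, and Sp is the fraction of predicted exons that are exactly right. The function name and the coordinates are hypothetical.

```python
def exon_level_accuracy(actual_exons, predicted_exons):
    """Return (Sn, Sp) at the exon level for (start, end) coordinate pairs."""
    actual = set(actual_exons)
    predicted = set(predicted_exons)
    correct = len(actual & predicted)  # exons with both boundaries exact
    sn = correct / len(actual) if actual else 0.0
    sp = correct / len(predicted) if predicted else 0.0
    return sn, sp

# Hypothetical example: one of two predictions matches an annotated exon exactly.
sn, sp = exon_level_accuracy(
    actual_exons=[(1, 10), (20, 30), (40, 50)],
    predicted_exons=[(1, 10), (20, 31)],
)
```

Under this convention a prediction that is off by a single base, such as `(20, 31)` against `(20, 30)`, counts as wrong, which is why an undetected insertion or deletion that shifts a boundary lowers both measures.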

7.
Fouts DE. Nucleic Acids Research, 2006, 34(20): 5839-5851
Phage_Finder, a heuristic computer program, was created to identify prophage regions in completed bacterial genomes. Using a test dataset of 42 bacterial genomes whose prophages have been manually identified, Phage_Finder found 91% of the regions, resulting in 7% false positive and 9% false negative prophages. A search of 302 complete bacterial genomes predicted 403 putative prophage regions, accounting for 2.7% of the total bacterial DNA. Analysis of the 285 putative attachment sites revealed tRNAs are targets for integration slightly more frequently (33%) than intergenic (31%) or intragenic (28%) regions, while tmRNAs were targeted in 8% of the regions. The most popular tRNA targets were Arg, Leu, Ser and Thr. Mapping of the insertion point on a consensus tRNA molecule revealed novel insertion points on the 5' side of the D loop, the 3' side of the anticodon loop and the anticodon. A novel method of constructing phylogenetic trees of phages and prophages was developed based on the mean of the BLAST score ratio (BSR) of the phage/prophage proteomes. This method verified many known bacteriophage groups, making this a useful tool for predicting the relationships of prophages from bacterial genomes.
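The mean BLAST score ratio mentioned above can be sketched as follows. The abstract does not say how the mean is computed, so this block makes assumptions: the BSR of a protein is taken as its best BLAST score against the other proteome divided by its score against itself, proteins with no hit contribute 0, and the scores are supplied as plain dictionaries rather than parsed from BLAST output. The function name and the numbers are hypothetical.

```python
def mean_bsr(self_scores, match_scores):
    """Mean BLAST score ratio of one proteome against another.

    self_scores:  protein id -> BLAST score of the protein against itself
    match_scores: protein id -> best BLAST score against the other proteome
                  (proteins without a hit may be omitted)
    """
    if not self_scores:
        raise ValueError("proteome has no proteins")
    ratios = [
        match_scores.get(protein, 0.0) / self_score
        for protein, self_score in self_scores.items()
    ]
    return sum(ratios) / len(ratios)

# Hypothetical scores: p1 has a half-strength match, p2 has no match at all.
example_mean = mean_bsr({"p1": 200.0, "p2": 100.0}, {"p1": 100.0})
```

A value near 1 suggests two nearly identical proteomes and a value near 0 suggests little shared protein content. Turning such values into a tree would need a distance transform and a clustering method, neither of which is specified in the abstract.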

8.
A complete and high‐quality genome reference sequence of an organism provides a solid foundation for a wide research community and determines the outcomes of relevant genomic, genetic, molecular and evolutionary research. Rice is an important food crop and a model plant for grasses, and therefore was the first chosen crop plant for whole genome sequencing. The genome of the japonica representative rice variety, Nipponbare, was sequenced using a gold standard, map‐based clone‐by‐clone strategy. However, although the Nipponbare reference sequence (RefSeq) has the best quality for existing crop genome sequences, it still contains many assembly errors and gaps. To improve the Nipponbare RefSeq, first a robust method is required to detect the hidden assembly errors. Through alignments between BAC‐end sequences (BESs) embedded in the Nipponbare bacterial artificial chromosome (BAC) physical map and the Nipponbare RefSeq, we detected locations on the Nipponbare RefSeq that were inversely matched with BESs and could therefore be candidates for spurious inversions of assembly. We performed further analysis of five potential locations and confirmed assembly errors at those locations; four of them, two on chr4 and two on chr11 of the Nipponbare RefSeq (IRGSP build 5), were found to be caused by reverse repetitive sequences flanking the locations. Our approach is effective in detecting spurious inversions in the Nipponbare RefSeq and can be applied for improving the sequence qualities of other genomes as well.

9.
Oligonucleotide usage in archaeal and bacterial genomes can be linked to a number of properties, including codon usage (trinucleotides), DNA base-stacking energy (dinucleotides), and DNA structural conformation (di- to tetranucleotides). We wanted to assess the statistical information potential of different DNA ‘word-sizes’ and explore how oligonucleotide frequencies differ in coding and non-coding regions. In addition, we used oligonucleotide frequencies to investigate DNA composition and how DNA sequence patterns change within and between prokaryotic organisms. Among the results found was that prokaryotic chromosomes can be described by hexanucleotide frequencies, suggesting that prokaryotic DNA is predominantly short range correlated, i.e., information in prokaryotic genomes is encoded in short oligonucleotides. Oligonucleotide usage varied more within AT-rich and host-associated genomes than in GC-rich and free-living genomes, and this variation was mainly located in non-coding regions. Bias (selectional pressure) in tetranucleotide usage correlated with GC content, and coding regions were more biased than non-coding regions. Non-coding regions were also found to be approximately 5.5% more AT-rich than coding regions, on average, in the 402 chromosomes examined. Pronounced DNA compositional differences were found both within and between AT-rich and GC-rich genomes. GC-rich genomes were more similar and biased in terms of tetranucleotide usage in non-coding regions than AT-rich genomes. The differences found between AT-rich and GC-rich genomes may possibly be attributed to lifestyle, since tetranucleotide usage within host-associated bacteria was, on average, more dissimilar and less biased than free-living archaea and bacteria.
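The oligonucleotide ("word") frequencies that the abstract analyses can be tabulated with a sliding window. The following is a minimal sketch: the function name is hypothetical, overlapping windows are counted on a single strand, and windows containing ambiguous bases are skipped, which may differ from the counting scheme used in the study.

```python
from collections import Counter

def kmer_frequencies(seq, k):
    """Relative frequencies of overlapping k-mers on one strand of a DNA sequence."""
    seq = seq.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + k]
        for i in range(len(seq) - k + 1)
        if set(seq[i:i + k]) <= valid  # skip windows with N or other ambiguity codes
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# Dinucleotide (k=2) frequencies of a short hypothetical sequence.
dinucleotides = kmer_frequencies("ACGTAC", 2)
```

Calling the same function with `k=4` or `k=6` gives the tetranucleotide and hexanucleotide profiles discussed in the abstract. Comparing profiles computed separately for coding and non-coding regions is one way to look at the differences it reports.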

10.
Transposon-insertion sequencing (TIS) is a powerful approach for deciphering genetic requirements for bacterial growth in different conditions, as it enables simultaneous genome-wide analysis of the fitness of thousands of mutants. However, current methods for comparative analysis of TIS data do not adjust for stochastic experimental variation between datasets and are limited to interrogation of annotated genomic elements. Here, we present ARTIST, an accessible TIS analysis pipeline for identifying essential regions that are required for growth under optimal conditions as well as conditionally essential loci that participate in survival only under specific conditions. ARTIST uses simulation-based normalization to model and compensate for experimental noise, and thereby enhances the statistical power in conditional TIS analyses. ARTIST also employs a novel adaptation of the hidden Markov model to generate statistically robust, high-resolution, annotation-independent maps of fitness-linked loci across the entire genome. Using ARTIST, we sensitively and comprehensively define Mycobacterium tuberculosis and Vibrio cholerae loci required for host infection while limiting inclusion of false positive loci. ARTIST is applicable to a broad range of organisms and will facilitate TIS-based dissection of pathways required for microbial growth and survival under a multitude of conditions.

11.
The ever-growing number of completely sequenced prokaryotic genomes facilitates cross-species comparisons by genomic annotation algorithms. This paper introduces a new probabilistic framework for comparative genomic analysis and demonstrates its utility in the context of improving the accuracy of prokaryotic gene start site detection. Our framework employs a product hidden Markov model (PROD-HMM) with state architecture to model the species-specific trinucleotide frequency patterns in sequences immediately upstream and downstream of a translation start site and to detect the contrasting non-synonymous (amino acid changing) and synonymous (silent) substitution rates that differentiate prokaryotic coding from intergenic regions. Depending on the intricacy of the features modeled by the hidden state architecture, intergenic, regulatory, promoter and coding regions can be delimited by this method. The new system is evaluated using a preliminary set of orthologous Pyrococcus gene pairs, for which it demonstrates an improved accuracy of detection. Its robustness is confirmed by analysis with cross-validation of an experimentally verified set of Escherichia coli K-12 and Salmonella typhimurium LT2 orthologs. The novel architecture has a number of attractive features that distinguish it from previous comparative models such as pair-HMMs.

12.
Similarity between related genomes may carry information on selective constraint in each of them. We analysed patterns of similarity between several homologous regions of Caenorhabditis elegans and C. briggsae genomes. All homologous exons are quite similar. Alignments of introns and of intergenic sequences contain long gaps, segments where similarity is low and close to that between random sequences aligned using the same parameters, and segments of high similarity. Conservative estimates of the fractions of selectively constrained nucleotides are 72%, 17% and 18% for exons, introns and intergenic sequences, respectively. This implies that the total number of constrained nucleotides within non-coding sequences is comparable to that within coding sequences, so that at least one-third of nucleotides in C. elegans and C. briggsae genomes are under strong stabilizing selection.

13.
We present an independent evaluation of six recent hidden Markov model (HMM) genefinders. Each was tested on the new dataset (FSH298), the results of which showed no dramatic improvement over the genefinders tested five years ago. In addition, we introduce a comprehensive taxonomy of predicted exons and classify each resulting exon accordingly. These results are useful in measuring (with finer granularity) the effects of changes in a genefinder. We present an analysis of these results and identify four patterns of inaccuracy common in all HMM-based results.

14.
We have used a hidden Markov model (HMM) to identify the consensus sequence of the RpoD promoters in the genome of Campylobacter jejuni. The identified promoter consensus sequence is unusual compared to other bacteria, in that the region upstream of the TATA-box does not contain a conserved -35 region, but shows a very strong periodic variation in the AT-content and semi-conserved T-stretches, with a period of 10-11 nucleotides. The TATA-box is in some, but not all, cases preceded by a TGx, similar to an extended -10 promoter. We predicted a total of 764 presumed RpoD promoters in the C. jejuni genome, of which 654 were located upstream of annotated genes. A similar promoter was identified in Helicobacter pylori, a close phylogenetic relative of Campylobacter, but not in Escherichia coli, Vibrio cholerae, or six other Proteobacterial genomes, or in Staphylococcus aureus. We used upstream regions of high confidence genes as training data (n=529, for the C. jejuni genome). We found it necessary to limit the training set to genes that are preceded by an intergenic region of >100 bp or by a gene oriented in the opposite direction to be able to identify a conserved sequence motif, and ended up with a training set of 175 genes. This leads to the conclusion that the remaining genes (354) are more rarely preceded by a (RpoD) promoter, and consequently that operon structure may be more widespread in C. jejuni than has been assumed by others. Structural predictions of the regions upstream of the TATA-box indicate a region of highly curved DNA, and we assume that this facilitates the wrapping of the DNA around the RNA polymerase holoenzyme, and offsets the absence of a conserved -35 binding motif.

15.
16.

Background  

Enrichment of loci by DNA hybridization-capture, followed by high-throughput sequencing, is an important tool in modern genetics. Currently, the most common targets for enrichment are the protein coding exons represented by the consensus coding DNA sequence (CCDS). The CCDS, however, excludes many actual or computationally predicted coding exons present in other databases, such as RefSeq and Vega, and non-coding functional elements such as untranslated and regulatory regions. The number of variants per base pair (variant density) and our ability to interrogate regions outside of the CCDS regions is consequently less well understood.

17.
The split structure of most mammalian protein-coding genes allows for the potential to produce multiple different mRNA and protein isoforms from a single gene locus through the process of alternative splicing (AS). We propose a computational approach called UNCOVER based on a pair hidden Markov model to discover conserved coding exonic sequences subject to AS that have so far gone undetected. Applying UNCOVER to orthologous introns of known human and mouse genes predicts skipped exons or retained introns present in both species, while discriminating them from conserved noncoding sequences. The accuracy of the model is evaluated on a curated set of genes with known conserved AS events. The prediction of skipped exons in the approximately 1% of the human genome represented by the ENCODE regions leads to more than 50 new exon candidates. Five novel predicted AS exons were validated by RT-PCR and sequencing analysis of 15 introns with strong UNCOVER predictions and lacking EST evidence. These results imply that a considerable number of conserved exonic sequences and associated isoforms are still completely missing from the current annotation of known genes. UNCOVER also identifies a small number of candidates for conserved intron retention.

18.
Over the past decade, there has been an explosion of interest in searching for novel regulatory elements in the intergenic regions between protein-coding regions. Microbial genomes are the most exploited in terms of intergenic (noncoding) regions because of their lower complexity. We think the increasing pace of genome sequencing calls for a tool for extracting intergenic regions. IntergenicS (Intergenic Sequence) is a tool that can extract the intergenic regions of microbial genomes at NCBI. All the unannotated regions between annotated protein-coding genes and noncoding RNA genes can be extracted. It also calculates the GC base composition of the intergenic regions. This will be a useful tool for the analysis of noncoding regions of both bacterial and archaeal genomes.
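The two operations described above, extracting the unannotated stretches between annotated features and computing their GC composition, can be sketched in a few lines. This is not the IntergenicS implementation: the function names are hypothetical, features are assumed to be 0-based half-open `(start, end)` intervals already read from an annotation, and the chromosome is treated as linear, so a region spanning the origin of a circular genome would be reported as two pieces.

```python
def intergenic_regions(genome_length, features):
    """Return the (start, end) gaps not covered by any annotated feature."""
    regions = []
    cursor = 0  # end of the furthest-reaching feature seen so far
    for start, end in sorted(features):
        if start > cursor:
            regions.append((cursor, start))
        cursor = max(cursor, end)  # handles overlapping and nested features
    if cursor < genome_length:
        regions.append((cursor, genome_length))
    return regions

def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty sequence)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0

# Hypothetical 14-base "genome" with three annotated features, two overlapping.
genome = "ATGCGCATATGGCC"
gaps = intergenic_regions(len(genome), [(2, 6), (4, 8), (10, 12)])
gap_gc = [gc_content(genome[start:end]) for start, end in gaps]
```

Sorting the features and tracking the furthest end seen so far keeps the scan linear after the sort and avoids reporting a gap inside overlapping genes.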

19.
Genetic heterogeneity in a mixed sample of tumor and normal DNA can confound characterization of the tumor genome. Numerous computational methods have been proposed to detect aberrations in DNA samples from tumor and normal tissue mixtures. Most of these require tumor purities to be at least 10–15%. Here, we present a statistical model to capture information, contained in the individual's germline haplotypes, about expected patterns in the B allele frequencies from SNP microarrays while fully modeling their magnitude, the first such model for SNP microarray data. Our model consists of a pair of hidden Markov models (one for the germline and one for the tumor genome) which, conditional on the observed array data and patterns of population haplotype variation, have a dependence structure induced by the relative imbalance of an individual's inherited haplotypes. Together, these hidden Markov models offer a powerful approach for dealing with mixtures of DNA where the main component represents the germline, thus suggesting natural applications for the characterization of primary clones when stromal contamination is extremely high, and for identifying lesions in rare subclones of a tumor when tumor purity is sufficient to characterize the primary lesions. Our joint model for germline haplotypes and acquired DNA aberration is flexible, allowing a large number of chromosomal alterations, including balanced and imbalanced losses and gains, copy-neutral loss-of-heterozygosity (LOH) and tetraploidy. We found our model (which we term J-LOH) to be superior for localizing rare aberrations in a simulated 3% mixture sample. More generally, our model provides a framework for full integration of the germline and tumor genomes to deal more effectively with missing or uncertain features, and thus extract maximal information from difficult scenarios where existing methods fail.

20.
Since a large number of computationally predicted exons are not supported by existing sequence (e.g. ESTs) or experimental (e.g. expression analysis) data, they need to be validated by other methods. ETOPE is designed to test computational predictions by using signals that have not been included in any current computational prediction method. The test is based on the ratio of non-synonymous to synonymous substitution rates between sequences from different genomes. This ratio has previously been shown, by empirical data and computer simulation, to be a powerful criterion for identifying protein-coding regions. ETOPE is available at http://nekrut.uchicago.edu/etope/.
