Similar literature
1.
Exhaustive gene identification is a fundamental goal in all metagenomics projects. However, most metagenomic sequences are unassembled anonymous fragments, and conventional gene-finding methods cannot be applied. We have developed a prokaryotic gene-finding program, MetaGene, which utilizes di-codon frequencies estimated by the GC content of a given sequence, along with various other measures. MetaGene can predict a whole range of prokaryotic genes based on the anonymous genomic sequences of a few hundred bases, with a sensitivity of 95% and a specificity of 90% for artificial shotgun sequences (700 bp fragments from 12 species). MetaGene has two sets of codon frequency interpolations, one for bacteria and one for archaea, and automatically selects the proper set for a given sequence using the domain classification method we propose. The domain classification works properly, correctly assigning domain information to more than 90% of the artificial shotgun sequences. Applied to the Sargasso Sea dataset, MetaGene predicted almost all of the annotated genes and a notable number of novel genes. MetaGene can be applied to a wide variety of metagenomic projects and expands the utility of metagenomics.
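The two quantities this abstract relies on, the GC content of a fragment and its in-frame di-codon (hexamer) counts, can be illustrated with a minimal sketch. This is not MetaGene's scoring scheme; the function names `gc_content` and `dicodon_counts` are hypothetical and only show how the raw statistics could be gathered.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of seq."""
    seq = seq.upper()
    gc = sum(seq.count(base) for base in "GC")
    acgt = sum(seq.count(base) for base in "ACGT")
    return gc / acgt if acgt else 0.0


def dicodon_counts(seq, frame=0):
    """Count in-frame di-codons (two adjacent codons, 6 bases) in one reading frame.

    Hypothetical helper: the window moves one codon (3 bases) at a time,
    and hexamers containing ambiguous bases are skipped.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 5, 3):
        hexamer = seq[i:i + 6]
        if set(hexamer) <= set("ACGT"):
            counts[hexamer] += 1
    return counts
```

Dividing each count by the total number of di-codons gives the frequencies; how those frequencies are tied to GC content is specific to MetaGene and is not reproduced here.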

2.
Computational methods seeking to automatically determine the properties (functional, structural, physicochemical, etc.) of a protein directly from the sequence have long been the focus of numerous research groups. With the advent of advanced sequencing methods and systems, the number of amino acid sequences that are being deposited in the public databases has been increasing steadily. This has in turn generated a renewed demand for automated approaches that can annotate individual sequences and complete genomes quickly, exhaustively and objectively. In this paper, we present one such approach that is centered around and exploits the Bio-Dictionary, a collection of amino acid patterns that completely covers the natural sequence space and can capture functional and structural signals that have been reused during evolution, within and across protein families. Our annotation approach also makes use of a weighted, position-specific scoring scheme that is unaffected by the over-representation of well-conserved proteins and protein fragments in the databases used. For a given query sequence, the method permits one to determine, in a single pass, the following: local and global similarities between the query and any protein already present in a public database; the likeness of the query to all available archaeal/bacterial/eukaryotic/viral sequences in the database as a function of amino acid position within the query; the character of secondary structure of the query as a function of amino acid position within the query; the cytoplasmic, transmembrane or extracellular behavior of the query; the nature and position of binding domains, active sites, post-translationally modified sites, signal peptides, etc. In terms of performance, the proposed method is exhaustive, objective and allows for the rapid annotation of individual sequences and full genomes.
Annotation examples are presented and discussed in Results, including individual queries and complete genomes that were released publicly after we built the Bio-Dictionary that is used in our experiments. Finally, we have computed the annotations of more than 70 complete genomes and made them available on the World Wide Web at http://cbcsrv.watson.ibm.com/Annotations/.

3.
4.
It is known that while the programs used to find genes in prokaryotic genomes reliably map protein-coding regions, they often fail in the exact determination of gene starts. This problem is further aggravated by sequencing errors, most notably insertions and deletions leading to frame-shifts. Therefore, the exact mapping of gene starts and identification of frame-shifts are important problems of the computer-assisted functional analysis of newly sequenced genomes. Here we review methods of gene recognition and describe a new algorithm for correction of gene starts and identification of frame-shifts in prokaryotic genomes. The algorithm is based on the comparison of nucleotide and protein sequences of homologous genes from related organisms, using the assumption that the rate of evolutionary changes in protein-coding regions is lower than that in non-coding regions. A dynamic programming algorithm is used to align protein sequences obtained by formal translation of genomic nucleotide sequences. The possibility of frame-shifts is taken into account. The algorithm was tested on several groups of related organisms: gamma-proteobacteria, the Bacillus/Clostridium group, and three Pyrococcus genomes. The testing demonstrated that, depending on the genome, 1-10 per cent of genes have incorrect starts or contain frame-shifts. The algorithm is implemented in the program package Orthologator-GeneCorrector.

5.
Connected gene neighborhoods in prokaryotic genomes   Total citations: 11; self-citations: 1; citations by others: 11
A computational method was developed for delineating connected gene neighborhoods in bacterial and archaeal genomes. These gene neighborhoods are not typically present, in their entirety, in any single genome, but are held together by overlapping, partially conserved gene arrays. The procedure was applied to comparing the orders of orthologous genes, which were extracted from the database of Clusters of Orthologous Groups of proteins (COGs), in 31 prokaryotic genomes and resulted in the identification of 188 clusters of gene arrays, which included 1001 of 2890 COGs. These clusters were projected onto actual genomes to produce extended neighborhoods including additional genes, which are adjacent to the genes from the clusters and are transcribed in the same direction, which resulted in a total of 2387 COGs being included in the neighborhoods. Most of the neighborhoods consist predominantly of genes united by a coherent functional theme, but also include a minority of genes without an obvious functional connection to the main theme. We hypothesize that although some of the latter genes might have unsuspected roles, others are maintained within gene arrays because of the advantage of expression at a level that is typical of the given neighborhood. We designate this phenomenon ‘genomic hitchhiking’. The largest neighborhood includes 79 genes (COGs) and consists of overlapping, rearranged ribosomal protein superoperons; apparent genome hitchhiking is particularly typical of this neighborhood and other neighborhoods that consist of genes coding for translation machinery components. Several neighborhoods involve previously undetected connections between genes, allowing new functional predictions. Gene neighborhoods appear to evolve via complex rearrangement, with different combinations of genes from a neighborhood fixed in different lineages.

6.
Computational gene finding in plants   Total citations: 10; self-citations: 0; citations by others: 10

7.
PRODORIC: prokaryotic database of gene regulation   Total citations: 11; self-citations: 0; citations by others: 11

8.
9.
Glucanase gene diversity in prokaryotic and eukaryotic organisms   Total citations: 4; self-citations: 0; citations by others: 4
A number of bacteria and eukaryotes produce extracellular enzymes that degrade various types of polysaccharides including the glucans starch, cellulose and hemicellulose (xylan). The similarities in the modes of expression and specificity of enzyme classes, such as amylase, cellulase and xylanase, suggest common genetic origins for particular activities. Our determination of the extent of similarity between these glucanases suggests that such data may be of very limited use in describing the early evolution of these proteins. The great diversity of these proteins does allow identification of their most highly conserved (and presumably functionally important) regions.

10.
11.
As more investigators conduct extensive whole-genome linkage scans for complex traits, interest is growing in meta-analysis as a way of integrating the weak or conflicting evidence from multiple studies. However, there is a bias in the most commonly used meta-analysis linkage technique (i.e., Fisher's [1925] method of combining P values) when it is applied to many nonparametric (i.e., model-free) linkage results. The bias arises in those methods (e.g., variance components, affected sib pair, extremely discordant sib pairs, etc.) that truncate all "negative evidence against linkage" into the single value of LOD = 0. If incorrectly handled, this bias can artificially inflate or deflate the combined meta-analysis linkage results for any given locus. This is an especially troublesome problem in the context of a genome scan, since LOD = 0 is expected to occur over half the unlinked genome. The bias can be overcome (nearly) completely by simply interpreting LOD = 0 as a P value of 1/(2 ln 2) ≈ 0.72 in Fisher's formula.
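The correction described in this abstract can be sketched in a few lines. Fisher's statistic is X² = -2 Σ ln P_i, compared against a chi-square distribution with 2k degrees of freedom for k studies. The substitution of 1/(2 ln 2) for LOD = 0 is taken from the abstract; the conversion of a positive LOD score to a one-sided P value (chi-square with 1 degree of freedom, halved) is a common convention assumed here, not something the abstract specifies, and the function names are hypothetical.

```python
import math

# P value substituted for a truncated LOD = 0 result (value quoted in the abstract).
P_LOD_ZERO = 1.0 / (2.0 * math.log(2.0))  # about 0.72


def lod_to_p(lod):
    """One-sided P value for a LOD score.

    Assumption: 2 ln(10) * LOD is treated as chi-square with 1 df and the
    tail probability is halved; LOD <= 0 maps to P_LOD_ZERO.
    """
    if lod <= 0.0:
        return P_LOD_ZERO
    chi2 = 2.0 * math.log(10.0) * lod
    return 0.5 * math.erfc(math.sqrt(chi2 / 2.0))


def fisher_combined_p(p_values):
    """Combined P value from Fisher's method, X^2 = -2 * sum(ln p), 2k df.

    For an even number of degrees of freedom the chi-square tail has the
    closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no SciPy is needed.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X^2 / 2
    term, total = 1.0, 0.0
    for i in range(k):
        total += term
        term *= half_stat / (i + 1)
    return math.exp(-half_stat) * total
```

A call such as `fisher_combined_p([lod_to_p(x) for x in (1.2, 0.0, 0.8)])` would then combine three studies at one locus, with the middle, truncated result entering as about 0.72 rather than as 0.5 or 1.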

12.
Large-scale prokaryotic gene prediction and comparison to genome annotation   Total citations: 4; self-citations: 0; citations by others: 4
MOTIVATION: Prokaryotic genomes are sequenced and annotated at an increasing rate. The methods of annotation vary between sequencing groups. It makes genome comparison difficult and may lead to propagation of errors when questionable assignments are adapted from one genome to another. Genome comparison either on a large or small scale would be facilitated by using a single standard for annotation, which incorporates a transparency of why an open reading frame (ORF) is considered to be a gene. RESULTS: A total of 143 prokaryotic genomes were scored with an updated version of the prokaryotic genefinder EasyGene. Comparison of the GenBank and RefSeq annotations with the EasyGene predictions reveals that in some genomes up to approximately 60% of the genes may have been annotated with a wrong start codon, especially in the GC-rich genomes. The fractional difference between annotated and predicted confirms that too many short genes are annotated in numerous organisms. Furthermore, genes might be missing in the annotation of some of the genomes. We predict 41 of 143 genomes to be over-annotated by >5%, meaning that too many ORFs are annotated as genes. We also predict that 12 of 143 genomes are under-annotated. These results are based on the difference between the number of annotated genes not found by EasyGene and the number of predicted genes that are not annotated in GenBank. We argue that the average performance of our standardized and fully automated method is slightly better than the annotation.

13.
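Objective: To explore a method for obtaining mouse Lin28 protein. Methods: mRNA was extracted from 8.5-day ICR mouse embryos and reverse-transcribed into cDNA. Using a pair of primers that introduce specific restriction sites (NcoI and XhoI) at the two ends, the coding region of the Lin28 gene was amplified from this cDNA and cloned into the pMD18-T vector. The plasmid was double-digested, and the recovered Lin28 gene fragment was ligated into the pET-30a(+) vector and transformed into Escherichia coli Rosetta(DE3). Expression was induced with IPTG, and the result was analyzed by SDS-PAGE. Results: Sequence analysis of the cloned Lin28 protein-coding region showed that the Lin28 CDS, including the stop codon, is 630 bp, with 99.37% homology to the reference DNA (NM145833) and 100% homology to the reference amino acid sequence. Under IPTG induction, the pET-30a(+)-Lin28 recombinant plasmid expressed a protein with a molecular mass of approximately 27.5×10³, as expected. Conclusion: Using the cloned mouse Lin28 gene and a prokaryotic expression approach, mouse Lin28 protein was successfully obtained, laying a foundation for further research on reprogramming somatic cells with recombinant proteins.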

14.
Comparative genometrics of microorganisms is a relatively new area, in which genome properties are translated into numerical indexes. Such indexes can be used for a comprehensive and comparative analysis of microbial genomes, contributing to the understanding of their evolution. This work presents a new method for quantitative determination of gene strand bias in prokaryotic chromosomes, in which data transformation of gene position skew leads to a numerical index that can be applied to quantitative comparisons of genome organization. It was applied in the comparative analysis of 49 completely sequenced Firmicutes genomes, allowing the distinction of groups defined according to their patterns of gene strand preference. The resulting groups revealed that, regarding gene strand bias, reduced genomes are, in general, the more disordered among Firmicutes, while genomes of extremophile organisms include those with the highest degree of genome organization in this phylum.

15.
A gene team is a set of genes that appear in two or more species, possibly in a different order yet with the distance between adjacent genes in the team for each chromosome always no more than a certain threshold δ. A gene team tree is a succinct way to represent all gene teams for every possible value of δ. In this paper, improved algorithms are presented for the problem of finding the gene teams of two chromosomes and the problem of constructing a gene team tree of two chromosomes. For the problem of finding gene teams, Beal et al. had an O(n lg² n)-time algorithm. Our improved algorithm requires O(n lg t) time, where t ≤ n is the number of gene teams. For the problem of constructing a gene team tree, Zhang and Leong had an O(n lg² n)-time algorithm. Our improved algorithm requires O(n lg n lglg n) time. Similar to Beal et al.'s gene team algorithm and Zhang and Leong's gene team tree algorithm, our improved algorithms can be extended to k chromosomes with the time complexities increased only by a factor of k.
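The definition of a gene team can be made concrete with a naive splitting procedure. The sketch below is not any of the algorithms cited in the abstract and makes no attempt to match their time bounds; it assumes each gene occurs at most once per chromosome, and the names `split_by_gaps` and `gene_teams` are hypothetical. A candidate set is split wherever two neighbouring genes are more than δ apart on either chromosome, until every remaining set is contiguous on both.

```python
def split_by_gaps(genes, pos, delta):
    """Split genes into runs whose consecutive positions differ by at most delta."""
    ordered = sorted(genes, key=pos.__getitem__)
    runs, current = [], [ordered[0]]
    for gene in ordered[1:]:
        if pos[gene] - pos[current[-1]] > delta:
            runs.append(current)
            current = [gene]
        else:
            current.append(gene)
    runs.append(current)
    return runs


def gene_teams(pos_a, pos_b, delta):
    """Return the delta-teams of two chromosomes as a list of sets.

    pos_a and pos_b map gene name -> position on each chromosome.
    Singleton sets are reported as (trivial) teams.
    """
    common = set(pos_a) & set(pos_b)
    if not common:
        return []
    teams, stack = [], [common]
    while stack:
        genes = stack.pop()
        runs = split_by_gaps(genes, pos_a, delta)
        if len(runs) == 1:
            runs = split_by_gaps(genes, pos_b, delta)
        if len(runs) == 1:
            teams.append(set(genes))  # contiguous on both chromosomes
        else:
            stack.extend(runs)        # refine each piece further
    return teams
```

Splitting never separates two genes of the same team, because any gap wider than δ inside a candidate set also lies between genes that cannot be adjacent members of one team, so the sets that survive are the maximal teams.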

16.
17.
18.
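Zhou Haiting. Biotechnology (《生物技术》), 2002, 12(5): 33-34
Describes the hidden Markov model (HMM) in non-mathematical language, and introduces the principles of using HMMs for gene identification as well as the more commonly used HMM-based gene-finding programs.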
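As a rough illustration of how an HMM can label a sequence, the sketch below implements the standard Viterbi algorithm, which finds the most probable sequence of hidden states for an observed sequence. It is not taken from the article; the two-state "coding"/"noncoding" model in the usage note uses made-up probabilities, and all probabilities are assumed to be non-zero so that logarithms are defined.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most probable hidden-state path for obs under a discrete HMM.

    start_p[s], trans_p[r][s] and emit_p[s][symbol] are probabilities,
    all assumed to be strictly positive. Scores are kept in log space.
    """
    best = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at step t + 1
    for symbol in obs[1:]:
        prev, best, pointers = best, {}, {}
        for s in states:
            source = max(states, key=lambda r: prev[r] + math.log(trans_p[r][s]))
            best[s] = (prev[source] + math.log(trans_p[source][s])
                       + math.log(emit_p[s][symbol]))
            pointers[s] = source
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=best.__getitem__)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]
```

With a toy model in which the hypothetical state "C" (coding) favours G and C and the state "N" (noncoding) favours A and T, and each state tends to persist, the decoded path switches state where the base composition of the sequence changes. Real gene finders use far richer state structures (codon positions, start and stop signals).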

19.
Interpolated Markov models for eukaryotic gene finding.   Total citations: 21; self-citations: 0; citations by others: 21
Computational gene finding research has emphasized the development of gene finders for bacterial and human DNA. This has left genome projects for some small eukaryotes without a system that addresses their needs. This paper reports on a new system, GlimmerM, that was developed to find genes in the malaria parasite Plasmodium falciparum. Because the gene density in P. falciparum is relatively high, the system design was based on a successful bacterial gene finder, Glimmer. The system was augmented with specially trained modules to find splice sites and was trained on all available data from the P. falciparum genome. Although a precise evaluation of its accuracy is impossible at this time, laboratory tests (using RT-PCR) on a small selection of predicted genes confirmed all of those predictions. With the rapid progress in sequencing the genome of P. falciparum, the availability of this new gene finder will greatly facilitate the annotation process.

20.
Phylogenomic studies produce increasingly large phylogenetic forests of trees with patchy taxonomical sampling. Typically, prokaryotic data generate thousands of gene trees of all sizes that are difficult, if not impossible, to root. Their topologies do not match the genealogy of lineages, as they are influenced not only by duplication, losses, and vertical descent but also by lateral gene transfer (LGT) and recombination. Because this complexity in part reflects the diversity of evolutionary processes, the study of phylogenetic forests is thus a great opportunity to improve our understanding of prokaryotic evolution. Here, we show how the rich evolutionary content of such novel phylogenetic objects can be exploited through the development of new approaches designed specifically for extracting the multiple evolutionary signals present in the forest of life, that is, by slicing up trees into remarkable bits and pieces: clans, slices, and clips. We harvested a forest of 6,901 unrooted gene trees comprising up to 100 prokaryotic genomes (41 archaea and 59 bacteria) to search for evolutionary events that a species tree would not account for. We identified 1) trees and partitions of trees that reflected the lifestyle of organisms rather than their taxonomy, 2) candidate lifestyle-specific genetic modules, used by distinct unrelated organisms to adapt to the same environment, 3) gene families, nonrandomly distributed in the functional space, that were frequently exchanged between archaea and bacteria, sometimes without major changes in their sequences. Finally, 4) we reconstructed polarized networks of genetic partnerships between archaea and bacteria to describe some of the rules affecting LGT between these two Domains.
