Seven clusters in genomic triplet distributions |
| |
Authors: | Gorban Alexander N; Zinovyev Andrei Y; Popova Tatyana G |
| |
Affiliation: | Institute of Computational Modeling, Russian Academy of Sciences. |
| |
Abstract: | In several recent papers, new gene-detection algorithms were proposed for detecting protein-coding regions without requiring a learning dataset of already known genes. The fact that unsupervised gene detection is possible is closely connected to the existence of a cluster structure in oligomer frequency distributions. In this paper we study the cluster structure of several genomes in the space of their triplet frequencies, using a pure data exploration strategy. Several complete genomic sequences were analyzed by visualizing tables of triplet frequencies computed in a sliding window. The distribution of 64-dimensional vectors of triplet frequencies displays a clearly detectable cluster structure. The structure was found to consist of seven clusters: six correspond to protein-coding information in one of the three possible phases on one of the two complementary strands, and the seventh corresponds to the non-coding regions. The correspondence holds with high accuracy (higher than 90% at the nucleotide level). Visualizing and understanding the structure makes it possible to analyze the performance of different gene-prediction tools effectively. Since the method does not require extraction of ORFs, it can be applied even to unassembled genomes. |
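The sliding-window triplet frequency table mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state how triplets are counted or which window parameters are used, so the non-overlapping counting from the window start, the window length of 300, and the step of 12 are all assumptions, and the function names are hypothetical.

```python
from itertools import product

# All 64 possible triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(fragment):
    """Return a 64-dimensional frequency vector for one window.

    Assumption: triplets are read without overlap, starting at the
    first position of the fragment. Triplets containing symbols other
    than A, C, G, T are skipped.
    """
    counts = [0] * 64
    total = 0
    for i in range(0, len(fragment) - 2, 3):
        idx = INDEX.get(fragment[i:i + 3])
        if idx is not None:
            counts[idx] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [c / total for c in counts]


def sliding_window_vectors(sequence, window=300, step=12):
    """Build the table of triplet frequencies, one row per window.

    The window and step defaults are illustrative assumptions only.
    """
    sequence = sequence.upper()
    return [
        triplet_frequencies(sequence[start:start + window])
        for start in range(0, len(sequence) - window + 1, step)
    ]
```

Each row of the resulting table is one point in the 64-dimensional space in which the cluster structure is studied. Under the counting assumption above, a window lying inside a gene reads the coding triplets in one of three phases depending on where the window starts, which is one way to see why each strand could contribute three clusters.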
| |
Keywords: | |
This article is indexed in PubMed and other databases. |
|