

Vector space classification of DNA sequences
Authors:Müller H-M  Koonin S E
Institution:Division of Biology and W. K. Kellogg Radiation Laboratory, California Institute of Technology, 1201 East California Boulevard, Pasadena, CA 91125, USA. mueller@its.caltech.edu
Abstract:Revisiting the problem of intron-exon identification, we use a principal component analysis (PCA) to classify DNA sequences and present first results that validate our approach. Sequences are translated into document vectors that represent their word content; a principal component analysis then defines Gaussian-distributed sequence classes. The classification uses word content and variation of word usage to distinguish sequences. We test our approach with several data sets of genomic DNA and are able to classify introns and exons with an accuracy of up to 96%. We compare the method with the best traditional coding measure, the non-overlapping hexamer frequency count, and find that the PCA method produces better results. We also investigate the degree of cross-validation between different data sets of introns and exons and find evidence that the quality of a data set can be detected.
Keywords:Intron-exon identification  Principal component analysis  Genomics  Gene structure  Document vector  Clustering
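Illustrative sketch: the pipeline outlined in the abstract (sequences to word-content vectors, PCA, Gaussian sequence classes) might look roughly like the following. This is not the authors' implementation. The word length of 3, the use of overlapping words, the PCA pooled across both classes, the full-covariance Gaussian per class, and all function names (`document_vector`, `fit_pca`, `train`, `classify`) are assumptions made for illustration; the abstract does not specify any of them.

```python
import itertools
import numpy as np

K = 3  # assumed word length; the abstract does not state the value used
WORDS = ["".join(p) for p in itertools.product("ACGT", repeat=K)]
INDEX = {w: i for i, w in enumerate(WORDS)}


def document_vector(seq, k=K):
    """Relative frequencies of overlapping k-letter words in a DNA sequence."""
    v = np.zeros(len(WORDS))
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = INDEX.get(seq[i:i + k])  # words with ambiguous bases are skipped
        if j is not None:
            v[j] += 1
    total = v.sum()
    return v / total if total else v


def fit_pca(X, n_components):
    """Return the data mean and the leading principal axes (rows)."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]


def fit_gaussian(Z, eps=1e-9):
    """Mean and regularized covariance of the projected vectors of one class."""
    mu = Z.mean(axis=0)
    cov = np.cov(Z, rowvar=False) + eps * np.eye(Z.shape[1])
    return mu, cov


def log_density(z, mu, cov):
    """Log of the multivariate normal density at z."""
    d = z - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d) + logdet
                   + len(z) * np.log(2 * np.pi))


def train(labelled, n_components=4):
    """labelled maps a class name to a list of training sequences."""
    vectors = {name: np.array([document_vector(s) for s in seqs])
               for name, seqs in labelled.items()}
    pooled = np.vstack(list(vectors.values()))
    mean, axes = fit_pca(pooled, n_components)
    model = {name: fit_gaussian((X - mean) @ axes.T)
             for name, X in vectors.items()}
    return mean, axes, model


def classify(seq, mean, axes, model):
    """Assign the class whose Gaussian gives the highest log density."""
    z = (document_vector(seq) - mean) @ axes.T
    return max(model, key=lambda name: log_density(z, *model[name]))
```

With a dictionary such as `{"exon": [...], "intron": [...]}` of training sequences, `train` returns the projection and the per-class Gaussians, and `classify` labels a new sequence. Fitting one PCA on the pooled data keeps the class densities comparable in a common subspace; fitting a separate PCA per class is an equally plausible reading of the abstract.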
This article is indexed in ScienceDirect, PubMed, and other databases.