Constructing benchmark test sets for biological sequence analysis using independent set algorithms
Authors:Samantha Petti, Sean R. Eddy
Institution:1. NSF-Simons Center for the Mathematical and Statistical Analysis of Biology, Harvard University, Cambridge, Massachusetts, United States of America ; 2. Howard Hughes Medical Institute; Department of Molecular & Cellular Biology; and John A. Paulson School of Engineering and Applied Sciences, Harvard University, Cambridge, Massachusetts, United States of America ; University of Maryland Baltimore County, UNITED STATES
Abstract:Biological sequence families contain many sequences that are very similar to each other because they are related by evolution, so the strategy for splitting data into separate training and test sets is a nontrivial choice in benchmarking sequence analysis methods. A random split is insufficient because it will yield test sequences that are closely related or even identical to training sequences. Adapting ideas from independent set graph algorithms, we describe two new methods for splitting sequence data into dissimilar training and test sets. These algorithms input a sequence family and produce a split in which each test sequence is less than p% identical to any individual training sequence. These algorithms successfully split more families than a previous approach, enabling construction of more diverse benchmark datasets.
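The splitting constraint described in the abstract can be illustrated with a minimal sketch. This is a simplified greedy procedure written for illustration only, not either of the two algorithms from the paper. The function names `percent_identity` and `greedy_split`, the `test_fraction` parameter, and the assumption that sequences are already aligned to equal length are all hypothetical choices made here.

```python
import random


def percent_identity(a, b):
    """Percent of matching positions between two sequences.

    Assumption: both sequences are pre-aligned to the same nonzero length.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be aligned to the same nonzero length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def greedy_split(seqs, p, test_fraction=0.3, seed=0):
    """Split seqs so every test sequence is < p% identical to every training sequence.

    Illustrative greedy heuristic (not the paper's method): visit sequences in
    random order and place each one on a side where it does not conflict with
    the opposite side; discard it if it conflicts with both sides.
    """
    rng = random.Random(seed)
    order = list(seqs)
    rng.shuffle(order)

    train, test, discarded = [], [], []
    for s in order:
        # s may join train only if it is dissimilar to everything already in test.
        ok_train = all(percent_identity(s, t) < p for t in test)
        # s may join test only if it is dissimilar to everything already in train.
        ok_test = all(percent_identity(s, t) < p for t in train)

        if ok_train and ok_test:
            # Either side is valid; choose at random to keep both sets non-trivial.
            (test if rng.random() < test_fraction else train).append(s)
        elif ok_train:
            train.append(s)
        elif ok_test:
            test.append(s)
        else:
            discarded.append(s)
    return train, test, discarded
```

Because a sequence is added to one side only after it has been checked against every member of the other side, the returned split satisfies the identity threshold by construction. The cost is that some sequences may be discarded, and a heuristic this simple makes no attempt to minimize that loss.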