Provably sensitive indexing strategies for biosequence similarity search.期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

Provably sensitive indexing strategies for biosequence similarity search.

Authors:	Jeremy Buhler

Affiliation:	Department of Computer Science and Engineering, Campus Box 1045, Washington University, One Brookings Drive, St. Louis, MO 63130, USA. jbuhler@cse.wustl.edu

Abstract:	Fast algorithms for pairwise biosequence similarity search frequently use filtering and indexing strategies to identify potential matches between a query sequence and a database. For the most part, these strategies are not informed by the substitution score matrices commonly used by comparison algorithms to assign numerical scores to pairs of aligned residues. Consequently, although many filtering strategies offer strong formal guarantees about their ability to detect pairs of sequences differing by few substitutions, these methods can make no guarantee of detecting pairs with high similarity scores. We describe a general technique, score simulation, to help resolve the tension between existing filtering techniques and the use of score matrices. Score simulation, using score matrices, maps ungapped similarity search problems to the simpler problem of finding pairs of strings that differ by few substitutions. Score simulation leads to indexing schemes for biosequences that permit efficient ungapped similarity search with arbitrary score matrices while maintaining strong formal guarantees of sensitivity. We introduce the LSH-ALL-PAIRS-SIM algorithm for finding local similarities in large biosequence collections and show that it is both computationally feasible and sensitive in practice.

Keywords:

设为首页 | 免责声明 | 关于勤云 | 加入收藏