An efficient algorithm for finding short approximate non-tandem repeats |
| |
Authors: | Adebiyi E F Jiang T Kaufmann M |
| |
Institution: | Wilhelm-Schickard-Institut für Informatik, Universit?t Tübingen, Sand 13, Tübingen, 72076, Germany. adebiyi@informatik.uni-tuebingen.de |
| |
Abstract: | We study the problem of approximate non-tandem repeat extraction. Given a long subject string S of length N over a finite alphabet Sigma and a threshold D, we would like to find all short substrings of S of length P that repeat with at most D differences, i.e., insertions, deletions, and mismatches. We give a careful theoretical characterization of the set of seeds (i.e., some maximal exact repeats) required by the algorithm, and prove a sublinear bound on their expected numbers. Using this result, we present a sub-quadratic algorithm for finding all short (i.e., of length O(log N)) approximate repeats. The running time of our algorithm is O(DN(3pow(epsilon)-1)log N), where epsilon = D/P and pow(epsilon) is an increasing, concave function that is 0 when epsilon = 0 and about 0.9 for DNA and protein sequences. |
| |
Keywords: | |
本文献已被 PubMed 等数据库收录! |
|