Lower bounds for text indexing with mismatches and differences

Vincent Cohen-Addad, Laurent Feuilloley, Tatiana Starikovskaya.

SODA 2019: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, San Diego, California, USA, January 6-9, 2019
doi: 10.1137/1.9781611975482.70


ArXiv version Conference version CoA Slides


In this paper we study lower bounds for the fundamental problem of text indexing with mismatches and differences. In this problem we are given a long string of length $n$, the ``text'', and the task is to preprocess it into a data structure such that given a query string $Q$, one can quickly identify substrings that are within Hamming or edit distance at most $k$ from $Q$. This problem is at the core of various problems arising in biology and text processing.

While exact text indexing allows linear-size data structures with linear query time, text indexing with $k$ mismatches (or $k$ differences) seems to be much harder: All known data structures have exponential dependency on $k$ either in the space, or in the time bound. We provide conditional and pointer-machine lower bounds that make a step toward explaining this phenomenon.

We start by demonstrating lower bounds for $k = \Theta(\log n)$. We show that assuming the Strong Exponential Time Hypothesis, any data structure for text indexing that can be constructed in polynomial time cannot have $O(n^{1-\delta})$ query time, for any $\delta>0$. This bound also extends to the setting where we only ask for $(1+\epsilon)$-approximate solutions for text indexing.

However, in many applications the value of $k$ is rather small, and one might hope that for small~$k$ we can develop more efficient solutions. We show that this would require a radically new approach as using the current methods one cannot avoid exponential dependency on $k$ either in the space, or in the time bound for all even $\frac{8}{\sqrt{3}} \sqrt{\log n} \le k = o(\log n)$. Our lower bounds also apply to the dictionary look-up problem, where instead of a text one is given a set of strings.


Tatiana presented the paper at SODA 2019. I presented the paper briefly in the GT CoA workshop in April 2019, in Roscoff. (Roscoff is a marine biology station, in Britany, hence the seal in the slides.)