In this paper we study lower bounds for the fundamental problem
of text indexing with mismatches and differences. In this
problem we are given a long string of length $n$, the ``text'',
and the task is to preprocess it into a data structure such that
given a query string $Q$, one can quickly identify the substrings
of the text that are within Hamming or edit distance at most $k$
from $Q$. This problem lies at the core of numerous applications
in biology and text processing.
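For concreteness, the following minimal sketch illustrates the $k$-mismatch (Hamming distance) variant of the problem via a brute-force query with no preprocessing. It is a hypothetical baseline for exposition only; the function name and the $O(n\,|Q|)$ scan are our illustration, not a data structure considered in this paper.
\begin{verbatim}
def naive_k_mismatch_query(text, q, k):
    # Brute-force baseline (illustration only): report every start
    # position i such that text[i:i+len(q)] is within Hamming
    # distance k of q. Runs in O(|text| * |q|) time; an index would
    # aim to answer such queries much faster after preprocessing.
    m = len(q)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], q):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break
        if mismatches <= k:
            hits.append(i)
    return hits

# Example: occurrences of "ACGT" with at most 1 mismatch.
# naive_k_mismatch_query("ACGAACGT", "ACGT", 1) -> [0, 4]
\end{verbatim}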
While exact text indexing allows linear-size data structures with
query time linear in the length of the query, text indexing with
$k$ mismatches (or $k$ differences) seems to be much harder: all
known data structures have an exponential dependency on $k$ either
in the space or in the time bound. We provide conditional and
pointer-machine lower bounds that take a step toward explaining
this phenomenon.
We start by demonstrating lower bounds for $k = \Theta(\log n)$.
We show that, assuming the Strong Exponential Time Hypothesis,
any data structure for text indexing that can be constructed in
polynomial time cannot have $O(n^{1-\delta})$ query time for any
$\delta>0$. This bound also extends to the setting where we ask
only for $(1+\epsilon)$-approximate solutions for text indexing.
However, in many applications the value of $k$ is rather small,
and one might hope that for small~$k$ we can develop more
efficient solutions. We show that this would require a radically
new approach: using the current methods, one cannot avoid an
exponential dependency on $k$ either in the space or in the time
bound for every even $k$ with
$\frac{8}{\sqrt{3}} \sqrt{\log n} \le k = o(\log n)$. Our lower
bounds also apply to the dictionary look-up problem, where
instead of a text one is given a set of strings.