Introduction to string indexing

Introduction

In this chapter we'll focus on different approaches used to index strings or collections of strings such that subsequent queries can be performed more efficiently. When discussing string matching in the context of the Z and KMP algorithms, the Z and sp values were a form of indexing, but the indices were constructed for each pattern separately. More specifically, assume you have a text TT of length nn and a pattern P1P_1 of length m1m_1. The run time required to find P1P_1 inside TT is O(m1+n)O(m_1 + n) for either the Z algorithm or KMP (if this doesn't make sense, please read the chapter again).

Now, let's assume we have k different patterns: P1,...,PkP_1,..., P_k. If we try to search them against TT, we incur a cost of O(mi+n)O(m_i + n) for each pattern ii, leading to a total runtime of O(nk+imi)O(nk + \sum_i m_i) for all kk patterns. We can't easily avoid the second part of this cost (imi\sum_i m_i) as we have to look at every character in each pattern. But can we avoid going through the text kk times? The indexing approaches described in this chapter will help us do so, and also provide some additional useful functionality.

Last updated