Constructing suffix arrays without suffix trees

The initial algorithm described by Manber and Myers for constructing suffix arrays did not rely on suffix trees. Instead, this algorithm used an elegant version of radix sort that we will sketch here. For the full details please refer to the suffix array paper (Manber and Myers 1993).

Our goal is to sort in lexicographic order all the suffixes of a string. While sorting takes O(nlog(n))O(n log(n)) time, we are dealing with strings and also have to account for the time necessary to compare two strings, leading to an O(n2log(n))O(n^2 log(n)) algorithm if implemented naïvely. Note that instead of actually sorting the suffixes, it suffices to assign each suffix an index indicating its position in the sorted list. We will build this index array II throughout the execution of the algorithm.

At a first stage we simply perform a radix sort operation using the first character in each suffix. More precisely, we assign the array II with values 0,1,20, 1, 2, or 33 depending on whether the first character in each suffix starts with A, C, G, or T, respectively. We have, thus, implicitly broken up the set of suffixes into a collection of buckets (corresponding to the distinct values of the II array) such that all suffixes in each bucket share the same first character. At the same time we have also filled the first two bits of the final values of the II array.

We will now continue to further break up the initial buckets while filling in more information in the II array. At each stage we will double the number of bits known for each value in the II array. Within each bucket we will also double the size of the prefix shared by all suffixes in the bucket.

To do so we rely on a simple observation. Let us assume that at stage kk we have broken up the array into buckets such that all suffixes assigned to a same bucket share a prefix of length ll. The order in which these suffixes occur in the final suffix array can be determined by examining the bucket membership of suffixes i+li + l. To clarify this point, assume we are sorting the suffixes of the string GATTACA. After the first stage, the suffixes will be grouped into buckets:

0 – A, ACA, ATTACA 1 – CA 2 – GATTACA 3 – TACA, TTACA

The order of the three suffixes in bucket 00 is simply determined by the second character in each of these suffixes, which is implicitly encoded in the bucket membership of the subsequent suffixes. For example, we know ACA occurs before ATTACA because suffix CA is in an earlier bucket than suffix TTACA.

Thus, at each stage kk, we can identify the bucket structure by simply concatenating to the II value of each suffix ii, the II value from the previous round for the suffix i+2k1i + 2^{k-1} (assuming we start with stage 00).

Here is the algorithm exemplified on the string GATTACA:

Stage 0: Radix sort based on the first character in each suffix. The index and suffixes are represented in the format: Index - Suffixes

0 - A, ACA, ATTACA 1 - CA 2 - GATTACA 3 - TACA, TTACA

Stage 1: Radix sort based on the first 2 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

00 - A 01 - ACA 03 - ATTACA 10 - CA 20 - GATTACA 30 - TACA 33 - TTACA

Stage 2: Radix sort based on the first 4 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

0000 - A 0100 - ACA 0330 - ATTACA 1000 - CA 2033 - GATTACA 3010 - TACA 3301 - TTACA

Stage 3: Radix sort based on the first 8 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

00000000 - A 01000000 - ACA 03301000 - ATTACA 10000000 - CA 20330100 - GATTACA 30100000 - TACA 33010000 - TTACA

Thus, after just 3 rounds we have managed to sort all the suffixes (though in this case their order did not change after round 1). In general it is easy to see that we will have to execute O(log(n))O(log(n)) rounds, and at each round we only spend constant time for each suffix (look up II value for suffix i+2k1i + 2^{k-1} and append to current II value), for a total runtime of O(nlog(n))O(n log(n)). While this runtime is slower than the O(n)O(n) time for constructing a suffix tree, the entire procedure requires very little extra memory (just the II array). With a bit of extra work we can also fill in the LCP array at the same time as computing the II array.

Note that the way we describe the algorithm above, the number of bits added to the II array is proportional to the length of the string, still yielding an O(n2)O(n^2) memory footprint. In practice (though harder to explain), the algorithm simply maintains the boundaries of the various buckets within the suffix array and sorts this array in place, not requiring additional memory.

References

Manber, U. and E. W. Myers (1993). Suffix arrays: A New Method for On-Line String Searches. SIAM Journal of Computing 22: 935-948. http://dl.acm.org/citation.cfm?id=320218arrow-up-right

Last updated