Constructing suffix arrays without suffix trees

The initial algorithm described by Manber and Myers for constructing suffix arrays did not rely on suffix trees. Instead, this algorithm used an elegant version of radix sort that we will sketch here. For the full details please refer to the suffix array paper (Manber and Myers 1993).

Our goal is to sort in lexicographic order all the suffixes of a string. While sorting takes $O(n log(n))$ time, we are dealing with strings and also have to account for the time necessary to compare two strings, leading to an $O(n^2 log(n))$ algorithm if implemented naïvely. Note that instead of actually sorting the suffixes, it suffices to assign each suffix an index indicating its position in the sorted list. We will build this index array $I$ throughout the execution of the algorithm.

At a first stage we simply perform a radix sort operation using the first character in each suffix. More precisely, we assign the array $I$ with values $0, 1, 2$ , or $3$ depending on whether the first character in each suffix starts with A, C, G, or T, respectively. We have, thus, implicitly broken up the set of suffixes into a collection of buckets (corresponding to the distinct values of the $I$ array) such that all suffixes in each bucket share the same first character. At the same time we have also filled the first two bits of the final values of the $I$ array.

We will now continue to further break up the initial buckets while filling in more information in the $I$ array. At each stage we will double the number of bits known for each value in the $I$ array. Within each bucket we will also double the size of the prefix shared by all suffixes in the bucket.

To do so we rely on a simple observation. Let us assume that at stage $k$ we have broken up the array into buckets such that all suffixes assigned to a same bucket share a prefix of length $l$ . The order in which these suffixes occur in the final suffix array can be determined by examining the bucket membership of suffixes $i + l$ . To clarify this point, assume we are sorting the suffixes of the string GATTACA. After the first stage, the suffixes will be grouped into buckets:

0 – A, ACA, ATTACA 1 – CA 2 – GATTACA 3 – TACA, TTACA

The order of the three suffixes in bucket $0$ is simply determined by the second character in each of these suffixes, which is implicitly encoded in the bucket membership of the subsequent suffixes. For example, we know ACA occurs before ATTACA because suffix CA is in an earlier bucket than suffix TTACA.

Thus, at each stage $k$ , we can identify the bucket structure by simply concatenating to the $I$ value of each suffix $i$ , the $I$ value from the previous round for the suffix $i + 2^{k-1}$ (assuming we start with stage $0$ ).

Here is the algorithm exemplified on the string GATTACA:

Stage 0: Radix sort based on the first character in each suffix. The index and suffixes are represented in the format: Index - Suffixes

0 - A, ACA, ATTACA 1 - CA 2 - GATTACA 3 - TACA, TTACA

Stage 1: Radix sort based on the first 2 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

00 - A 01 - ACA 03 - ATTACA 10 - CA 20 - GATTACA 30 - TACA 33 - TTACA

Stage 2: Radix sort based on the first 4 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

0000 - A 0100 - ACA 0330 - ATTACA 1000 - CA 2033 - GATTACA 3010 - TACA 3301 - TTACA

Stage 3: Radix sort based on the first 8 characters in each suffix. The index and suffixes are represented in the format: Index- Suffixes

00000000 - A 01000000 - ACA 03301000 - ATTACA 10000000 - CA 20330100 - GATTACA 30100000 - TACA 33010000 - TTACA

Thus, after just 3 rounds we have managed to sort all the suffixes (though in this case their order did not change after round 1). In general it is easy to see that we will have to execute $O(log(n))$ rounds, and at each round we only spend constant time for each suffix (look up $I$ value for suffix $i + 2^{k-1}$ and append to current $I$ value), for a total runtime of $O(n log(n))$ . While this runtime is slower than the $O(n)$ time for constructing a suffix tree, the entire procedure requires very little extra memory (just the $I$ array). With a bit of extra work we can also fill in the LCP array at the same time as computing the $I$ array.

Note that the way we describe the algorithm above, the number of bits added to the $I$ array is proportional to the length of the string, still yielding an $O(n^2)$ memory footprint. In practice (though harder to explain), the algorithm simply maintains the boundaries of the various buckets within the suffix array and sorts this array in place, not requiring additional memory.

References

Manber, U. and E. W. Myers (1993). Suffix arrays: A New Method for On-Line String Searches. SIAM Journal of Computing 22: 935-948. http://dl.acm.org/citation.cfm?id=320218

PreviousSuffix arrays NextThe Burrows-Wheeler transform and the FM-index

Last updated 1 month ago

hashtagReferences

References