The Burrows-Wheeler transform and the FM-index
The Burrows-Wheeler transform
The Burrows-Wheeler transform (BWT) is a datastructure originally developed in the context of data compression (Burrows and Wheeler 1994). We will illustrate this datastructure with a simple example for the string BANANA. Just like in the case of suffix trees, we'll add character $ at the end of the string. We will also assume that this character is lexicographically less than any of the other characters in the alphabet.
The following description is primarily for the purpose of explanation and is not the way the BWT is computed in practice.
The first step in the procedure is to create a table that contains all circular rotations of the original string:
In a second step, we lexicographically sort this array:
Finally, we retain the final column of the array:
Now you can get a bit of intuition about why the transform can be used for compression: the final string appears a lot more compressible (strings of a same character can be easily compressed with simple techniques such as run-length encoding).
Also note that while the construction approach described above appears to be quite convoluted, there is a very natural connection between the Burrows-Wheeler transform and suffix arrays. Specifically, if you remove all characters after the $ in the sorted table you have:
Decoding the Burrows-Wheeler transform
Coming back to the Burrows-Wheeler transform, we have so far presented a simple approach for taking a string, scrambling its letters, and obtaining a new string that is more compressible. This approach would be completely useless if we didn't have a way to unscramble the string, a task that is by no means obvious. Let us examine the sorted BWT table more closely.
The last column is the Burrows-Wheeler transform. The first column is rather un-interesting—it is simply a sorted list of all the characters in the string. Let us number the occurrences of each character in this column (shown with superscripts above). A careful examination of the table shows that the same characters occur in the same order in the last column. This property allows us to undo the transform and reconstruct the original string from just the transformed string. In other words, we can reverse the compression and can thus get a useful compression algorithm (incidentally, the bzip2 unix utility uses this algorithm).
Searching in the Burrows-Wheeler transform
The Burrows-Wheeler transform has so far been useful for compression—can we, however, use it for matching? Clearly, the suffix array is implicit in the Burrows-Wheeler transform. What additional information do we need to perform matching with the BWT? Note that performing matching with the BWT has the potential to give us a very efficient algorithm—the BWT is very efficient in space since we are only storing characters. The suffix array stores integers, which require more than one byte of storage.
To sketch the search algorithm, let us focus on the BANANA example and try to find strings NAN and CAN within this text. We will proceed in the same fashion as we did before, moving from the end of the pattern being matched. When 'unrolling' the transformed string we knew where to start the process, specifically at the $ character. For pattern search, instead, we will simply find all the suffixes that start with N (the last character in the pattern). These are highlighted with arrows below:
We will then examine the last character of these suffixes and retain just those suffixes that have a last character that matches the second to last character of our string. In this case both rows we identified end with the same character (A) which also matches the string. To proceed we must identify the location in the first column where these characters occur (the second and third As respectively, suffixes highlighted with arrows):
Again, we check the last character in the corresponding rows and determine that the string NAN exists in our text (one of the rows ends in N) while the string CAN does not (none of the last characters are C).
In essence the procedure we follow is very similar to a binary search procedure—the search range is always represented in the first column as a contiguous range of characters, all matching the current position reached in the pattern. At each stage we update this range by selecting from the last column the characters that match the next letter in the pattern.
Question: The algorithm described above can easily find whether a match occurs as well as count the number of matches. How can we determine the location of the matches in the text?
The FM-index—balancing memory and speed
Also, the FM-index introduces an additional 'trick' that can reduce both memory and runtime at the same time. As mentioned earlier, the Burrows-Wheeler string is inherently more compressible. The FM-index uses this property to compress the string within each bucket, thereby further reducing the space requirements. Interestingly, depending on the exact compression technique used, operations such as finding the number of characters matching the pattern within a bucket can be performed more efficiently as well. For example, by using run-length encoding, the character counts can be computed directly within the compressed string. Since the runtime depends on bucket size, the computation can, thus, be sped up by compressing the buckets.
Notes
In the original FM-index paper, the authors used a combination of 'move to front' compression and run-length encoding, combination that the authors argue is close to the information theoretic lower bound for compression. The FM-index is, thus, a nice example of a compressed datastructure that allows searches without requiring one to uncompress the data—imagine being able to search inside a .gz file. This line of research is likely to become 'hot' as the computational biology community is trying to deal with the large amounts of data becoming available.
Exercises
Remember that a suffix tree is a compressed representation of all suffixes in a string, such that each suffix is represented by a different leaf in the tree. The least common ancestor of two nodes in a tree is the lowest node shared by the paths from the two nodes to the root. Assume n is the least common ancestor of leaves i and j in a suffix tree for string S. What does this node represent?
Construct the suffix tree for the string: ATATCCATCAT$ Please label each leaf with the corresponding suffix label.
Write the pseudo-code for a recursive function that prints the suffixes (just the labels on the leaves) from a suffix tree in lexicographically sorted order.
Describe how you would use a suffix tree to compute the sp(i) values for the KMP algorithm. How would the algorithm change if you tried to compute the sp'(i) value.
What is the string corresponding to the Burrows-Wheeler Transform shown below?
How would you solve the converse problem of finding the maximum length among all pairwise longest common prefixes?
Given two strings (s1 and s2), design an efficient algorithm that identifies the shortest substring of s1 that is not a substring of s2.
Is the largest value in the LCP array equal to the length of the longest repeat in a string? Please provide a proof supporting your answer.
References
Last updated