Bioinformatics lecture notes

The Burrows-Wheeler transform and the FM-index

The Burrows-Wheeler transform

The Burrows-Wheeler transform (BWT) is a data structure originally developed in the context of data compression (Burrows and Wheeler 1994). We will illustrate it with a simple example for the string BANANA. Just as in the case of suffix trees, we'll add the character $ at the end of the string. We will also assume that this character is lexicographically smaller than any other character in the alphabet.

The following description is primarily for the purpose of explanation and is not the way the BWT is computed in practice.

The first step in the procedure is to create a table that contains all circular rotations of the original string:

BANANA$
ANANA$B
NANA$BA
ANA$BAN
NA$BANA
A$BANAN
$BANANA

In a second step, we lexicographically sort this array:

$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA
NANA$BA

Finally, we retain the final column of the array:

		$BANANA
		A$BANAN
		ANA$BAN
BANANA$ =>      ANANA$B => ANNB$AA
		BANANA$
		NA$BANA
		NANA$BA

Now you can get a bit of intuition about why the transform can be used for compression: the final string appears a lot more compressible, since runs of the same character can be easily compressed with simple techniques such as run-length encoding.
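The rotation-based construction described above can be sketched directly in Python. The function name `bwt_by_rotation` is ours, and the input is assumed to already end in the $ sentinel (which sorts before the letters, as in ASCII):

```python
# A sketch of the rotation-based construction described above. This is
# for illustration only: it uses O(n^2) space and is not how the BWT
# is computed in practice.
def bwt_by_rotation(text):
    """Sort all circular rotations of `text` (which must end in the
    sentinel '$') and keep the last character of each rotation."""
    n = len(text)
    rotations = sorted(text[i:] + text[:i] for i in range(n))
    return ''.join(row[-1] for row in rotations)

print(bwt_by_rotation("BANANA$"))  # prints ANNB$AA
```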

Also note that while the construction approach described above appears to be quite convoluted, there is a very natural connection between the Burrows-Wheeler transform and suffix arrays. Specifically, if you remove all characters after the $ in the sorted table you have:

$
A$
ANA$
ANANA$
BANANA$
NA$
NANA$

These are exactly the sorted suffixes of the string, i.e., the suffix array. The Burrows-Wheeler transform itself is the string of characters preceding each of the suffixes. This observation also points to a way of computing the Burrows-Wheeler transform without needing to build the O(n^2)-sized table.
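This connection can be sketched as follows. The function name is ours, and we use a naive suffix sort for clarity; real implementations use linear-time suffix array construction:

```python
def bwt_by_suffix_array(text):
    """Compute the BWT from the suffix array: for each suffix, in
    sorted order, emit the character that precedes it in the text
    (wrapping around for the suffix that starts at position 0)."""
    n = len(text)
    # naive suffix array: starting positions of all suffixes, sorted
    sa = sorted(range(n), key=lambda i: text[i:])
    return ''.join(text[(i - 1) % n] for i in sa)

print(bwt_by_suffix_array("BANANA$"))  # prints ANNB$AA
```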

Decoding the Burrows-Wheeler transform

Coming back to the Burrows-Wheeler transform, we have so far presented a simple approach for taking a string, scrambling its letters, and obtaining a new string that is more compressible. This approach would be completely useless if we didn't have a way to unscramble the string, a task that is by no means obvious. Let us examine the sorted BWT table more closely.

$¹BANANA¹
A¹$BANAN¹
A²NA$BAN²
A³NANA$B¹
B¹ANANA$¹
N¹A$BANA²
N²ANA$BA³

The last column is the Burrows-Wheeler transform. The first column is rather uninteresting: it is simply a sorted list of all the characters in the string. Let us number the occurrences of each character in this column (shown with superscripts above). A careful examination of the table shows that the occurrences of each character appear in the same relative order in the first and last columns. This property allows us to undo the transform and reconstruct the original string from just the transformed string. In other words, we can reverse the transform and thus obtain a useful compression algorithm (incidentally, the bzip2 unix utility uses this algorithm).

To prove this 'last-first' property it suffices to look at the order in which the characters occur in the last column. Let us focus on just the As (a similar argument can be made for any other character). Note that A¹ is the first character in suffix 6, A² is the first character in suffix 4, and A³ is the first character in suffix 2. The order in which these characters occur in the last column is determined by the order of the subsequent suffixes in the table: suffixes 7, 5, and 3, respectively. Now, shifting our attention to the first column, the same three characters occur in the same order because the corresponding rows all start with an A followed by the same three suffixes (7, 5, and 3). In other words, the two orderings have to be the same in both columns.

Before we describe the decompression algorithm, note that we need to have available to us the first and the last column of the Burrows-Wheeler table. The last one is the Burrows-Wheeler transform itself, while the first one can be easily computed in O(n log n) time by sorting the individual characters in the text. Now, the algorithm simply proceeds by starting with the first column in the first row (starting with $), thus identifying the last character in the original string. The last character in the same row (A) is the second to last character in the string. To continue, we need to identify the same character in the first column, an operation that is now easy based on the earlier observation: we simply find the first A in the first column, proceed to the last column of that row (now finding an N, the third to last character), and so on. The algorithm in pseudo-code is shown below:

DECODE_BWT(string)

last_col = array(string)    # the Burrows-Wheeler transform as an array
first_col = sort(last_col)  # sorted list of all characters in the text
current_row = 0             # current row in the BWT matrix
output = ""
while TRUE:    # endless loop - we'll stop when we reach the '$'
   if last_col[current_row] == '$':  # we have decoded the whole string
       print output
       exit()
   prepend last_col[current_row] to output  # fill in output in reverse order
   # next, move to the row where the corresponding occurrence (same
   # occurrence number) of this character appears in the first column
   current_row = row of the matching occurrence of last_col[current_row] in first_col
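A runnable version of this pseudo-code, using the last-first property to move between columns (the function names are ours, not from the original paper):

```python
def lf_mapping(last_col):
    """For each row, the row at which the last-column character
    re-appears in the first column (same occurrence number)."""
    counts, occ = {}, []
    for ch in last_col:
        counts[ch] = counts.get(ch, 0) + 1
        occ.append(counts[ch])  # occurrence number of this character
    # first row at which each character appears in the (sorted) first column
    starts, total = {}, 0
    for ch in sorted(counts):
        starts[ch] = total
        total += counts[ch]
    return [starts[ch] + occ[i] - 1 for i, ch in enumerate(last_col)]

def decode_bwt(bwt):
    """Reconstruct the original string (without the '$') from its BWT."""
    lf = lf_mapping(bwt)
    row, out = 0, []
    while bwt[row] != '$':
        out.append(bwt[row])  # characters are recovered in reverse order
        row = lf[row]
    return ''.join(reversed(out))

print(decode_bwt("ANNB$AA"))  # prints BANANA
```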

Searching in the Burrows-Wheeler transform

The Burrows-Wheeler transform has so far been useful for compression. Can we, however, use it for matching? Clearly, the suffix array is implicit in the Burrows-Wheeler transform. What additional information do we need to perform matching with the BWT? Note that matching with the BWT has the potential to give us a very efficient algorithm: the BWT is very efficient in space since we are only storing characters, whereas the suffix array stores integers, which require more than one byte of storage each.

To sketch the search algorithm, let us focus on the BANANA example and try to find strings NAN and CAN within this text. We will proceed in the same fashion as we did before, moving from the end of the pattern being matched. When 'unrolling' the transformed string we knew where to start the process, specifically at the $ character. For pattern search, instead, we will simply find all the suffixes that start with N (the last character in the pattern). These are highlighted with arrows below:

$BANANA
A$BANAN
ANA$BAN
ANANA$B
BANANA$
NA$BANA <=
NANA$BA <=

We will then examine the last character of these suffixes and retain just those suffixes that have a last character that matches the second to last character of our string. In this case both rows we identified end with the same character (A) which also matches the string. To proceed we must identify the location in the first column where these characters occur (the second and third As respectively, suffixes highlighted with arrows):

$BANANA
A$BANAN
ANA$BAN <=
ANANA$B <=
BANANA$
NA$BANA
NANA$BA

Again, we check the last character in the corresponding rows and determine that the string NAN exists in our text (one of the rows ends in N) while the string CAN does not (none of the last characters are C).

In essence the procedure we follow is very similar to a binary search procedure—the search range is always represented in the first column as a contiguous range of characters, all matching the current position reached in the pattern. At each stage we update this range by selecting from the last column the characters that match the next letter in the pattern.

Note that at any stage in the process we need to find the current character of the pattern within the last column of the Burrows-Wheeler table, a process that may require O(log n) time, leading to an O(m log n) runtime for our algorithm. To accelerate this process we can record additional information: for each row we record the number of occurrences of each letter in the last column up to and including that row (see below):

            A B N
$BANANA	    1 0 0 
A$BANAN     1 0 1 
ANA$BAN     1 0 2
ANANA$B     1 1 2
BANANA$     1 1 2
NA$BANA     2 1 2
NANA$BA     3 1 2

This simple information allows us to identify the right range in the first column by simply looking up the current character in the pattern at the boundaries of the current range. For example, when looking for A after having matched N (the first stage described above), we notice that the values of the array for A are 2 and 3, indicating that the new search range corresponds to the second and third As in the first column. Similarly, after having matched AN, we find that only one N character occurs within the search range and that it is the second N in the array. This simple approach results in a runtime of O(m)O(m)O(m) as we no longer have to perform binary search to find the characters in the array.
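The full backward search can be sketched as follows. The function name `count_matches` is ours, and the ranks are recomputed naively here rather than read from a precomputed table, so this sketch runs in O(mn) rather than O(m):

```python
def count_matches(bwt, pattern):
    """Count occurrences of `pattern` in the text whose BWT is `bwt`,
    by maintaining the range of rows whose suffixes start with the
    already-matched tail of the pattern."""
    n = len(bwt)
    # C[c] = number of characters in the text lexicographically smaller than c
    counts = {}
    for ch in bwt:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]

    def rank(c, i):  # occurrences of c among the first i rows of the last column
        return bwt[:i].count(c)

    lo, hi = 0, n  # half-open row range, initially the whole table
    for c in reversed(pattern):  # match from the end of the pattern
        if c not in C:
            return 0
        lo = C[c] + rank(c, lo)
        hi = C[c] + rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

print(count_matches("ANNB$AA", "NAN"))  # prints 1
print(count_matches("ANNB$AA", "CAN"))  # prints 0
```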

Question: The algorithm described above can easily find whether a match occurs as well as count the number of matches. How can we determine the location of the matches in the text?

The simple answer is that we can unroll the text from the location where a match is found in order to identify its position in the text, thereby incurring an additional cost of O(n). This cost can be overcome by simply storing the coordinate of each suffix as well (in addition to the table of character counts), allowing us to simply look up the position of the match as soon as we determine that a match occurred.

The FM-index—balancing memory and speed

As hinted above, the basic use of the Burrows-Wheeler transform in matching can be slow, requiring O(m log n + n) time to find and identify the location of matches. This is slower than the O(m + log n) time required by the suffix array, and even slower than the O(m) time required by the suffix tree. Speeding up the process requires the use of additional memory, which essentially transforms the data structure into a suffix array, thereby offsetting the main memory advantage of the Burrows-Wheeler transform. Can we find a balance between these two extremes?

The basic idea is to only record the additional information (character counts, suffix array information) for specific rows in the Burrows-Wheeler table. Assume we store this information for every b-th row, breaking up the array into a collection of "buckets", each comprising b rows. Our algorithm can simply look up the necessary information at bucket boundaries, but needs to spend O(b) time to compute the information within a particular bucket (bounded by the rows where we recorded the information). The total memory requirement is thus reduced to O(n log(n) / b), while the runtime is slightly increased to O(mb + b) for matching a pattern of length m (at each step we may have to "undo" the Burrows-Wheeler transform for up to b rounds until we hit a row where we pre-recorded information). By varying the bucket size b we can, thus, choose a different space versus speed tradeoff. This basic approach is the key idea behind the FM-index (Ferragina and Manzini 2000), a search data structure built upon the Burrows-Wheeler transform.
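The bucketing idea can be sketched as follows, assuming a bucket size b (function names are ours): counts are stored only at every b-th row, and a rank query scans at most b characters past the nearest checkpoint:

```python
def build_checkpoints(bwt, b):
    """Character counts of bwt[:i], recorded at every row i divisible by b."""
    checkpoints, counts = [], {}
    for i, ch in enumerate(bwt):
        if i % b == 0:
            checkpoints.append(dict(counts))
        counts[ch] = counts.get(ch, 0) + 1
    return checkpoints

def rank_with_checkpoints(bwt, checkpoints, b, c, i):
    """Occurrences of c in bwt[:i]: nearest checkpoint at or below i,
    plus a scan of at most b remaining characters."""
    j = min(i // b, len(checkpoints) - 1)
    return checkpoints[j].get(c, 0) + bwt[j * b:i].count(c)
```

The checkpoints occupy O(n log(n) / b) space, and each rank query costs O(b), matching the tradeoff described above.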

In the FM-index, the bucket size is chosen to be b = log(n) in order to minimize storage (which becomes O(n)) without affecting runtime too much (O(m log(n) + log(n))). Using this structure, an FM-index representation of the whole human genome can occupy less than 3 GB (less than 1 byte per base), as opposed to about 12 GB for a suffix array or 60 GB for a suffix tree.

Also, the FM-index introduces an additional 'trick' that can reduce both memory and runtime at the same time. As mentioned earlier, the Burrows-Wheeler string is inherently more compressible. The FM-index uses this property to compress the string within each bucket, thereby further reducing the space requirements. Interestingly, depending on the exact compression technique used, operations such as finding the number of characters matching the pattern within a bucket can be performed more efficiently as well. For example, by using run-length encoding, the character counts can be computed directly within the compressed string. Since the runtime depends on bucket size, the computation can, thus, be sped up by compressing the buckets.
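For instance, counting characters directly in a run-length encoded bucket might look like this (a sketch with our own function names; the actual FM-index combines run-length encoding with move-to-front coding):

```python
def run_length_encode(s):
    """Compress a string into (character, run length) pairs."""
    runs = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([ch, 1])  # start a new run
    return runs

def count_char(runs, c):
    """Count occurrences of c without decompressing the bucket."""
    return sum(length for ch, length in runs if ch == c)

print(run_length_encode("ANNB$AA"))  # [['A', 1], ['N', 2], ['B', 1], ['$', 1], ['A', 2]]
print(count_char(run_length_encode("ANNB$AA"), "A"))  # prints 3
```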

Notes

  1. In the original FM-index paper, the authors used a combination of 'move to front' compression and run-length encoding, a combination that the authors argue is close to the information-theoretic lower bound for compression. The FM-index is, thus, a nice example of a compressed data structure that allows searches without requiring one to uncompress the data: imagine being able to search inside a .gz file. This line of research is likely to become 'hot' as the computational biology community tries to deal with the large amounts of data becoming available.

Exercises

  1. Remember that a suffix tree is a compressed representation of all suffixes in a string, such that each suffix is represented by a different leaf in the tree. The least common ancestor of two nodes in a tree is the lowest node shared by the paths from the two nodes to the root. Assume n is the least common ancestor of leaves i and j in a suffix tree for string S. What does this node represent?

  2. Construct the suffix tree for the string: ATATCCATCAT$ Please label each leaf with the corresponding suffix label.

  3. Write the pseudo-code for a recursive function that prints the suffixes (just the labels on the leaves) from a suffix tree in lexicographically sorted order.

  4. Describe how you would use a suffix tree to compute the sp(i) values for the KMP algorithm. How would the algorithm change if you tried to compute the sp'(i) values?

  5. What is the string corresponding to the Burrows-Wheeler Transform shown below?

    actgcag$ta
  6. For any pair of strings, we can compute the length of the longest prefix common to the pair in time linear in their total length. Suppose we are given k strings of total length n and want to compute the minimum length of all the pairwise longest common prefixes over all of the (k choose 2) pairs of strings, that is, the smallest of these pairwise values. The obvious way to solve this is to solve the longest common prefix problem for each of the pairs in O(k^2 + kn) time.

    1. Show how to solve this problem in O(n) time, independent of k, using a suffix tree.

    2. How would you solve the converse problem of finding the maximum length among all pairwise longest common prefixes?

  7. Given two strings (s1 and s2), design an efficient algorithm that identifies the shortest substring of s1 that is not a substring of s2.

  8. Is the largest value in the LCP array equal to the length of the longest repeat in a string? Please provide a proof supporting your answer.

References



One of the most 'famous' applications of the FM-index is the sequence aligner Bowtie (Langmead, Trapnell et al. 2009), a tool developed by Ben Langmead during his graduate studies in the Computer Science Department at the University of Maryland, College Park.

Burrows, M. and D. J. Wheeler (1994). A block-sorting lossless data compression algorithm. Digital Equipment Corporation. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.141.5254

Ferragina, P. and G. Manzini (2000). Opportunistic data structures with applications. 41st Annual Symposium on Foundations of Computer Science. http://doi.ieeecomputersociety.org/10.1109/SFCS.2000.892127

Langmead, B., C. Trapnell, M. Pop and S. L. Salzberg (2009). Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3): R25. http://www.ncbi.nlm.nih.gov/pubmed/19261174