Parallelizing sequence alignment

Today's computers (and even phones) contain multiple processors. Even within a single processor, parallelism is possible, e.g., operations can be executed simultaneously on different registers or on different parts of the same register. Taking advantage of this parallelism can make a major difference in the practical application of expensive algorithms such as those used in sequence alignment.

A full description of parallel algorithms is beyond the scope of this book; however, it is important to note two main issues:

  1. Any parallel algorithm can be simulated on a serial computer using the same total amount of computation (defined as runtime × number of processors). This should be obvious: the single processor can simply play the role of each of the parallel processors in turn, without incurring any additional computational time. As a result, parallelism cannot 'buy' us more than a speed-up proportional to the number of processors used.

  2. The speed-up (the ratio of the serial runtime to the parallel runtime) is limited by the amount of serial time spent coordinating between the $p$ processors used. For a serial runtime of $R$, the parallel runtime on $p$ processors satisfies $R(p) \geq R/p$, with equality only when there is no communication/coordination between the processors (see the worked example after this list).
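To make the second point concrete, here is a small worked example. Splitting the runtime into a serial fraction $s$ and a parallelizable remainder is our own illustrative assumption (this is essentially Amdahl's law), not something quantified above.

```latex
% If a fraction s of the serial runtime R cannot be parallelized
% (e.g., time spent coordinating between processors), then
\[
  R(p) = sR + \frac{(1-s)R}{p}, \qquad \text{speed-up} = \frac{R}{R(p)}.
\]
% Example: R = 100 hours, s = 0.1, p = 16 processors:
\[
  R(16) = 0.1 \cdot 100 + \frac{0.9 \cdot 100}{16} = 15.625 \text{ hours},
  \qquad \text{speed-up} = \frac{100}{15.625} = 6.4 \ll 16.
\]
```

Even a modest serial fraction thus keeps the speed-up well below the ideal factor of $p$.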

In other words, parallel algorithms are not a panacea: they can only reduce the runtime by a (possibly large) constant factor determined by the number of processors used. This constant improvement can, however, be quite important in practice, e.g., it could represent the difference between running a program for one month versus one day.

The dynamic programming alignment algorithm provides nice opportunities for parallelism, as its data dependencies are quite limited. Specifically, the three dynamic programming tables used for affine gap processing, $E$, $F$, and $G$, rely on information from the adjacent column, the same column, and the diagonally shifted adjacent column, respectively, as seen in the figure below. Entire groups of cells can, thus, be computed in a single operation, assuming the processor supports such parallel operations (most modern processors do), also called SIMD (single instruction, multiple data) instructions.

[Figure: a dynamic programming table for the alignment of two strings, with three highlighted columns labeled E, G, and F; horizontal, diagonal, and vertical arrows show the direction of information flow in each matrix.]
Transfer of information in the dynamic programming alignment algorithm. Within matrix E, data are transferred horizontally from each column to the adjacent one. Within matrix G, data are transferred diagonally to the adjacent column. Within matrix F, data are transferred down along the same column. Blocks of these matrices can be processed in parallel, speeding up execution.
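As a concrete illustration, the sketch below (in C with SSE2 intrinsics; the function and variable names are our own, and this is not Farrar's full striped algorithm) computes eight 16-bit cells of $E$ at once. This is possible because each cell of $E$ depends only on the previous column, so all cells of $E$ within a column are mutually independent.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* A minimal sketch, assuming the recurrence
 *   E[i][j] = max(E[i][j-1] - gap_extend, G[i][j-1] - gap_open).
 * prevE and prevG hold cells i..i+7 of column j-1 as 16-bit scores;
 * the result holds the corresponding eight cells of E in column j. */
static inline __m128i e_block(__m128i prevE, __m128i prevG,
                              short gap_open, short gap_extend)
{
    __m128i extend = _mm_subs_epi16(prevE, _mm_set1_epi16(gap_extend));
    __m128i open   = _mm_subs_epi16(prevG, _mm_set1_epi16(gap_open));
    return _mm_max_epi16(extend, open);  /* per-lane maximum, no branches */
}
```

The vertical dependency within $F$, where each cell depends on the cell above it in the same column, is the part that resists this simple treatment; handling it efficiently is a central contribution of the striped layout of Farrar (2007).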

Of course, things are not quite as easy as outlined here; implementing an efficient parallel algorithm requires a number of 'tricks' (see (Farrar 2007, Rognes 2011) for a more extensive description). First, note that when computing the table $G$, we must identify, for each character $S_1[i]$ in the query, the score $S(S_1[i], S_2[j])$ of aligning it to the corresponding character $S_2[j]$ in the reference. At first glance, this lookup cannot be performed in parallel. Given that the alphabet is limited in size (20 letters for amino acids, 4 for DNA), we can, however, precompute a table of size $O(m|\Sigma|)$ that records the substitution score at each position of the query for every possible reference character. This table is built once at the beginning of the algorithm, after which a single lookup identifies an entire block of scores in constant time.
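A minimal sketch of this precomputation is shown below (in C; the function name, the encoding of the query as integers 0-3, and the row-major layout of the table are our own illustrative choices).

```c
#include <stdint.h>
#include <stdlib.h>

/* Build the O(m * |Sigma|) query profile described above. query[] holds
 * the query pre-encoded as 0..3 (A,C,G,T); score[a][b] is the
 * substitution matrix. Row c of the returned table holds S(S1[i], c)
 * for every query position i, so when the algorithm later reaches
 * reference character c = S2[j], one lookup yields the whole block. */
int16_t *build_profile(const int8_t *query, size_t m,
                       const int16_t score[4][4])
{
    const size_t sigma = 4;             /* |Sigma| = 4 for DNA */
    int16_t *profile = malloc(sigma * m * sizeof *profile);
    if (!profile) return NULL;
    for (size_t c = 0; c < sigma; c++)  /* each possible reference char */
        for (size_t i = 0; i < m; i++)  /* each query position */
            profile[c * m + i] = score[query[i]][c];
    return profile;
}
```

Real implementations such as Farrar's arrange the profile to match their striped vector layout, so a block of scores can be loaded directly into a SIMD register.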

A second issue we encounter relates to the idiosyncrasies of the parallel architecture available to us: the size of the values stored in the dynamic programming table determines how many operations we can perform in parallel. If our processor can perform integer operations across 128 bits in parallel, we can process 16 cells at the same time as long as each dynamic programming score can be represented in 8 bits; if 16 bits are needed, the level of parallelism drops to 8 (see the sketch below).
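For instance, with SSE2 (a sketch; the function names are ours), a single instruction computes the per-cell maximum across all lanes of a 128-bit register:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* One 128-bit instruction updates 16 cells when scores fit in 8 bits,
 * but only 8 cells when 16 bits are needed. */
__m128i max16_u8(__m128i a, __m128i b)  /* 16 lanes of 8-bit scores */
{
    return _mm_max_epu8(a, b);          /* 16 maxima in one instruction */
}

__m128i max8_i16(__m128i a, __m128i b)  /* 8 lanes of 16-bit scores */
{
    return _mm_max_epi16(a, b);         /* 8 maxima in one instruction */
}
```

Note that SSE2 only provides an unsigned 8-bit maximum, which is one reason 8-bit implementations typically bias their scores to be non-negative.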

A third issue stems from the fact that our algorithm must repeatedly choose the largest of three possible values. Such decision branches can prevent the compiler from generating code that executes efficiently, offsetting some of the speed-up that we might otherwise achieve.
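One common remedy, sketched below under the same SSE2 assumptions as before (the function name is ours), is that SIMD maximum instructions make this three-way choice without any data-dependent branch at all:

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* The three-way maximum in the recurrence, written without branches:
 * the hardware selects each lane's largest value directly, so there is
 * no branch for the processor to mispredict. */
static inline __m128i max3_epi16(__m128i e, __m128i f, __m128i g)
{
    return _mm_max_epi16(e, _mm_max_epi16(f, g));
}
```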

References

Farrar, M. (2007). Striped Smith-Waterman speeds database searches six times over other SIMD implementations. Bioinformatics 23(2): 156-161. http://www.ncbi.nlm.nih.gov/pubmed/17110365

Rognes, T. (2011). Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinformatics 12: 221. http://www.ncbi.nlm.nih.gov/pubmed/21631914
