Pattern matching with multiple patterns: the Aho-Corasick algorithm

Introduction, keyword trees

So far we have managed to solve the exact matching problem efficiently when we only need to match a single pattern against a text. What happens, however, if we want to match multiple patterns against the same text? Can we do better than matching each pattern against the text separately? This situation occurs quite commonly, e.g., when mapping all the reads generated in a sequencing experiment against a genome.

Let us assume we have z patterns, each of length m, and we want to match them against a text of length n. The trivial algorithm leads to a runtime of O(z(n + m)), which can be prohibitive for large values of z (e.g., millions of sequences matched to a genome).

A first intuition behind the Aho-Corasick algorithm is the idea that we should exploit similarities between the patterns as much as possible, the same way we've exploited self-similarity within the patterns in KMP and Boyer-Moore. The data structure we'll use is called a keyword tree – just a fancy word to say we'll construct a tree structure that merges together patterns which have common prefixes (see below for patterns abcq, abcr, bcde, cd). Each edge in this tree is labeled with a single character, and every path from the root to a leaf spells the sequence of one of the patterns.

Drawing of a keyword tree - each path spells a different pattern. Starting from the root, the left path spells a, b, c, then splits into two separate paths labeled q and r. The second path from the root spells b, c, d, e. The third path from the root spells c, d.
A keyword tree storing strings abcq, abcr, bcde, cd. Each path from the root to a leaf spells one of the strings, with the common prefixes shared on the same path (e.g., abc shared between abcq and abcr).

This structure is just a (slightly) compressed way to store the patterns together. Having the shared prefixes along the same path in the tree should also allow us to simultaneously match (up to a point) multiple patterns, though in the worst case this ability alone does not allow us to improve on the naïve search of each pattern against the text.
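As an illustrative sketch (not part of the original text), the keyword tree can be built in a few lines of Python. Here each node is a plain dictionary mapping edge characters to child nodes, and a special `'$'` key marks a node where a pattern ends; the function name `build_keyword_tree` and this representation are our own choices.

```python
def build_keyword_tree(patterns):
    # Each node is a plain dict mapping an edge character to a child node;
    # the special key '$' marks a node where one of the patterns ends.
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            # Follow an existing edge where a prefix is shared, else add one.
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

tree = build_keyword_tree(["abcq", "abcr", "bcde", "cd"])
# "abcq" and "abcr" share the single path a-b-c before branching.
```

Because `setdefault` reuses an existing child when one is present, patterns with a common prefix automatically share the corresponding path in the tree.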

So, how would the matching process work with a keyword tree? We simply align the root of the tree to some position in the text, then follow the path indicated by the corresponding letters in the text until we either reach a leaf (i.e., one of the patterns matched) or find a mismatch. In the latter case we simply shift the position of the root in the text by one character and restart the alignment. This is clearly a worse algorithm than running KMP on each string independently, as I've just described a tree-based version of the quadratic-time naïve matching algorithm. The only "cool" thing about this approach is that it automatically picks the pattern that could match at each location, so at least we don't have to incur the additional cost of trying all the patterns. The runtime should, thus, be O(nm) irrespective of the number z of patterns.
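This naïve tree-based matching can be sketched as follows. The snippet takes a dict-of-dicts trie (one character per edge, with `'$'` marking a pattern end) as input; the function name and representation are illustrative.

```python
def naive_tree_match(text, tree):
    # tree is a dict-of-dicts trie: one character per edge, with the key '$'
    # marking a node where a pattern ends ('$' is assumed absent from the text).
    matches = []  # (start position in text, length of the matched pattern)
    for start in range(len(text)):
        node, i = tree, start
        while True:
            if '$' in node:
                matches.append((start, i - start))
            if i < len(text) and text[i] in node:
                node, i = node[text[i]], i + 1  # follow the matching edge
            else:
                break  # mismatch or end of text: shift the root by one
    return matches

# A tiny trie holding just the patterns "cd" and "ab":
tiny = {'c': {'d': {'$': True}}, 'a': {'b': {'$': True}}}
```

The outer loop over every start position is exactly what makes this quadratic: a mismatch deep in the tree throws away all the characters matched so far.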

The key insight behind speeding up the algorithm is the idea that we may be able to perform the KMP preprocessing approach on all the patterns simultaneously. To better understand this idea, think of simply following one of the paths in the tree until we hit a mismatch. For one of the patterns P along that path we already know how to compute the sp[i] values, and thus can shift the pattern (and implicitly the tree) to the position indicated by the corresponding sp value. Would this shift, however, be correct? By the definition of the sp values we know that, as far as pattern P is concerned, shifting by the amount indicated by the sp value will not miss any other matches of P. What we cannot guarantee is that we do not miss matches of another pattern Q in the set. What we are, thus, looking for is a way to extend the sp values to the entire set of patterns. Specifically, we will try to find the longest prefix of some pattern (longest path starting at the root) that matches a suffix of the path we have already matched. We will encode this information as links between nodes in the tree.

More formally, we will define a collection of failure links that, for every node N in the tree, connect N to another node M which has the following properties:

  • the path from the root to M spells a suffix of the path from the root to N;

  • M is the deepest such node (the longest suffix-prefix match in the tree).

If no such node exists we simply link N to the root of the tree. An example is shown in the figure below for the keyword tree we have presented earlier.

Drawing of a keyword tree - each path spells a different pattern. Starting from the root, the left path spells a, b, c, then splits into two separate paths labeled q and r. The second path from the root spells b, c, d, e. The third path from the root spells c, d.  A magenta edge connects the node labeled a,b,c to the node labeled b,c on the adjacent path. Two gray boxes highlight the shared suffix of the paths connected by the magenta edge - the string b,c.
Failure link in a keyword tree. The magenta arrow connects two nodes such that the suffix of the path to the source node matches the label of the target node. In this case, the shared suffix spells "b,c".

So how can we use this information to align the tree to the text? As in the naïve algorithm, we start with the root of the tree aligned to the beginning of the text and match the text and a path in the tree with each other while advancing a pointer in both of them. When we encounter a mismatch, we simply follow the failure link in the tree, and restart the matching process from the new location, without advancing the pointer to the current location in the text, the same way we have done in the KMP algorithm. The operation effectively moves the root of the tree down the text such that the prefix of the pattern(s) within the subtree rooted at the end of the failure link matches the text.

It might help to work through a few examples, but you should be able to see that we never need to 'rewind' within the text, as we can always guarantee that the path from the root to the current node in the tree already matches the text (due to the definition of the failure links).

The runtime of this algorithm (ignoring the time needed to compute the failure links) is simply O(n + m) (rather than O(n + zm)), following the same type of argument we used when leveraging the KMP sp values. Clearly, every time we match a character in the text we never examine that character again. We could, however, have many mismatches at any specific position in the text. The number of such mismatches can, however, be bounded by the total number of matches made so far, as follows. Every time we have a match we go deeper into the tree. When we find a mismatch we follow a failure link which, by definition, pushes us higher up in the tree. As we cannot go higher than the root of the tree, a mismatch will not cost us more than the number of matches we have already 'banked'.

Can the failure links, however, also be efficiently computed? The answer is yes – we can simply use essentially the same algorithm used to construct the KMP sp values.

To formalize the construction of the tree, here is an outline of the construction algorithm:

  1. Constructing the basic tree structure – this stage entails finding, for each pattern, the branch that represents its longest common prefix with some pattern already inserted in the tree. If the pattern is not fully contained in another pattern that is already in the tree, we also need to add a new edge and a subsequent path of nodes leading to a leaf that corresponds to the newly added pattern. For example, in the tree shown in the figures earlier in this chapter, if we assume string abcq was already in the tree, we insert string abcr by following the common prefix abc, then inserting a new edge labeled r.

  2. Adding the failure links – after the basic tree is constructed, we visit each tree node in breadth-first-search order and add a failure link to every node. First, a failure link pointing to the root is added to all the direct descendants of the root. Then, when visiting a node v, we check whether the node link(p(v)) has an outgoing edge labeled with the character on the edge into v. Here p(v) denotes the parent of v, and link(p(v)) denotes the node pointed to by the failure link of p(v). If such an edge exists, we simply set link(v) to the corresponding child of link(p(v)). If the letters do not match, we begin a series of link transitions (after visiting the node link(p(v)) we go to link(link(p(v))), and so on). After each transition, we compare the character following the node at which we have arrived to the character labeling the edge into node v. If a match is found at some node w reached along this path of failure links, we set link(v) to be the corresponding child of w. Otherwise, we continue the series until we reach the root, in which case the failure link from v will point to the root.
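The two construction steps above can be sketched in Python. This is a minimal illustration rather than a tuned implementation, and the names (`Node`, `build_automaton`) are our own; the failure link of each node is set in breadth-first order by walking up the existing links from its parent, exactly as described.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.fail = None    # failure link

def build_automaton(patterns):
    # Step 1: the basic keyword tree.
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())

    # Step 2: failure links, assigned in breadth-first order.
    queue = deque()
    for child in root.children.values():
        child.fail = root  # direct descendants of the root point to the root
        queue.append(child)
    while queue:
        v = queue.popleft()
        for ch, child in v.children.items():
            # v is child's parent: start from link(p(child)) = v.fail and walk
            # up failure links until some node has an outgoing edge labeled ch.
            w = v.fail
            while w is not root and ch not in w.children:
                w = w.fail
            # Take that node's ch-child, or the root if no such node exists.
            child.fail = w.children.get(ch, root)
            queue.append(child)
    return root
```

The breadth-first order matters: when a node's link is set, every shallower node already has its own link in place, so the walk up the links is always well defined.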

An alternate construction approach can perform both steps at the same time by adding the characters in the patterns to the trie one by one, effectively performing a breadth-first construction of the tree. Specifically, we first add the first character of each of the patterns, then the second, and so on, and also construct the failure links at the same time as we construct the corresponding nodes.

We can prove that this algorithm runs in linear time (O(zm), due to the size of the pattern tree) in a similar way to the traditional computation of the sp values in KMP. Specifically, we define the potential function to be the depth of the node we are currently visiting, and the actual cost of the operation as the number of links we have to follow in order to set the current node's link. Since every link transition decreases our depth by at least one, each link we follow decreases the potential by one. In this way it is easy to prove that the overall running time is indeed linear in the total length of the patterns.

Dealing with patterns contained in other patterns

One situation we have glossed over is that of a pattern that is a proper substring of another one. In the tree we used as an example above, see, for example, pattern cd, which is also found as a substring of pattern bcde. Following the algorithm outlined above blindly we might miss the match of the smaller pattern. For example, assume that we matched the string bcd and mismatched on character e. The failure link would bring us to the leaf labeled cd, yet our algorithm would not report a match. Of course, a simple solution is to report a match anytime we reach a leaf, whether by matching a character or by following a failure link.

This simple solution, however, can be very expensive, as we would have to follow the failure links from every node just to make sure we haven't missed any matches. To understand why, imagine that the full pattern bcde from our tree matches the text. Once we move past the node labeled bcd, we lose the opportunity to follow a link to the end of pattern cd, thus failing to determine that it also matched the text. To find that match we would have to follow the failure link from the node labeled bcd even though we didn't mismatch at that location. Such link transitions cannot be amortized against earlier matches and can, thus, lead to an inefficient runtime.

To address this performance issue, we will store a new set of links – output links – which connect a node to all the leaves that should be reported once we reach that node.

Question: Can the output links be computed efficiently? (You should be able to answer this question using a similar argument to the one used during the construction of failure links)

Now, why do output links help speed up the algorithm? At first glance, we still spend time following links to all the patterns that match. But that's exactly the point – we only follow output links if we output a match, so the work we do is justified. The resulting runtime is O(n + k), where k is the number of matches found during the execution, or O(n + mz + k) if we include the preprocessing time. This situation highlights another interesting parameter in evaluating the performance of algorithms – the dependency of the runtime on the size of the output. This is also another knob we can tune as we try to find the best tradeoff between efficiency and utility. Algorithms where the runtime depends on the size of the output are called output-sensitive.
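A sketch of the full automaton follows. For brevity, the output information is collapsed into a per-node list: everything reported at a node's failure target is copied into the node's own list, which plays the role of the output links. All names (`aho_corasick`, `search`, the `out` field) are illustrative, not a reference implementation.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.fail = None    # failure link
        self.out = []       # patterns to report when this node is reached

def aho_corasick(patterns):
    root = Node()
    for p in patterns:  # build the keyword tree
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.out.append(p)
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:  # failure links in breadth-first order
        v = queue.popleft()
        for ch, child in v.children.items():
            w = v.fail
            while w is not root and ch not in w.children:
                w = w.fail
            child.fail = w.children.get(ch, root)
            # Every pattern reported at the failure target is a suffix of the
            # path to this node, so it must be reported here as well.
            child.out += child.fail.out
            queue.append(child)
    return root

def search(text, root):
    matches = []  # (start position, pattern)
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail  # mismatch: follow failure links
        node = node.children.get(ch, root)
        for p in node.out:    # report every pattern ending here
            matches.append((i - len(p) + 1, p))
    return matches
```

Copying the failure target's list is the simplest way to mimic output links in a sketch; a production implementation would instead store a single output link per node, so that preprocessing stays linear even when many patterns share suffixes.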

Matching with wildcards

So far we have assumed that all characters in the pattern must match. We can also envisage a situation where we want to allow mismatches at specific locations in the pattern. For example, we might want to match the following DNA string, ACCATGNTAGGNCGANNTAGGC, where the character N can match any of the four nucleotides. The trivial solution to this problem involves generating all possible such strings (there are 4^4 = 256 of them, one for each way of filling in the four Ns) then reporting a match when any one of them matches. Of course we can use the Aho-Corasick algorithm for this process, however the size of the pattern set grows exponentially with the number of 'don't care' symbols we include.

A better approach involves simply breaking the pattern into a collection of subpatterns comprising all the strings between the N characters (in our case ACCATG, TAGG, CGA, TAGGC). We then use the Aho-Corasick algorithm to find all locations in the text where these subpatterns match. Matches of the whole original pattern will be easily recognizable from the spacing of the individual matches. For example, the individual patterns may match at positions 12, 19, 24, 29 in the example above.
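This subpattern idea can be sketched as follows. To keep the snippet short and self-contained, naive substring checks stand in for the single Aho-Corasick pass; the function name `wildcard_match` and its structure are our own.

```python
def wildcard_match(text, pattern, wildcard='N'):
    # Split the pattern into its solid subpatterns, remembering each offset.
    pieces, i = [], 0
    while i < len(pattern):
        if pattern[i] == wildcard:
            i += 1
            continue
        j = i
        while j < len(pattern) and pattern[j] != wildcard:
            j += 1
        pieces.append((i, pattern[i:j]))
        i = j
    # A position is a full match when every subpattern occurs at its offset.
    # (Naive substring checks stand in for a single Aho-Corasick pass here.)
    return [s for s in range(len(text) - len(pattern) + 1)
            if all(text.startswith(piece, s + off) for off, piece in pieces)]
```

For ACCATGNTAGGNCGANNTAGGC the pieces are ACCATG, TAGG, CGA, TAGGC at offsets 0, 7, 12, 17 – exactly the spacing (12, 19, 24, 29 in 1-based positions) described above for a match starting at position 12.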

Exercises

  1. Work out the details of the proof that the construction of the Aho-Corasick trie, including failure links, can be done in time proportional to the total size of the pattern set.

  2. Write the pseudo-code for constructing the output links and outline an argument why this construction is linear time.
