Pattern matching with multiple patterns: the Aho-Corasick algorithm

Introduction, keyword trees

So far we have managed to solve the exact matching problem efficiently when we only need to match a single pattern against a text. What happens, however, if we want to match multiple patterns against the same text? Can we do better than matching each pattern against the text separately? This situation occurs quite commonly, e.g., when mapping all the reads generated in a sequencing experiment against a genome.

Let us assume we have z patterns, each of length m, and we want to match them against a text of length n. The trivial algorithm leads to a runtime of O(z(n + m)), which can be prohibitive for large values of z (e.g., millions of sequences matched to a genome).

A first intuition behind the Aho-Corasick algorithm is the idea that we should exploit similarities between the patterns as much as possible, the same way we've exploited self-similarity within the patterns in KMP and Boyer-Moore. The data structure we'll use is called a keyword tree – just a fancy word to say we'll construct a tree structure that merges together patterns which have common prefixes (see below for patterns abcq, abcr, bcde, cd). Each edge in this tree is labeled with a single character, and every path from the root to a leaf spells the sequence of one of the patterns.

Drawing of a keyword tree - each path spells a different pattern. Starting from the root, the left path spells a, b, c, then splits into two separate paths labeled q and r. The second path from the root spells b, c, d, e. The third path from the root spells c, d.
A keyword tree storing strings abcq, abcr, bcde, cd. Each path from the root to a leaf spells one of the strings, with the common prefixes shared on the same path (e.g., abc shared between abcq and abcr).

This structure is just a (slightly) compressed way to store the patterns together. Having the shared prefixes along the same path in the tree should also allow us to simultaneously match (up to a point) multiple patterns, though in the worst case this ability alone does not allow us to improve on the naïve search of each pattern against the text.
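As an illustrative sketch (not part of the original text), the keyword tree can be built in a few lines of Python. Here each node is a plain dictionary mapping edge characters to child nodes, and a special `'$'` key marks a node where a pattern ends; the function name `build_keyword_tree` and this representation are our own choices.

```python
def build_keyword_tree(patterns):
    # Each node is a plain dict mapping an edge character to a child node;
    # the special key '$' marks a node where one of the patterns ends.
    root = {}
    for pattern in patterns:
        node = root
        for ch in pattern:
            # Follow an existing edge where a prefix is shared, else add one.
            node = node.setdefault(ch, {})
        node['$'] = True
    return root

tree = build_keyword_tree(["abcq", "abcr", "bcde", "cd"])
# "abcq" and "abcr" share the single path a-b-c before branching.
```

Because `setdefault` reuses an existing child when one is present, patterns with a common prefix automatically share the corresponding path in the tree.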

So, how would the matching process work with a keyword tree? We simply align the root of the tree to some position in the text, then follow the path indicated by the corresponding letters in the text until we either reach a leaf (i.e., one of the patterns matched) or find a mismatch. In the latter case we simply shift the position of the root in the text by one character and restart the alignment. This is clearly a worse algorithm than running KMP on each string independently, as I've just described a tree-based version of the quadratic-time naïve matching algorithm. The only "cool" thing about this approach is that it automatically picks the pattern that could match at each location, so at least we don't have to incur the additional cost of trying all the patterns. The runtime should, thus, be O(nm) irrespective of the number z of patterns.
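This naïve tree-based matching can be sketched as follows. The snippet takes a dict-of-dicts trie (one character per edge, with `'$'` marking a pattern end) as input; the function name and representation are illustrative.

```python
def naive_tree_match(text, tree):
    # tree is a dict-of-dicts trie: one character per edge, with the key '$'
    # marking a node where a pattern ends ('$' is assumed absent from the text).
    matches = []  # (start position in text, length of the matched pattern)
    for start in range(len(text)):
        node, i = tree, start
        while True:
            if '$' in node:
                matches.append((start, i - start))
            if i < len(text) and text[i] in node:
                node, i = node[text[i]], i + 1  # follow the matching edge
            else:
                break  # mismatch or end of text: shift the root by one
    return matches

# A tiny trie holding just the patterns "cd" and "ab":
tiny = {'c': {'d': {'$': True}}, 'a': {'b': {'$': True}}}
```

The outer loop over every start position is exactly what makes this quadratic: a mismatch deep in the tree throws away all the characters matched so far.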

The key insight behind speeding up the algorithm is the idea that we may be able to perform the KMP preprocessing approach on all the patterns simultaneously. To better understand this idea, think of simply following one of the paths in the tree until we hit a mismatch. For one of the patterns P along that path we already know how to compute the sp[i] values, and thus can shift the pattern (and implicitly the tree) to the position indicated by the corresponding sp value. Would this shift, however, be correct? By the definition of the sp values we know that, as far as pattern P is concerned, shifting by the amount indicated by the sp value will not miss any other matches of P. What we cannot guarantee is that we do not miss matches of another pattern Q in the set. What we are, thus, looking for is a way to extend the sp values to the entire set of patterns. Specifically, we will try to find the longest prefix of some pattern (longest path starting at the root) that matches a suffix of the path we have already matched. We will encode this information as links between nodes in the tree.

More formally, we will define a collection of failure links that, for every node N in the tree, connect N to another node M which has the following properties:

  • the path from the root to M spells a suffix of the path from the root to N;

  • M is the deepest such node (the longest suffix-prefix match in the tree).

If no such node exists we simply link N to the root of the tree. An example is shown in the figure below for the keyword tree we have presented earlier.

Drawing of a keyword tree - each path spells a different pattern. Starting from the root, the left path spells a, b, c, then splits into two separate paths labeled q and r. The second path from the root spells b, c, d, e. The third path from the root spells c, d.  A magenta edge connects the node labeled a,b,c to the node labeled b,c on the adjacent path. Two gray boxes highlight the shared suffix of the paths connected by the magenta edge - the string b,c.
Failure link in a keyword tree. The magenta arrow connects two nodes such that the suffix of the path to the source node matches the label of the target node. In this case, the shared suffix spells "b,c".

So how can we use this information to align the tree to the text? As in the naïve algorithm, we start with the root of the tree aligned to the beginning of the text and match the text and a path in the tree with each other while advancing a pointer in both of them. When we encounter a mismatch, we simply follow the failure link in the tree, and restart the matching process from the new location, without advancing the pointer to the current location in the text, the same way we have done in the KMP algorithm. The operation effectively moves the root of the tree down the text such that the prefix of the pattern(s) within the subtree rooted at the end of the failure link matches the text.

It might help to work through a few examples, but you should be able to see that we never need to 'rewind' within the text, as we can always guarantee that the path from the root to the current node in the tree already matches the text (due to the definition of the failure links).

The runtime of this algorithm (ignoring the time needed to compute the failure links) is simply O(n + m) (rather than O(n + zm)), following the same type of argument we used when leveraging the KMP sp values. Clearly, every time we match a character in the text we never examine that character again. We could, however, have many mismatches at any specific position in the text. The number of such mismatches can, however, be bounded by the total number of matches made so far, as follows. Every time we have a match we go deeper into the tree. When we find a mismatch we follow a failure link which, by definition, pushes us higher up in the tree. As we cannot go higher than the root of the tree, a mismatch will not cost us more than the number of matches we have already 'banked'.

Can the failure links, however, also be efficiently computed? The answer is yes – we can simply use essentially the same algorithm used to construct the KMP sp values.

To formalize the construction of the tree, here is an outline of the construction algorithm:

  1. Constructing the basic tree structure – this stage entails finding, for each pattern, the branch that represents its longest common prefix with some pattern already inserted in the tree. If the pattern is not fully contained in another pattern that is already in the tree, we also need to add a new edge and a subsequent path of nodes leading to a leaf that corresponds to the newly added pattern. For example, in the tree shown in the figures earlier in this chapter, if we assume string abcq was already in the tree, we insert string abcr by following the common prefix abc, then inserting a new edge labeled r.

  2. Adding the failure links – after the basic tree is constructed, we visit each tree node in breadth-first-search order and add a failure link to every node. First, a failure link pointing to the root is added to all the direct descendants of the root. Then, when visiting a node v, we check whether the node link(p(v)) has an outgoing edge labeled with the character on the edge into v. Here p(v) denotes the parent of v, and link(p(v)) denotes the node pointed to by the failure link of p(v). If such an edge exists, we simply set link(v) to the corresponding child of link(p(v)). If the letters do not match, we begin a series of link transitions (after visiting the node link(p(v)) we go to link(link(p(v))), and so on). After each transition, we compare the character following the node at which we have arrived to the character labeling the edge into node v. If a match is found at some node w reached along this path of failure links, we set link(v) to be the corresponding child of w. Otherwise, we continue the series until we reach the root, in which case the failure link from v will point to the root.
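The two construction steps above can be sketched in Python. This is a minimal illustration rather than a tuned implementation, and the names (`Node`, `build_automaton`) are our own; the failure link of each node is set in breadth-first order by walking up the existing links from its parent, exactly as described.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.fail = None    # failure link

def build_automaton(patterns):
    # Step 1: the basic keyword tree.
    root = Node()
    for p in patterns:
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())

    # Step 2: failure links, assigned in breadth-first order.
    queue = deque()
    for child in root.children.values():
        child.fail = root  # direct descendants of the root point to the root
        queue.append(child)
    while queue:
        v = queue.popleft()
        for ch, child in v.children.items():
            # v is child's parent: start from link(p(child)) = v.fail and walk
            # up failure links until some node has an outgoing edge labeled ch.
            w = v.fail
            while w is not root and ch not in w.children:
                w = w.fail
            # Take that node's ch-child, or the root if no such node exists.
            child.fail = w.children.get(ch, root)
            queue.append(child)
    return root
```

The breadth-first order matters: when a node's link is set, every shallower node already has its own link in place, so the walk up the links is always well defined.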

An alternate construction approach can perform both steps at the same time by adding the characters in the patterns to the trie one by one, effectively performing a breadth-first construction of the tree. Specifically, we first add the first character of each of the patterns, then the second, and so on, and also construct the failure links at the same time as we construct the corresponding nodes.

We can prove that this algorithm runs in linear time (O(zm), due to the size of the pattern tree) in a similar way to the traditional computation of the sp values in KMP. Specifically, we define the potential function to be the depth of the node we are currently visiting, and the actual cost of the operation as the number of links we have to follow in order to set the current node's link. Since every link transition decreases our depth by at least one, each link we follow decreases the potential by one. In this way it is easy to prove that the overall running time is indeed linear in the total length of the patterns.

Dealing with patterns contained in other patterns

One situation we have glossed over is that of a pattern that is a proper substring of another one. In the tree we used as an example above, see, for example, pattern cd, which is also found as a substring of pattern bcde. Following the algorithm outlined above blindly we might miss the match of the smaller pattern. For example, assume that we matched the string bcd and mismatched on character e. The failure link would bring us to the leaf labeled cd, yet our algorithm would not report a match. Of course, a simple solution is to report a match anytime we reach a leaf, whether by matching a character or by following a failure link.

This simple solution, however, can be very expensive, as we would have to follow the failure links from every node just to make sure we haven't missed any matches. To understand why, imagine that the full pattern bcde from our tree matches the text. Once we move past the node labeled bcd, we lose the opportunity to follow a link to the end of pattern cd, thus failing to determine that it also matched the text. To find that match we would have to follow the failure link from the node labeled bcd even though we didn't mismatch at that location. Such link transitions cannot be amortized against earlier matches and can, thus, lead to an inefficient runtime.

To address this performance issue, we will store a new set of links – output links – which connect a node to all the leaves that should be reported once we reach that node.

Question: Can the output links be computed efficiently? (You should be able to answer this question using a similar argument to the one used during the construction of failure links)

Now, why do output links help speed up the algorithm? At first glance, we still spend time following links to all the patterns that match. But that's exactly the point – we only follow output links if we output a match, so the work we do is justified. The resulting runtime is O(n + k), where k is the number of matches found during the execution, or O(n + mz + k) if we include the preprocessing time. This situation highlights another interesting parameter in evaluating the performance of algorithms – the dependency of the runtime on the size of the output. This is also another knob we can tune as we try to find the best tradeoff between efficiency and utility. Algorithms where the runtime depends on the size of the output are called output-sensitive.
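A sketch of the full automaton follows. For brevity, the output information is collapsed into a per-node list: everything reported at a node's failure target is copied into the node's own list, which plays the role of the output links. All names (`aho_corasick`, `search`, the `out` field) are illustrative, not a reference implementation.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}  # edge character -> child Node
        self.fail = None    # failure link
        self.out = []       # patterns to report when this node is reached

def aho_corasick(patterns):
    root = Node()
    for p in patterns:  # build the keyword tree
        node = root
        for ch in p:
            node = node.children.setdefault(ch, Node())
        node.out.append(p)
    queue = deque()
    for child in root.children.values():
        child.fail = root
        queue.append(child)
    while queue:  # failure links in breadth-first order
        v = queue.popleft()
        for ch, child in v.children.items():
            w = v.fail
            while w is not root and ch not in w.children:
                w = w.fail
            child.fail = w.children.get(ch, root)
            # Every pattern reported at the failure target is a suffix of the
            # path to this node, so it must be reported here as well.
            child.out += child.fail.out
            queue.append(child)
    return root

def search(text, root):
    matches = []  # (start position, pattern)
    node = root
    for i, ch in enumerate(text):
        while node is not root and ch not in node.children:
            node = node.fail  # mismatch: follow failure links
        node = node.children.get(ch, root)
        for p in node.out:    # report every pattern ending here
            matches.append((i - len(p) + 1, p))
    return matches
```

Copying the failure target's list is the simplest way to mimic output links in a sketch; a production implementation would instead store a single output link per node, so that preprocessing stays linear even when many patterns share suffixes.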

Matching with wildcards

So far we have assumed that all characters in the pattern must match. We can also envisage a situation where we want to allow mismatches at specific locations in the pattern. For example, we might want to match the following DNA string, ACCATGNTAGGNCGANNTAGGC, where the character N can match any of the four nucleotides. The trivial solution to this problem involves generating all possible such strings (there are 4^4 = 256 of them, one for each way of filling in the four Ns) then reporting a match when any one of them matches. Of course we can use the Aho-Corasick algorithm for this process, however the size of the pattern set grows exponentially with the number of 'don't care' symbols we include.

A better approach involves simply breaking the pattern into a collection of subpatterns comprising all the strings between the N characters (in our case ACCATG, TAGG, CGA, TAGGC). We then use the Aho-Corasick algorithm to find all locations in the text where these subpatterns match. Matches of the whole original pattern will be easily recognizable from the spacing of the individual matches. For example, the individual patterns may match at positions 12, 19, 24, 29 in the example above.
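This subpattern idea can be sketched as follows. To keep the snippet short and self-contained, naive substring checks stand in for the single Aho-Corasick pass; the function name `wildcard_match` and its structure are our own.

```python
def wildcard_match(text, pattern, wildcard='N'):
    # Split the pattern into its solid subpatterns, remembering each offset.
    pieces, i = [], 0
    while i < len(pattern):
        if pattern[i] == wildcard:
            i += 1
            continue
        j = i
        while j < len(pattern) and pattern[j] != wildcard:
            j += 1
        pieces.append((i, pattern[i:j]))
        i = j
    # A position is a full match when every subpattern occurs at its offset.
    # (Naive substring checks stand in for a single Aho-Corasick pass here.)
    return [s for s in range(len(text) - len(pattern) + 1)
            if all(text.startswith(piece, s + off) for off, piece in pieces)]
```

For ACCATGNTAGGNCGANNTAGGC the pieces are ACCATG, TAGG, CGA, TAGGC at offsets 0, 7, 12, 17 – exactly the spacing (12, 19, 24, 29 in 1-based positions) described above for a match starting at position 12.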

Exercises

  1. Work out the details of the proof that the construction of the Aho-Corasick trie, including failure links, can be done in time proportional to the total size of the pattern set.

  2. Write the pseudo-code for constructing the output links and outline an argument why this construction is linear time.
