Looking for frequent k-mers
Last updated
Last updated
While you may be try to count DNA letters, you are unlikely to get much useful information. Unlike English messages, messages written in DNA can be quite boring since you only have a 4 letter alphabet. If you count the numbers of As, Cs, Gs, and Ts in a genome you will, however, make some interesting observation. First of all, the complementary bases will occur in roughly the same amount. That is, you will see about the same number of As as Ts, and about the same number of Cs as Gs. At a very high level this makes some sense since these are the letters that pair up in a double-stranded molecule. However, you are looking here at a single strand, so pairing doesn't play a role. You may struggle for a bit to figure out what mechanism leads to this observation, but I'll save you a bit of time: nobody really knows what biological process ensures that each strand of the DNA has equal numbers of the paired nucleotides.
So, counting letters yields some interesting observations, but not much that can be used to figure out the location of the origin of replication.
I mentioned earlier that DNA is "boring" compared to English because there are a lot fewer letters you can use. What if you formed "words" out of DNA? Since we do not understand the language of DNA we won't know where words start and end, but perhaps we can come up with an approximation. A common way to do this is to assume all words have the same length k. We call each such "word" a k-mer.
While we only have a small alphabet, the number of k-mers is much larger. Using the 4 DNA letters, we can form different words. It is now unlikely that these words will all occur with the same frequency in the genome, and perhaps you will observe some interesting things.
But before we proceed to what may be interesting, let us spend a bit of time on figuring out how to count k-mers. First, some notations. Let G denote the genome of interest, and k the length of a k-mer. The pseudo-code of an algorithm we may use looks like:
First, a few quick observations. The pseudo-code is not exactly correct, and you have to think a bit more carefully about which positions in G can be the start of k-mers. Do you go all the way to the end of G or do you stop earlier? Second, the actual function for finding a substring differs between programming languages, and depending on the language that you use, you may have to specify a start and an end, or a start and a length. Also, you may have to worry about whether the substring includes the letter corresponding to the first (or last) coordinate, or if it starts right after or right before the coordinate passed to the function. Figuring all of this out will require you to write a short bit of code and work through a few short examples to make sure you are getting the substrings that you want.
The observation that is more relevant for this section, however, is that in many programming languages you are simply not allowed to use a string as the key/index in a dictionary. The pseudo-code segment count[kmer]
may trigger an error. Even if strings are allowed as keys for dictionaries, using string keys may result in inefficient code. To address this issue, computational biologists usually encode DNA into integers as described below.
First, a quick reminder that computers "think" in binary code, and every information stored in a computer ultimately can be represented as a string of 0s and 1s. The number of bits necessary to represent each of the members of some set S of objects (say, all lowercase letters in the English language) is simply . For example, to store each of the 26 letters in the English alphabet we need . Once you add upper case and lower case letters as well as punctuation, we end up needing 7 bits of information which is exactly what the ASCII code uses, representing 128 possible "code points" or symbols. DNA is, of course, much simpler, and we can represent the DNA alphabet with just 2 bits per letter () .
By convention, we encode the letters as follows:
A
00
C
01
G
10
T
11
To convert a k-mer to an integer, we simply change each of the letters into the corresponding bits. A convention we use here is that the k-mers and bits are read just as we read numbers, with the most significant bit on the left side, however this is just a convention. One can use other encodings, as long as they are used consistently.
An example is in order. Let us convert the 4-mer GAGT:
If you don't remember how to convert from binary to decimal, here's a quick refresher:
While encoding the string didn't require us to know the value of k, to "decode" a number that we know represents a k-mer, we do need to know this value, otherwise we could interpret any string of 0s as a string of half as many As (since the encoding of A is 00), i.e., the number 0 could encode any arbitrarily long string of As.
Thus, let's assume that we want to find out which 4-mer is encoded by the number 89, we convert it to binary, making sure we retain 8 bits (2 for each of the letters in the 4-mer): 01011001, then we "read" this string in blocks of two, referring to the table above.
Thus, the 4-mer encoded in the number 89 is CCGC.
At this point I encourage you to play around with various k-mers of various lengths as well as with various integers, and convert back and forth between k-mers and numbers. Also, try to write code to perform this conversion, and think about how you can most efficiently write this code. Most programming languages have bit level operators that can help you tremendously. As you write such code, you will also realize that the value of k determines how easy it is to code up a conversion algorithm. The types of integers available to you are usually capped at 32 or 64 bits, meaning that k-mers longer than 16 or 32 DNA letters will require you to "spread" a k-mer across multiple integers. Thankfully, for many applications, short k-mers suffice.
With this ability to convert DNA strings into integers, you can now re-write the pseudo-code as follows:
The function encode
performs the encoding of the string into an integer, which is then use as an index into the dictionary (or array) count
. If you want to print out the counts for the k-mers you simply use a new function decode(n, k)
which converts the integer n into a k-mer of length k. The pseudo-code may look like something like this:
ADVANCED ASIDE: Here we described a fairly straight-forward way of converting a DNA string to a number. This is a type of "hashing", and there's no reason why we couldn't use one of the many available hashing schemes to convert k-mers into integers. Importantly, though, our default hashing function is a bijection, which means that there's a unique correspondence between numbers and k-mers. Many hash functions allow "collisions", i.e., distinct k-mers that hash to the same value. In such cases, we can no longer uniquely determine which k-mer corresponds to a particular hash value. In certain applications, this is not an issue, and using a hash function that allows collisions can help reduce the number of bits needed to represent the k-mers, leading to memory or runtime efficiencies.
EXERCISE: Implement code that counts the k-mers in a genome by encoding and decoding them into integers. Play with different data structures for storing the "dictionary" that maps k-mers to their count and see how these choices, as well as the value of k, impact the runtime or memory usage of your code.
Now that we know how to count k-mers, we need to think about what we are looking for. We don't know what the sequence of oriC looks like, but we know it must be special in some way. We have settled on looking for unusually frequent k-mers, but what does unusual really mean? We are encountering here a paradigm that is quite common in exploratory data analysis (i.e., analyzing data to find something we don't know how to define). "Unusual" is typically defined as how different some observation is from our expectation. You have encountered this concept in statistics, where you computed how far you were observations were from the "null hypothesis".
If you go through the exercise of coding up the k-mer counting process and then look in the genome for k-mers that seem to be unusually frequent, you will likely find out that such k-mers occur all over the place. The genome contains many "unusual" regions, which should be no surprise as replication is only one of the processes performed by a cell. We need to narrow down the search to the unusual patterns that look more like the origin of replication, but how?
A first observation is that the origin of replication occurs at the same location in both of the strands of DNA. That means that the sequence of oriC must be unusual irrespective of whether you "read" it in the forward or reverse direction. By reverse, I'm referring to the reverse complement of the sequence represented by the second strand of DNA. For a k-mer, we refer to its reverse complement as its reciprocal counter-part. Thus, you can now modify your counting code to look for k-mers where both the k-mer itself and it's reverse complement are unusually frequent. Alternatively, you could simply add up the counts of each k-mer and of its reverse complement and look for unusually high counts. When combining the counts of a k-mer and its reverse complement, we usually store the information at the index, or under the key, representing the lexicographically smaller of the two. The counting code, thus, becomes something like this:
EXERCISE: Implement code that computes the reverse complement of a DNA string, and use this code to build a k-mer counter that records the counts for canonical k-mers (a combination of the counts for the forward and reverse versions of each k-mer).
Just a quick recap. So far we have tried to find oriC by finding frequent k-mers, but realized that too many k-mers in too many regions of the genome are flagged as unusually frequent. We then conjectured that oriC may be found by looking for k-mers that that are frequent not only in the forward orientation, but also in their reverse complement. If you've actually implemented this approach and tried it out, you probably have found out that it also doesn't work. Frequent k-mers, even if counting their reciprocal pair, appear in a lot of parts of the genome, again not narrowing down a potential location for the origin of replication. So we need to further tighten the parameters for what we are looking for.
Again, we go back to what we know about the origin of replication. It is a single location in the genome that is somehow unusual in comparison to other regions of the genome. Since cells need to quickly identify the origin of replication whenever they need to divide, we can assume that the signals hidden in this region are very strong. Perhaps the keyword region is what we were looking for. So far, when counting k-mers, we focused on their count irrespective of where they were located. What if we look for k-mers that occur frequently within a region of the genome. More precisely, we can define a clump of occurrences of a k-mer as the fact that a particular k-mer occurs frequently within a region of length L of the genome. We can formulate this as the clump finding problem: Find all k-mers of length k that occur more than t times within a window of length L in the genome.
You can define clumps both in terms of normal k-mers and in terms of k-mers and their reciprocal pair. Irrespective of the choice you make, you will now need to scan the genome with a window of length L, and keep track, for each k-mer, of the maximum number of times the k-mer occurs within one such window.
EXERCISE: Implement the clump finding problem. Try to think about ways of structuring the algorithm that allow it to be fast. If you count all k-mers in a window, then re-count them once you slide the window down the genome you'll waste a lot of computation. Bear in mind that only a few k-mers change as you slide the window down.
So, what is our null hypothesis here? Absent any additional information, we can say that each letter of DNA occurs with roughly the same frequency, meaning that each k-mer is equally likely to occur in the genome. Thus, we'd expect each k-mer to occur times since there are possible k-mers but only roughly len(G) k-mers that "fit" within the genome G. I say "roughly" here as the exact number does not matter much, but please think about exactly how may k-mers can exist in a genome of length L = len(G).
To come back to our original problem, we can count all k-mers in the genome and flag any k-mer that is more frequent than since such k-mers are surprisingly frequent given our null hypothesis. Here I'm a bit fast and loose, but you can come up with a statistically meaningful way to define not just the expected number of occurrences of each k-mer but also to better define what an "outlier" may look like. The value is simply the mean of the expected distribution of k-mer counts, and the "unusual" k-mers are those that occur in the tails of that distribution. This level of detail is irrelevant at this point, though, and even in practice, you may empirically just look for k-mers that "look" unusual, for example, because they are among the most frequent in the genome.