Introduction to pattern discovery
Last updated
When first looking at biological data, it is easy to be discouraged. How can you write an algorithm to analyze the data without knowing exactly what you are looking for? As a computer scientist, you may know very little about biology. But even for biologists, most of the secrets hidden in biological sequences remain unknown. This was particularly true in the early days of genomics, when the first DNA sequences were decoded. Yet this scientific era led to many discoveries in biology and in computer science. Much of our current understanding of biological systems was derived in part through the use of computational tools developed by scientists who did not know exactly what they were looking for. How did these scientists translate the problems posed by biologists, which were imprecisely stated (at least from a computer scientist's point of view), into precise computational formulations for which algorithms could be developed? This chapter introduces some paradigms that are fairly common in computational biology and in data science more generally.
First, let's start with a basic biological question: "Where is the origin of replication in a bacterial genome?" As a computer scientist, you are probably already lost, so a bit of background (that you would normally get from your biologist colleague or by searching the internet) is in order.
Bacterial genomes are typically circular, which means that each of the sister strands of DNA forms a circle, wrapped around the other sister strand as seen below. Note that the strands have opposite directions (in terms of the 5' to 3' orientation described in the introduction to biology chapter), something that will be important later in the chapter.
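The opposite orientation of the two strands can be made concrete with a small sketch. It assumes the DNA is stored as a Python string over the letters A, C, G, and T, written in the 5' to 3' direction; the function name `reverse_complement` is chosen here for illustration and does not come from the chapter.

```python
# Map each nucleotide to its pairing partner (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, written in its own 5' to 3' direction.

    Assumes `strand` is an uppercase string over A, C, G, T.
    """
    # Complement every letter, then reverse, because the opposite strand
    # runs in the opposite direction.
    return strand.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("ATGCC"))  # GGCAT
```

Applying the function twice returns the original string, which mirrors the fact that each strand is the complement of the other.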
When a cell starts dividing, it needs to make a copy of its DNA. To do so, as discussed in the chapter introducing basic biological concepts, the cell must separate the two sister strands and then synthesize, for each, a new strand that is complementary to the strand being copied. In bacteria, this process starts at a fixed location in the genome, a region called the origin of replication, also known as oriC. Replication proceeds from oriC in the form of a "replication bubble" that separates the strands and, at the same time, starts synthesizing the corresponding copies. The "forks" at the ends of the bubble can be seen as traveling along the circular chromosome in opposite directions. By the time the forks meet on the opposite side of the chromosome, the cell will have constructed two separate copies of its chromosome, as shown below.
Now that you understand the process a bit better, the biological question posed above should make more sense. You are given a string of letters that represents the DNA sequence of the genome of a bacterium, and you are asked to find the particular region within it that represents the origin of replication. Still, you are not much closer to a problem that you can solve computationally. If you do not know what the sequence of oriC looks like, how can you find it in a string of DNA?
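Because the genome is circular but a string has a beginning and an end, a region of interest may wrap around the end of the string. The following is a minimal sketch of one way to handle this, assuming the genome is stored as a plain Python string; the helper name `circular_window` and its parameters are hypothetical and introduced only for illustration.

```python
def circular_window(genome: str, start: int, length: int) -> str:
    """Return `length` letters of a circular genome, beginning at `start`.

    The window wraps around the end of the string when necessary.
    Assumes 0 <= length <= len(genome) and a non-empty genome.
    """
    n = len(genome)
    if n == 0:
        raise ValueError("genome must not be empty")
    if not 0 <= length <= n:
        raise ValueError("length must be between 0 and len(genome)")

    start %= n  # allow any integer start position on the circle
    end = start + length
    if end <= n:
        return genome[start:end]
    # The window runs past the end of the string: continue from the start.
    return genome[start:] + genome[: end - n]


if __name__ == "__main__":
    print(circular_window("ACGTTG", 4, 4))  # TGAC
```

With a helper like this, any later analysis can examine every region of the chromosome, including those that span the arbitrary point where the string happens to be cut.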
At this point, you need to think a bit more about what may give you some hints about what you are looking for. You know that there's only one such sequence in the genome. You also know that this sequence is somewhat special. After all, the bacterium can easily find it when trying to replicate its genome. You are looking for some kind of hidden message that the bacterium knows how to find in the genome. If you phrase it like this, you may be reminded of various puzzles you have solved in the past, where the key was to match the frequency of letters in the coded message with the frequency of letters in the English language. Now you have something you can actually code up.
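As a first step in that direction, here is a minimal sketch of frequency counting on a DNA string. It counts single letters by default and, as an assumption that goes slightly beyond the letter-frequency analogy above, also allows counting short substrings of length k. The function names `kmer_counts` and `relative_frequencies` and the example string are illustrative, not taken from the chapter.

```python
from collections import Counter


def kmer_counts(text: str, k: int = 1) -> Counter:
    """Count every substring of length k in `text`.

    With k = 1 this is a plain letter-frequency count.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Slide a window of width k across the text, one position at a time.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


def relative_frequencies(counts: Counter) -> dict:
    """Convert raw counts into fractions of the total."""
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


if __name__ == "__main__":
    dna = "ATGATCAAG"  # made-up example sequence
    letters = kmer_counts(dna)
    print(letters)                        # A: 4, T: 2, G: 2, C: 1
    print(relative_frequencies(letters))
    print(kmer_counts(dna, k=2).most_common(1))  # [('AT', 2)]
```

Whether single letters or longer substrings carry the more useful signal is exactly the kind of question that the imprecise biological problem leaves open, which is why the sketch keeps k as a parameter.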