Introduction to pattern discovery

Introduction

When first looking at biological data, it is easy to be discouraged. How can one write an algorithm to analyze the data without knowing exactly what we are looking for? As a computer scientist, you may know very little about biology. But even for biologists, most of the secrets hidden in biological sequences remain unknown. This was particularly true in the early days of genomics when the first DNA sequences were decoded. Yet, this scientific era led to many discoveries in biology AND in computer science. Much of our current understanding of biological system was derived in part through the use of computational tools developed by scientists who did not exactly know what they were looking for. How did these scientists translate the imprecisely stated (at least to computer scientists) problems posed by biologists into precise computational formulations for which algorithms could be developed? This chapter aims to introduce some paradigms that are fairly common in computational biology and data science more generally.

Biological application 1: Finding the origin of replication in bacterial genomes

First, let's start with a basic biological question: "Where is the origin of replication in a bacterial genome?". As a computer scientist, you are probably already lost, so a bit of background (that you would normally get from your biologist colleague or by searching the internet) is in order.

Bacterial replication

Bacterial genomes are typically circular, which means that each of the sister strands of DNA forms a circle, wrapped around the other sister strand as seen below. Note that the strands have opposite directions (in terms of the 5' to 3' orientation described in the introduction to biology chapter), something that will be important later in the chapter.

Two concentric circles representing a double-stranded bacterial chromosome. The origin of replication (OriC) is highlighted at the top, and two arrows in opposite directions highlight the 5' to 3' orientation for each strand

When a cell starts dividing, it needs to make a copy of its DNA. To do so, as discussed in the chapter introducing basic biological concepts, the cell must separate out the two sister strands and then synthesize, for each, a new strand that is complementary to the strand being copied. In bacteria, this process starts at a fixed location in the genome, a region of the genome named the origin of replication which is also known as oriC. The replication proceeds from oriC in the form of a "replication bubble" that separates out the strands and, at the same time, starts synthesizing the corresponding copies. The "forks" at the end of the bubble can be seen as traveling along the circular chromosome in opposite directions. By the time the forks meet at the other end of the chromosome, the cell will have constructed two separate copies of its chromosome, as shown below.

A three-panel figure representing different stages of DNA replication. Left panel: early stage of DNA replication. A replication "bubble" starts at the origin of replication as the two original strands (shown as black circles) are separated, and complementary DNA (shown as magenta arcs) is synthesized at both replication forks for each strand of the original DNA. Middle panel: middle stage of DNA replication, which demonstrates how replication forks travel along the original circular chromosome in opposite directions. Right panel: the end of DNA replication. Both of the original strands have a complete complementary DNA strand, now represented as magenta circles.

What are we looking for?

Now that you understand the process a bit better, the biological question posed above should make a bit more sense. You are given a string of letters that represents the DNA sequence of the genome of a bacterium, and you are being asked to find a particular region within it that represents the origin of replication. Still, however, you are not much closer to a problem that you can solve computationally. If you do not know what the sequence of oriC looks like, how can you find it in a string of DNA?

At this point, you need to think a bit more about what may give you some hints about what you are looking for. You know that there's only one such sequence in the genome. You also know that this sequence is somewhat special. After all, the bacterium can easily find it when trying to replicate its genome. You are looking for some kind of hidden message that the bacterium knows how to find in the genome. If you phrase it like this, you may be reminded of various puzzles you have solved in the past, where the key was to match the frequency of letters in the coded message with the frequency of letters in the English language. Now you have something you can actually code up.

PreviousEthical considerations NextLooking for frequent k-mers

Last updated 6 months ago