Leveraging biology
Last updated
Last updated
Up to now, the algorithms we have used have relied on only a very high level understanding of biology: DNA is double-stranded; and some "message" may be hidden within to indicate where the origin of replication is located. It's possible we could use the clump finding algorithm to find the location of oriC, however it is more plausible that we will find many regions of the genome that contain "unusual" clumps of k-mers. After all, there are likely many other hidden signals inside a genome. It is time to learn a bit more about the process of replication and see if such knowledge can help us.
At the beginning of this chapter, we did not go into many details about exactly how DNA replication works. We talked about the replication "bubble" that grows from the origin of replication and moves away from it until the full genome is replicated, however we did not discuss how the replication itself works. The figure below zooms into the replication bubble and highlights the asymmetric nature of the replication process.
As a quick reminder, the DNA has a natural orientation from 5' to 3'. This orientation is defined by the direction in which the DNA is replicated. New letters are added to a DNA string at the 3' end of the new strand, hence the DNA can only "grow" in that direction. In the figure above, you can see the replication bubble formed between the two strands (black lines). The orientation of the DNA strands is highlighted by arrows: the two complementary strands occur in opposite directions. As the replication proceeds in both directions starting with the origin of replication (the short line in the middle of the strands), on the right side of the figure (labeled "leading strand"), the new strand of DNA (highlighted in magenta) can be synthesized as soon as the two original strands are unwound. Thus, the DNA is never single stranded on this side of the chromosome. On the left side (labeled "lagging strand"), however, the DNA is unwound in the "wrong" direction. The nucleotide complementary to the newly unwound nucleotide occurs on the 5' end of already synthesized DNA, yet new letters can only be added at the 3' end. As a result, the DNA must be allowed to unwind for a longer stretch, and only once enough DNA has been unwound (typically between 1000-2000 nucleotides) can the replication proceed. Thus, on the lagging strand, the DNA is replicated in spurts, and the newly synthesized strand is formed of a collection of fragments called Okazaki fragments. These fragments are later glued together into a continuous strand by a special enzyme called a ligase.
A side-effect of this asymmetric process is that the section of DNA close to the replication fork remains single stranded until there is enough room to create an Okazaki fragment. Single stranded DNA is more fragile because it is not protected by the complementary strand, increasing the risk of mutations. One particular mutation is due to a biochemical process called deamination, specifically, the removal of an amino group (NH2) from the nucleotide cytosine. After deamination, the cytosine becomes uracil, a nucleotide you may remember as the letter U within RNA molecules. Since uracil is not a proper DNA nucleotide, error correction mechanisms in the cell convert it to thymine (the letter T).
In short, in single stranded DNA, the following string of mutations can occur:
cytosine (C) –> [deamination] –> uracil (U) –> [error correction] –> thymine (T)
Sidebar. This mutation is quite common, however it is rare in double-stranded DNA in living cells because cellular error correction mechanisms can detect and fix it. Such C to U conversions are, however, common in ancient DNA since DNA repair is no longer active in dead tissue, and scientists studying DNA from neanderthals or mammoths or other extinct creatures, must correct for it before comparing the ancient DNA to that of organisms living today. This mutation, however, is also useful in confirming that scientists are actually looking at ancient DNA rather than some kind of modern contaminant. End sidebar.
Coming back to replication, due to an accumulation of C to T mutations in the lagging strand, the number of Cs on the lagging strand tends to decrease over time. While we expect a roughly equal number of Cs and Gs in the chromosome, these letters will be unevenly distributed along the genome, with more Gs on the lagging strand (since the Cs have been converted to Ts), and more Cs on the leading strand (since the Gs correspond to Cs on the complementary strand which are depleted because that is the lagging strand of the complementary DNA).
To compute the skew function represented in the plot above, you can simply move along the genome keeping track of the number of Gs and Cs observed so far, and computing the skew as the difference between those two numbers. In the figure, the origin of replication happens to be at the beginning of the chromosome, however, more generally, the origin of replication can be found at the location with the lowest value of the skew.
The figure above shows a particularly clean example of skew. In many cases, the shape of the skew function can be quite complex, making it hard to identify the actual location of the origin of replication. If that is the case, a combination of the ideas presented in this chapter may be necessary to find the origin of replication (e.g., a region with low G-C skew and multiple frequent k-mers that clump together).
Thus, if we move along a chromosome in the direction 5'-3', we expect the number of Gs to increase with respect to the number of Cs until we reach the terminus of replication (the point diametrically opposed to the origin of replication), at which point the number of Gs decreases with respect to the number of Cs until we reach the origin of replication. Thus, the origin of replication should be the point where the value of is minimized. You can see a plot of this function for the bacterium Clostridioides difficile in the figure below.
Note: Sometimes the skew is defined as a ratio:
EXERCISE: Implement code that computes the skew function along a genome and that reports the locations of the lowest and highest values of this function. Implement both the simple () and ratio () definitions of the skew and see which one produces an easier to interpret signal.