Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Finding patterns in DNA

Leveraging biology

PreviousLooking for frequent k-mersNextFinding genes

Last updated 3 months ago

A new signal: the asymmetry of replication

Up to now, the algorithms we have used have relied on only a very high level understanding of biology: DNA is double-stranded; and some "message" may be hidden within to indicate where the origin of replication is located. It's possible we could use the clump finding algorithm to find the location of oriC, however it is more plausible that we will find many regions of the genome that contain "unusual" clumps of k-mers. After all, there are likely many other hidden signals inside a genome. It is time to learn a bit more about the process of replication and see if such knowledge can help us.

At the beginning of this chapter, we did not go into many details about exactly how DNA replication works. We talked about the replication "bubble" that grows from the origin of replication and moves away from it until the full genome is replicated, however we did not discuss how the replication itself works. The figure below zooms into the replication bubble and highlights the asymmetric nature of the replication process.

As a quick reminder, the DNA has a natural orientation from 5' to 3'. This orientation is defined by the direction in which the DNA is replicated. New letters are added to a DNA string at the 3' end of the new strand, hence the DNA can only "grow" in that direction. In the figure above, you can see the replication bubble formed between the two strands (black lines). The orientation of the DNA strands is highlighted by arrows: the two complementary strands occur in opposite directions. As the replication proceeds in both directions starting with the origin of replication (the short line in the middle of the strands), on the right side of the figure (labeled "leading strand"), the new strand of DNA (highlighted in magenta) can be synthesized as soon as the two original strands are unwound. Thus, the DNA is never single stranded on this side of the chromosome. On the left side (labeled "lagging strand"), however, the DNA is unwound in the "wrong" direction. The nucleotide complementary to the newly unwound nucleotide occurs on the 5' end of already synthesized DNA, yet new letters can only be added at the 3' end. As a result, the DNA must be allowed to unwind for a longer stretch, and only once enough DNA has been unwound (typically between 1000-2000 nucleotides) can the replication proceed. Thus, on the lagging strand, the DNA is replicated in spurts, and the newly synthesized strand is formed of a collection of fragments called Okazaki fragments. These fragments are later glued together into a continuous strand by a special enzyme called a ligase.

A side-effect of this asymmetric process is that the section of DNA close to the replication fork remains single stranded until there is enough room to create an Okazaki fragment. Single stranded DNA is more fragile because it is not protected by the complementary strand, increasing the risk of mutations. One particular mutation is due to a biochemical process called deamination, specifically, the removal of an amino group (NH2) from the nucleotide cytosine. After deamination, the cytosine becomes uracil, a nucleotide you may remember as the letter U within RNA molecules. Since uracil is not a proper DNA nucleotide, error correction mechanisms in the cell convert it to thymine (the letter T).

In short, in single stranded DNA, the following string of mutations can occur:

cytosine (C) –> [deamination] –> uracil (U) –> [error correction] –> thymine (T)

Sidebar. This mutation is quite common, however it is rare in double-stranded DNA in living cells because cellular error correction mechanisms can detect and fix it. Such C to U conversions are, however, common in ancient DNA since DNA repair is no longer active in dead tissue, and scientists studying DNA from neanderthals or mammoths or other extinct creatures, must correct for it before comparing the ancient DNA to that of organisms living today. This mutation, however, is also useful in confirming that scientists are actually looking at ancient DNA rather than some kind of modern contaminant. End sidebar.

Coming back to replication, due to an accumulation of C to T mutations in the lagging strand, the number of Cs on the lagging strand tends to decrease over time. While we expect a roughly equal number of Cs and Gs in the chromosome, these letters will be unevenly distributed along the genome, with more Gs on the lagging strand (since the Cs have been converted to Ts), and more Cs on the leading strand (since the Gs correspond to Cs on the complementary strand which are depleted because that is the lagging strand of the complementary DNA).

Thus, if we move along a chromosome in the direction 5'-3', we expect the number of Gs to increase with respect to the number of Cs until we reach the terminus of replication (the point diametrically opposed to the origin of replication), at which point the number of Gs decreases with respect to the number of Cs until we reach the origin of replication. Thus, the origin of replication should be the point where the value of skew=(num(G)–num(C))skew = (num(G) – num(C))skew=(num(G)–num(C)) is minimized. You can see a plot of this function for the bacterium Clostridioides difficile in the figure below.

To compute the skew function represented in the plot above, you can simply move along the genome keeping track of the number of Gs and Cs observed so far, and computing the skew as the difference between those two numbers. In the figure, the origin of replication happens to be at the beginning of the chromosome, however, more generally, the origin of replication can be found at the location with the lowest value of the skew.

The figure above shows a particularly clean example of skew. In many cases, the shape of the skew function can be quite complex, making it hard to identify the actual location of the origin of replication. If that is the case, a combination of the ideas presented in this chapter may be necessary to find the origin of replication (e.g., a region with low G-C skew and multiple frequent k-mers that clump together).

Note: Sometimes the skew is defined as a ratio: skew=#G−#C#G+#C skew = {{\#G - \#C} \over {\#G + \#C}}skew=#G+#C#G−#C​

EXERCISE: Implement code that computes the skew function along a genome and that reports the locations of the lowest and highest values of this function. Implement both the simple (#G−#C\#G-\#C#G−#C) and ratio (#G−#C#G+#C{\#G-\#C}\over{\#G +\#C}#G+#C#G−#C​) definitions of the skew and see which one produces an easier to interpret signal.

Representation of the replication "bubble" where the 5' to 3' orientation of the outer strand is represented as a black left arrow. The right side of the origin of replication displays the replication of the leading strand, where a long magenta right arrow indicates continuous "growth" of the complimentary strand. The left side displays the replication of the lagging strand, with three consecutive short magenta left arrows that highlight the creation of multiple Okazaki fragments that makes up the complimentary strand.
The asymmetry of replication. On the leading strand, the DNA is replicated as soon as the double helix has been unwound. On the lagging strand, replication takes place in "spurts" since the direction in which the DNA is unwound is moving against the natural 5' to 3' direction of the complimentary strand.. Instead of a continuous strand, the complementary strand is built from a series of short fragments (called Okazaki fragments). Furthermore, the DNA next to the replication fork remains single-stranded until sufficient bases have been exposed to allow the creation of a new Okazaki fragment. For simplicity, we only show this process for one of the strands.
Plot of the difference between the number of Gs and Cs in a genome. A line at approximately 45 degrees increases from the origin of replication (left) to a maximum at the terminus of replication (middle of plot), then decreases towards the origin of replication (right end of the plot).
Plot of the G-C skew for the genome of the bacterium Clostridioides difficile. The skew increases from the origin of replication to the terminus of replication, then decreases again, reaching its minimum value at the origin of replication (since the genome is circular, the two ends of the plot represent the same location in the genome). Image source:
https://berthub.eu/articles/skewdb/