Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Introduction
  • Biological application 1: Finding the origin of replication in bacterial genomes
  • Bacterial replication
  • What are we looking for?
  1. Finding patterns in DNA

Introduction to pattern discovery

PreviousEthical considerationsNextLooking for frequent k-mers

Last updated 3 months ago

Introduction

When first looking at biological data, it is easy to be discouraged. How can one write an algorithm to analyze the data without knowing exactly what we are looking for? As a computer scientist, you may know very little about biology. But even for biologists, most of the secrets hidden in biological sequences remain unknown. This was particularly true in the early days of genomics when the first DNA sequences were decoded. Yet, this scientific era led to many discoveries in biology AND in computer science. Much of our current understanding of biological system was derived in part through the use of computational tools developed by scientists who did not exactly know what they were looking for. How did these scientists translate the imprecisely stated (at least to computer scientists) problems posed by biologists into precise computational formulations for which algorithms could be developed? This chapter aims to introduce some paradigms that are fairly common in computational biology and data science more generally.

Biological application 1: Finding the origin of replication in bacterial genomes

First, let's start with a basic biological question: "Where is the origin of replication in a bacterial genome?". As a computer scientist, you are probably already lost, so a bit of background (that you would normally get from your biologist colleague or by searching the internet) is in order.

Bacterial replication

Bacterial genomes are typically circular, which means that each of the sister strands of DNA forms a circle, wrapped around the other sister strand as seen below. Note that the strands have opposite directions (in terms of the 5' to 3' orientation described in the introduction to biology chapter), something that will be important later in the chapter.

When a cell starts dividing, it needs to make a copy of its DNA. To do so, as discussed in the chapter introducing basic biological concepts, the cell must separate out the two sister strands and then synthesize, for each, a new strand that is complementary to the strand being copied. In bacteria, this process starts at a fixed location in the genome, a region of the genome named the origin of replication which is also known as oriC. The replication proceeds from oriC in the form of a "replication bubble" that separates out the strands and, at the same time, starts synthesizing the corresponding copies. The "forks" at the end of the bubble can be seen as traveling along the circular chromosome in opposite directions. By the time the forks meet at the other end of the chromosome, the cell will have constructed two separate copies of its chromosome, as shown below.

What are we looking for?

Now that you understand the process a bit better, the biological question posed above should make a bit more sense. You are given a string of letters that represents the DNA sequence of the genome of a bacterium, and you are being asked to find a particular region within it that represents the origin of replication. Still, however, you are not much closer to a problem that you can solve computationally. If you do not know what the sequence of oriC looks like, how can you find it in a string of DNA?

At this point, you need to think a bit more about what may give you some hints about what you are looking for. You know that there's only one such sequence in the genome. You also know that this sequence is somewhat special. After all, the bacterium can easily find it when trying to replicate its genome. You are looking for some kind of hidden message that the bacterium knows how to find in the genome. If you phrase it like this, you may be reminded of various puzzles you have solved in the past, where the key was to match the frequency of letters in the coded message with the frequency of letters in the English language. Now you have something you can actually code up.

Bacterial chromosomes are circular. The figure indicates the origin of replication as well as the direction of the sister strands.
Progression of the replication "bubble" from the origin of replication (marked by a line) to the terminus of replication. As the two replication forks proceed along the chromosomes (arrows in the left-most panel), the single strands are used as a template to create new strands (shown in magenta). At the end (right-most panel) the chromosome is fully copied into two identical copies.
Two concentric circles representing a double-stranded bacterial chromosome. The origin of replication (OriC) is highlighted at the top, and two arrows in opposite directions highlight the 5' to 3' orientation for each strand
A three-panel figure representing different stages of DNA replication. Left panel: early stage of DNA replication. A replication "bubble" starts at the origin of replication as the two original strands (shown as black circles) are separated, and complementary DNA (shown as magenta arcs) is synthesized at both replication forks for each strand of the original DNA. Middle panel: middle stage of DNA replication, which demonstrates how replication forks travel along the original circular chromosome in opposite directions. Right panel: the end of DNA replication. Both of the original strands have a complete complementary DNA strand, now represented as magenta circles.