Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. Finding patterns in DNA

Finding genes

PreviousLeveraging biologyNextIntroduction to exact string matching

Last updated 3 months ago

As a bonus to the discussion so far, we focus here on the task of finding genes inside a bacterial genome. A more in-depth discussion of gene finding is available . Just like it was the case when we were looking for the origin of replication, we don't really know what we're looking for. Here's what we know, based on the information in the .

Genes are DNA segments that occur between the codons ATG (the start codon) and one of TAA, TAG, or TGA (the stop codons). A DNA segment between a start and a stop codon that are in frame (are separated by a number of letters divisible by 3), is called an open reading frame (ORF), and may represent a gene.

But which ORFs are likely to be genes? We can use a similar logic as the one used to find the origin of replication. The ORFs that are genes are likely to be unusual ORFs. But unusual how?

Let's think for a second. Assuming DNA is random, how long can we expect an ORF to be?

STOP for a few minutes and try to come up with an answer.

Since we have 64 codons (triplets of DNA letters) and 3 of them are stop codons, we'd expect a stop codon to occur roughly every 20 codons (or 60 base-pairs). Thus, the expected length of an ORF is 60 base-pairs, assuming DNA is perfectly random and each codon is equally likely to occur. Any ORF that is substantially longer than 60 base-pairs is unusual, thus may represent a gene.

Sidebar. In practice, bacterial genes are detected by training machine learning algorithms (initially based on Hidden Markov Models) to recognize gene sequences, and distinguish ORFs that are genes from ORFs that are not. However, the simple logic described here, was often used to identify a set of ORFs that were very likely to be genes, ORFs that were then used to train the machine learning algorithm. End sidebar.

As we've mentioned earlier, the numbers of Gs and Cs in a genome are roughly equal, as are the number of As and the number of Ts. The relative proportion of G/Cs versus A/Ts does vary from genome to genome, and scientists usually refer to this ratio as a rough characterization of the genome. Saying a genome is "50% GC" (or conversely "50% AT") means that the genome has roughly equal numbers of As, Cs, Gs, and Ts. Some genomes are "GC rich", when the number of G and C letters is much larger than the number of A and T letters. Such genomes tend to belong to organisms that live in hot environments (such as in hot springs) because the bond between the complementary bases G and C is stronger than that between A and C. The stronger bonds are necessary to prevent the two DNA strands from separating from each other at high temperatures. Conversely, some genomes are "AT rich" because they contain a lot more As and Ts than Cs and Gs.

So, what does this have to do with finding ORFs? The probability of observing a particular codon depends on the probability of observing each of its nucleotides. Assuming a 1/4 probability for each of A, C, G, and T, the probability of observing a start codon is simply p(ATG)=p(A)⋅p(T)⋅p(G)=1/64=0.062p(ATG) = p(A)\cdot p(T) \cdot p(G) = 1/64 = 0.062p(ATG)=p(A)⋅p(T)⋅p(G)=1/64=0.062. What if we are looking at a genome with high AT content (e.g., p(A)=p(T)=0.4,p(C)=p(G)=0.1p(A) = p(T) = 0.4, p(C) = p(G) = 0.1p(A)=p(T)=0.4,p(C)=p(G)=0.1). Then, p(ATG)=p(A)⋅p(T)⋅p(G)=0.016p(ATG) = p(A) \cdot p(T) \cdot p(G) = 0.016p(ATG)=p(A)⋅p(T)⋅p(G)=0.016, which means that start codons are rarer in this genome.

EXERCISE: Think about the null hypothesis about the length of genes in genomes with varying AT or GC contents. How easy do you think it is to find unusually long genes in a genome that is AT rich? How about a genome that is GC rich?

here
chapter on biology