Introduction to Hidden Markov Models


As we discuss the use of various machine learning approaches in sequence annotation, it is useful to provide an overview of one of the computational approaches used in this context—Hidden Markov Models (HMMs). While HMMs are increasingly being replaced by other machine learning approaches (notably neural networks), they provide a useful introduction to the different ways in which machine learning is used in sequence annotation.

In their most basic form, HMMs encode a Markov process: a stochastic process in which the probability of the next event depends only on the current state, not on the entire history of prior events. In a Hidden Markov Model, a set of observable events is determined by a set of "hidden" states. As a concrete example, in gene finding the hidden states may be "intergenic region", "promoter region", "exon", and "intron", while the observable events would be nucleotides or codons. The states are connected to each other by transitions, each occurring with a certain probability, and each event is emitted by one or more states with some probability. HMMs are usually represented as a graph in which the states and the observable events are nodes, connected by directed edges, each parameterized with a probability value (see the figures below).
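To make this concrete, here is a minimal sketch of how such a model might be stored in Python. All of the states, symbols, and probability values below are invented for illustration; a real gene-finding model would have more states and probabilities learned from data.

```python
# A toy HMM for gene finding, stored as plain dictionaries.
# States, symbols, and probabilities are illustrative, not trained values.

states = ["intergenic", "exon"]     # hidden states (a real model would have more)
alphabet = ["A", "C", "G", "T"]     # observable events (nucleotides)

# transition[s][t] = probability of moving from state s to state t
transition = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon":       {"intergenic": 0.2, "exon": 0.8},
}

# emission[s][x] = probability that state s outputs symbol x
emission = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
}

# initial[s] = probability of starting in state s
initial = {"intergenic": 0.99, "exon": 0.01}
```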

Note that HMMs are generative models: by following the edges according to the given probabilities, the model moves from state to state, outputting an observation in each state, and thus generates a plausible DNA sequence containing genes.
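Continuing with the toy dictionaries above, a short sketch of this generative behavior: emit a symbol from the current state, then follow a transition edge chosen at random according to its probability. The function name and the use of random.choices are illustrative choices, not part of any standard HMM library.

```python
import random

def generate(length, initial, transition, emission):
    """Sample a (state path, sequence) pair from the toy model above."""
    # choose the starting state according to the initial distribution
    state = random.choices(list(initial), weights=list(initial.values()))[0]
    path, seq = [], []
    for _ in range(length):
        path.append(state)
        # emit one observable symbol from the current state
        probs = emission[state]
        seq.append(random.choices(list(probs), weights=list(probs.values()))[0])
        # follow a transition edge to the next state
        nxt = transition[state]
        state = random.choices(list(nxt), weights=list(nxt.values()))[0]
    return path, "".join(seq)

# e.g., a 50-nucleotide "plausible" sequence plus the states that produced it
path, seq = generate(50, initial, transition, emission)
```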

The training of such a model entails learning the probabilities from actual DNA sequences for which the state is known (i.e., a genome in which the genes have already been annotated). Mathematically, the training algorithm adjusts the probabilities on the edges so as to maximize the likelihood that the model generates the observed DNA sequences.
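When the state at every position is known, this maximum-likelihood training reduces to counting transitions and emissions and normalizing the counts, as in the sketch below. The pseudocount of 1 is one common smoothing convention assumed here, not something prescribed by the text.

```python
def train(labeled_data, states, alphabet):
    """Maximum-likelihood estimates of transition and emission probabilities
    from (state_path, sequence) pairs in which every state is known."""
    # start all counts at 1 (pseudocounts) so unseen events keep nonzero probability
    trans = {s: {t: 1 for t in states} for s in states}
    emit = {s: {x: 1 for x in alphabet} for s in states}
    for path, seq in labeled_data:
        for i, (s, x) in enumerate(zip(path, seq)):
            emit[s][x] += 1                   # count emission of x in state s
            if i + 1 < len(path):
                trans[s][path[i + 1]] += 1    # count transition s -> next state
    # normalize counts into probabilities
    for s in states:
        t_total = sum(trans[s].values())
        trans[s] = {t: c / t_total for t, c in trans[s].items()}
        e_total = sum(emit[s].values())
        emit[s] = {x: c / e_total for x, c in emit[s].items()}
    return trans, emit
```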

Once a model is created, one can ask a number of questions. The forward algorithm, for example, computes the probability that the model would generate a given sequence; such a probability could be used to determine which ORFs are likely genes. The Viterbi algorithm finds the sequence of states that is most likely given an observed DNA sequence; it can be used, for example, to segment a stretch of DNA into genes and intergenic regions. Both the forward and Viterbi algorithms are variants of a computational paradigm called dynamic programming. A more detailed description of these algorithms is beyond the scope of this chapter.
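For the curious, here is a minimal Viterbi sketch over the dictionary representation used above, working in log space to avoid numerical underflow on long sequences. The forward algorithm has the same dynamic programming structure, with the max over predecessor states replaced by a sum of probabilities.

```python
import math

def viterbi(seq, states, initial, transition, emission):
    """Most likely state path for seq, via dynamic programming.
    V[i][s] = log-probability of the best path for seq[:i+1] ending in state s.
    Assumes all probabilities are nonzero (e.g., after pseudocount smoothing)."""
    V = [{s: math.log(initial[s]) + math.log(emission[s][seq[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        V.append({})
        back.append({})
        for s in states:
            # best predecessor state for s at position i
            prev = max(states, key=lambda t: V[i - 1][t] + math.log(transition[t][s]))
            V[i][s] = (V[i - 1][prev] + math.log(transition[prev][s])
                       + math.log(emission[s][seq[i]]))
            back[i][s] = prev
    # trace back from the best final state
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for i in range(len(seq) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]

# labels each position of seq with its most likely state, segmenting the DNA
labels = viterbi(seq, states, initial, transition, emission)
```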

A different version of HMMs is used to model protein families. These HMMs are called profile HMMs (pHMMs) and attempt to encode the structure of proteins from the same family (here, "family" refers to a group of proteins with the same function, rather than the taxonomic level with the same name). Profile HMMs encode the prototypical sequence profile of a protein from the targeted family, as well as potential deviations from this profile. Specifically, there are three types of hidden states (see the figure below):
  • match states – these represent the "consensus" of the sequences from the protein family;
  • insertion states – these represent the insertion of one or more amino acids between two consecutive positions of the consensus;
  • deletion states – these represent the deletion of one or more amino acids from the consensus.
The pHMM outputs amino acids in the match and insertion states, according to probabilities learned during training; transition probabilities between the different states are also learned. Training is typically performed by constructing a multiple sequence alignment of proteins known to belong to a particular family, then estimating the size of the model (the number of match states) and the appropriate set of probabilities from the aligned sequences, as sketched below.
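As a small illustration of the first steps of this training procedure, the sketch below picks match columns from an alignment and estimates their emission probabilities. The 50% gap threshold and the pseudocount of 1 are common conventions assumed here, not the only possible choices.

```python
def profile_from_msa(msa, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Estimate match-state emission probabilities from a multiple sequence
    alignment given as a list of equal-length strings, with '-' for gaps."""
    n_seqs, n_cols = len(msa), len(msa[0])
    # a column becomes a match state if fewer than half of its rows are gaps
    match_cols = [j for j in range(n_cols)
                  if sum(row[j] == "-" for row in msa) < n_seqs / 2]
    match_emissions = []
    for j in match_cols:
        counts = {aa: 1 for aa in alphabet}   # pseudocount of 1 per amino acid
        for row in msa:
            if row[j] in counts:
                counts[row[j]] += 1
        total = sum(counts.values())
        match_emissions.append({aa: c / total for aa, c in counts.items()})
    return match_cols, match_emissions

# e.g., three aligned sequences from a hypothetical family:
# column 3 is mostly gaps, so it becomes an insertion region, not a match state
cols, ems = profile_from_msa(["ACD-W", "ACE-W", "A-DFW"])
```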

Figure: A potential Hidden Markov Model for gene finding, represented as a graph with four hidden states (circles): intergenic, start codon, gene, and stop codon. The edges show the possible transitions between states, while emission edges connect each state to the observations (rectangles), i.e., the nucleotides or codons "emitted" by the model while it is in that state. The probability values on the edges are omitted for simplicity.

Figure: Graphical representation of a profile hidden Markov model. Squares represent match states; diamonds represent insertion states, which occur between match states and allow the insertion of new characters; circles represent deletion states, which allow match states to be skipped. The beginning and end of the protein are also modeled as separate states. Match and insertion states additionally have output/emission edges (omitted for brevity) indicating the probability with which each amino acid would be output by the model when entering that state.