Introduction to Hidden Markov Models

As we discuss the use of various machine learning approaches in sequence annotation, it is useful to provide an overview of one of the computational approaches used in this context—Hidden Markov Models (HMMs). While HMMs are increasingly being replaced by other machine learning approaches (notably neural networks), they provide a useful introduction to the different ways in which machine learning is used in sequence annotation.

In their most basic form, HMMs encode a Markov process: a stochastic process in which the probability of the next event depends only on the current state, not on the full history of events that preceded it. In a Hidden Markov Model, a set of observable events is determined by a set of "hidden" states. As a concrete example, in gene finding the hidden states might be "intergenic region", "promoter region", "exon", and "intron", while the observable events would be nucleotides or codons. The states are connected to each other by transitions, each occurring with a certain probability, and each event is emitted from one or more states with some probability. HMMs are usually represented as a graph in which the states and the observable events are nodes, connected to each other by directed edges, each parameterized with a probability value (see below).

A simple hidden Markov model represented as a graph composed of four states - intergenic, start codon, gene, and stop codon. The edges show the possible transitions between the states, and emission edges show the letters or codons that are emitted by the model while it is in the corresponding state.
A potential Hidden Markov Model for gene finding. The hidden states (circles) represent the different features of a DNA sequence, while the observations (rectangles) represent the information "emitted" by the model, i.e., nucleotides or codons. Each edge is parameterized by a probability value (omitted for simplicity).
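
To make this concrete, here is one way such a model might be written down in code. The sketch below (in Python) uses plain dictionaries for the transition and emission probabilities; the three states and all probability values are invented for illustration, and the model is deliberately smaller than the one in the figure. The same representation is reused in the later sketches.

```python
# A toy HMM as plain dictionaries. States and probabilities are made up
# for illustration only.
states = ["intergenic", "exon", "intron"]

# transition[s][t] = probability of moving from hidden state s to state t
transition = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron":     {"intergenic": 0.00, "exon": 0.15, "intron": 0.85},
}

# emission[s][x] = probability that state s "emits" nucleotide x
emission = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

# initial[s] = probability of starting in state s
initial = {"intergenic": 1.0, "exon": 0.0, "intron": 0.0}
```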

Note that HMMs are generative models—by following the edges according to the given probabilities, the model will move from state to state, in each state outputting an observation, thus generating a plausible DNA sequence containing genes.
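
As a sketch of this generative behavior, the hypothetical function below samples a state path and a matching sequence from the toy dictionaries defined earlier.

```python
import random

def generate(transition, emission, initial, length=30):
    """Sample a (state path, sequence) pair from the HMM's distributions."""
    # Pick a starting state, then repeatedly emit a symbol and transition.
    state = random.choices(list(initial), weights=initial.values())[0]
    path, seq = [], []
    for _ in range(length):
        path.append(state)
        symbols = emission[state]
        seq.append(random.choices(list(symbols), weights=symbols.values())[0])
        nxt = transition[state]
        state = random.choices(list(nxt), weights=nxt.values())[0]
    return path, "".join(seq)

# Example: generate a plausible 60-nucleotide sequence and its state path.
path, dna = generate(transition, emission, initial, length=60)
```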

The training of such a model entails learning the probabilities from actual DNA sequences for which the state is known (i.e., a genome where the genes have been annotated). Mathematically, the training algorithm attempts to change the probabilities on the edges so that the likelihood that the model will generate the observed DNA sequences is maximized.
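
When the state path of the training data is known, maximizing this likelihood reduces to counting how often each transition and each emission occurs and normalizing the counts; when the states are not known, an iterative procedure such as the Baum-Welch algorithm is used instead. The sketch below illustrates the supervised, counting-based case with a made-up training example.

```python
from collections import Counter, defaultdict

def train_supervised(labeled_examples):
    """Estimate transition and emission probabilities from sequences whose
    hidden state path is already known (e.g., an annotated genome)."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for path, seq in labeled_examples:
        for i, (state, symbol) in enumerate(zip(path, seq)):
            emit_counts[state][symbol] += 1            # how often state emits symbol
            if i + 1 < len(path):
                trans_counts[state][path[i + 1]] += 1  # how often state -> next state

    def normalize(counts):
        # Turn raw counts into probabilities (each state's row sums to 1).
        return {s: {t: n / sum(c.values()) for t, n in c.items()}
                for s, c in counts.items()}

    return normalize(trans_counts), normalize(emit_counts)

# Hypothetical usage: each training example pairs an annotated state path
# with the corresponding DNA sequence.
examples = [(["intergenic", "intergenic", "exon", "exon", "exon"], "ACATG")]
transition_probs, emission_probs = train_supervised(examples)
```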

Once a model is created, one can ask a number of questions. The forward algorithm, for example, can compute the probability that the model would generate a given sequence. Such a probability could be used to determine which ORFs are potential genes. The Viterbi algorithm can be used to find the sequence of states that is most likely given an observed DNA sequence. This algorithm can be used, for example, to segment a stretch of DNA into genes and intergenic regions. Both the forward and Viterbi algorithms are variants of a computational paradigm called dynamic programming. A more detailed description of these algorithms is beyond the scope of this chapter.
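
As an illustration of the dynamic programming involved, below is a sketch of the Viterbi algorithm operating on the dictionary representation used earlier. The forward algorithm has the same overall structure, except that it sums over predecessor states instead of keeping only the best one.

```python
import math

def viterbi(seq, states, transition, emission, initial):
    """Find the most likely hidden-state path for an observed sequence,
    using dynamic programming (log probabilities avoid numerical underflow)."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][s]: log-probability of the best path ending in state s at position i
    best = [{s: log(initial[s]) + log(emission[s][seq[0]]) for s in states}]
    back = [{}]
    for i in range(1, len(seq)):
        best.append({})
        back.append({})
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prev, score = max(
                ((p, best[i - 1][p] + log(transition[p][s])) for p in states),
                key=lambda x: x[1],
            )
            best[i][s] = score + log(emission[s][seq[i]])
            back[i][s] = prev

    # Trace back from the best final state to recover the full state path.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for i in range(len(seq) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return path[::-1]

# Example: label each position of a sequence with its most likely state.
labels = viterbi("ACGTACGGGTAC", states, transition, emission, initial)
```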

A different version of HMMs is used to model protein families. These HMMs are called profile HMMs (pHMMs) and attempt to encode the structure of proteins from the same family (here, "family" refers to a group of proteins with the same function, rather than the taxonomic level with the same name). Profile HMMs encode the prototypical sequence profile of a protein from the targeted family, as well as potential deviations from this profile. Specifically, there are three types of hidden states (see figure below): match states – these represent the "consensus" of the sequences from the protein family; insertion states – these represent the insertion of one or more amino acids between two consecutive positions of the consensus; deletion states – these represent the deletion of one or more amino acids from the consensus. The pHMM outputs amino acids in the match and insertion states, according to probabilities learned during training. Transition probabilities between the different states are also learned. Training is typically performed by constructing a multiple sequence alignment of proteins known to belong to a particular protein family, then estimating the size of the model (the number of "match" states) and the appropriate set of probabilities from the aligned sequences.

Graphical representation of a profile hidden Markov model. Squares represent match states. Circles, representing deletion states, allow match states to be skipped. Diamonds, representing insertion states, occur between match states, allowing the insertion of new characters.
A profile Hidden Markov Model. The states correspond to Matches, Insertions, and Deletions. The beginning and end of the protein are also modeled as separate states. Match and Insertion states also have "output/emission" edges (omitted for brevity) indicating the probability with which each amino acid would be output by the model when entering that state.
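
To illustrate one step of this training procedure, the sketch below derives match states and their emission probabilities from a toy multiple alignment. The gap-fraction heuristic, the pseudocount, and the alignment itself are illustrative assumptions; real profile HMM software uses considerably more elaborate weighting and estimation schemes.

```python
from collections import Counter

def match_columns(alignment, max_gap_fraction=0.5):
    """Decide which alignment columns become match states: a simple heuristic
    keeps columns where at most half of the sequences have a gap."""
    n_cols = len(alignment[0])
    return [
        c for c in range(n_cols)
        if sum(seq[c] == "-" for seq in alignment) / len(alignment) <= max_gap_fraction
    ]

def match_emissions(alignment, columns, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Estimate each match state's emission probabilities by counting residues
    in its alignment column (pseudocounts keep unseen residues above zero)."""
    profiles = []
    for c in columns:
        counts = Counter(seq[c] for seq in alignment if seq[c] != "-")
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profiles.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return profiles

# Toy alignment (made up): column 3 is mostly gaps, so it becomes an
# insertion region rather than a match state.
alignment = ["MKV-L", "MKVAL", "MRV-L", "MKV-L"]
cols = match_columns(alignment)              # [0, 1, 2, 4]
profiles = match_emissions(alignment, cols)  # one emission table per match state
```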
