Introduction to Hidden Markov Models
As we discuss the use of various machine learning approaches in sequence annotation, it is useful to provide an overview of one of the computational approaches used in this context—Hidden Markov Models (HMMs). While HMMs are increasingly being replaced by other machine learning approaches (notably neural networks), they provide a useful introduction to the different ways in which machine learning is used in sequence annotation.
In their most basic form, HMMs encode a Markov process: a stochastic process in which the probability of the next event depends only on the current state, not on the entire history of past events. In a Hidden Markov Model, a set of observable events is generated by a set of "hidden" states. As a concrete example, in gene finding the states may be "intergenic region", "promoter region", "exon", and "intron", while the observable events would be nucleotides or codons. The states are connected to each other by transitions, each occurring with a certain probability, and each observable event is emitted by one or more states, also with some probability. HMMs are usually represented as a graph in which the states and the observable events are nodes connected by directed edges, each parameterized with a probability value (see below).
Note that HMMs are generative models—by following the edges according to the given probabilities, the model will move from state to state, in each state outputting an observation, thus generating a plausible DNA sequence containing genes.
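To make this concrete, here is a minimal sketch of such a model run as a generative process: the hidden states, transition probabilities, and emission probabilities are stored as plain dictionaries, and a sequence is produced by repeatedly emitting an observation and then following a transition edge. The two states, the nucleotide alphabet, and all probability values are invented for illustration; they are not trained parameters.

```python
import random

# Toy HMM for gene finding, reduced to two hidden states and nucleotide emissions.
# All probability values below are invented for illustration, not trained parameters.
states = ["intergenic", "exon"]

# transition[s][t]: probability of moving from state s to state t
transition = {
    "intergenic": {"intergenic": 0.9, "exon": 0.1},
    "exon":       {"intergenic": 0.2, "exon": 0.8},
}

# emission[s][x]: probability that state s emits nucleotide x
emission = {
    "intergenic": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
    "exon":       {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
}

# initial[s]: probability of starting in state s
initial = {"intergenic": 0.95, "exon": 0.05}

def sample(dist):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return random.choices(keys, weights=[dist[k] for k in keys])[0]

def generate(length):
    """Run the model as a generative process, returning (state path, sequence)."""
    path, seq = [], []
    state = sample(initial)
    for _ in range(length):
        path.append(state)
        seq.append(sample(emission[state]))  # emit an observation from the current state
        state = sample(transition[state])    # then follow a transition edge
    return path, seq

path, seq = generate(60)
print("".join(seq))
```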
The training of such a model entails learning the probabilities from actual DNA sequences for which the state is known (i.e., a genome where the genes have been annotated). Mathematically, the training algorithm attempts to change the probabilities on the edges so that the likelihood that the model will generate the observed DNA sequences is maximized.
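When the hidden states of the training data are known, as described above, the maximum-likelihood estimates of the probabilities reduce to counting transitions and emissions in the labeled sequences and normalizing; the sketch below illustrates this. The function name, the pseudocount, and the tiny labeled example are assumptions made for the illustration, and the unsupervised case (states unknown) would instead require an iterative procedure such as Baum-Welch.

```python
from collections import Counter, defaultdict

def train_supervised(labeled_examples, pseudocount=1.0):
    """Estimate transition and emission probabilities from sequences whose
    hidden states are known (e.g. an annotated genome). Each example is a
    (states, observations) pair of equal length; a small pseudocount avoids
    zero probabilities for events never seen in training."""
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for states, observations in labeled_examples:
        for i, (s, x) in enumerate(zip(states, observations)):
            emit_counts[s][x] += 1
            if i + 1 < len(states):
                trans_counts[s][states[i + 1]] += 1

    def normalize(counts, keys):
        total = sum(counts[k] + pseudocount for k in keys)
        return {k: (counts[k] + pseudocount) / total for k in keys}

    all_states = sorted(emit_counts)
    alphabet = sorted({x for c in emit_counts.values() for x in c})
    transition = {s: normalize(trans_counts[s], all_states) for s in all_states}
    emission = {s: normalize(emit_counts[s], alphabet) for s in all_states}
    return transition, emission

# A tiny hand-labeled example: I = intergenic, E = exon, aligned to nucleotides.
example = (list("IIIEEEEII"), list("ATGCGTACA"))
transition, emission = train_supervised([example])
print(emission["E"])
```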
Once a model is created, one can ask a number of questions. The forward algorithm, for example, can compute the probability that the model would generate a given sequence. Such a probability could be used to determine which ORFs are potential genes. The Viterbi algorithm can be used to find the sequence of states that is most likely given an observed DNA sequence. This algorithm can be used, for example, to segment a stretch of DNA into genes and intergenic regions. Both the forward and Viterbi algorithms are variants of a computational paradigm called dynamic programming. A more detailed description of these algorithms is beyond the scope of this chapter.
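Still, to give a flavor of how such dynamic programming works, below is a minimal Viterbi sketch using the same dictionary-based representation as the earlier examples. It works in log space to avoid numerical underflow on long sequences; the function name and argument layout are choices made for this sketch, not a standard API.

```python
import math

def viterbi(observations, states, initial, transition, emission):
    """Return the most likely hidden-state path for an observed sequence.
    Arguments are plain dictionaries, as in the sketches above."""
    log = lambda p: math.log(p) if p > 0 else float("-inf")
    # best[i][s]: log-probability of the best state path ending in state s at position i
    best = [{s: log(initial[s]) + log(emission[s][observations[0]]) for s in states}]
    back = [{}]  # back[i][s]: predecessor of s on that best path
    for i in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[i - 1][p] + log(transition[p][s])) for p in states),
                key=lambda item: item[1],
            )
            best[i][s] = score + log(emission[s][observations[i]])
            back[i][s] = prev
    # Trace back from the best final state to recover the full path.
    state = max(best[-1], key=best[-1].get)
    path = [state]
    for i in range(len(observations) - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    return list(reversed(path))

# Example use with the toy parameters defined earlier:
# viterbi(list("ATGCGTACA"), states, initial, transition, emission)
```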
A different version of HMMs is used to model protein families. These HMMs are called profile HMMs (pHMMs) and attempt to encode the structure of proteins from the same family (here, "family" refers to a group of proteins with the same function, rather than the taxonomic level with the same name). Profile HMMs encode the prototypical sequence profile of a protein from the targeted family, as well as potential deviations from this profile. Specifically, there are three types of hidden states (see figure below): match states – these represent the "consensus" of the sequences from the protein family; insertion states – these represent the insertion of one or more amino acids between two consecutive positions of the consensus; deletion states – these represent the deletion of one or more amino acids from the consensus. The pHMM outputs amino acids in the match and insertion states, according to probabilities learned during training. Transition probabilities between the different states are also learned. Training is typically performed by constructing a multiple sequence alignment of proteins known to belong to a particular protein family, then estimating the size of the model (the number of "match" states) and the appropriate set of probabilities from the aligned sequences.
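As a rough illustration of that last step, the sketch below estimates match-state emission probabilities from a toy multiple sequence alignment: columns made up mostly of gaps are treated as insertions, the remaining columns become match states, and per-column amino-acid frequencies (with a pseudocount) give the emission probabilities. The gap threshold, pseudocount, and toy alignment are assumptions for the example; real profile HMM software such as HMMER uses more sophisticated sequence weighting and priors.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def match_state_emissions(alignment, max_gap_fraction=0.5, pseudocount=1.0):
    """From a multiple sequence alignment (rows of equal length, '-' for gaps),
    select match columns (those with few gaps) and estimate the amino-acid
    emission probabilities of the corresponding match states."""
    n_seqs = len(alignment)
    emissions = []
    for col in range(len(alignment[0])):
        residues = [seq[col] for seq in alignment]
        if residues.count("-") / n_seqs > max_gap_fraction:
            continue  # mostly gaps: model this column with insertion states instead
        counts = Counter(r for r in residues if r != "-")
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        emissions.append({aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS})
    return emissions  # one emission distribution per match state

# Toy alignment of four made-up family members; the fourth column is mostly gaps,
# so the resulting model has five match states rather than six.
alignment = ["MKV-LA", "MRV-LA", "MKVGLA", "M-V-LA"]
print(len(match_state_emissions(alignment)), "match states")
```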