Gene finding
Last updated
Last updated
We will focus first on the techniques used to discover the location of genes in genome sequences. We distinguish below between methods that rely solely on the sequence being analyzed—methods that are commonly called called ab initio (from the beginning in Latin)—and methods that leverage other types of data—commonly referred to as comparative gene finding.
Before describing the computational approaches used to discover genes, we revisit key concepts related to the —a theory that describes the flow of "information" through a cell, from DNA, to RNA, and to proteins. Importantly, this theory is increasingly viewed as incomplete and oversimplifies the flow of information within a cell, however it is adequate for the purpose of understanding gene finding.
In prokaryotes, genes are organized in operons—segments of DNA containing one or more genes that get transcribed together. The process of transcription copies these segments into messenger RNA (mRNA), which then gets converted into an amino-acid string through translation. The amino-acid string generate during translation folds up into a three-dimensional structure to become an active protein.
In eukaryotes, the process is a bit more complicated. Genes are "built" from multiple segments called exons and introns. During transcription, the entire gene is copied into a pre-messenger RNA molecule, which then gets spliced into the mRNA that eventually gets translated into a protein. By choosing different combinations of introns and exons, eukaryotic organisms can generate multiple proteins from the same gene (see below).
Due to the difference in gene structure between prokaryotic and eukaryotic organisms, gene finding approaches also differ. Specifically, in prokaryotic gene finding, a gene is simply defined by its start and its end, while in eukaryotic gene finding one must also detect the location of introns and exons within the gene.
We will start with the simpler case of prokaryotic gene finding, and focus on just the task of identifying the location of the gene within the genome (i.e., we are not discussing the related task of finding the boundaries of the transcribed region – the operon containing the gene). Biology provides a bit of help since genes (almost) always start with the same start codon: ATG. Genes also (almost) always terminate with one of three possible stop codons: TAA, TAG, or TGA. Between a start and a stop codon, the translation process reads the mRNA in groups of 3 nucleotides called codons. Thus, for a DNA segment to represent a gene, it must start with ATG, end with one of the three stop codons (TAA, TAG, or TGA), and have a length that is a multiple of three. Any DNA segment that has these three properties is called an open reading frame (ORF).
Computationally, finding ORFs is a fairly easy task. A possible algorithm is described below:
A more efficient process can be designed by recognizing that we can figure out which codons are in the same "reading frame" by examining their starting coordinate along the genome. The codons that are "in frame" all have the same remainder when dividing their starting coordinate by 3. There are, thus, 3 different frames in a strand of DNA, which can be represented by the numbers 0, 1, and 2 (the only possible remainders when dividing an integer by 3).
Thus, we traverse the genome as before, but during a first pass we'll simply record each start and stop codon and their location in a separate list depending on their reading frame. Then, it suffices to traverse each list and construct START-STOP pairs, representing ORFS. The pseudo-code may look like:
STOP AND THINK: How many elements do you expect each of the three lists to contain for a genome of size G?
So, let's do a bit of math – first, each of the lists reflects a traversal of the genome that looks at only one third of the locations, hence, the size of each list is bounded by G/3. Furthermore, each list contains one of four possible "special" codons, either the start or one of the three stop codons. Since each codon has three nucleotides, there are 43=64 possible codons. Since only one in 4 is special, only every 4th out of each 64 locations in the genome will be recorded in one of the lists. Thus, each list is expected to have a size of (G/3)*(4/64) = G/48.
To obtain an even faster algorithm, we can gain some intuition by looking at one of the lists possibly created by the algorithm we described above:
START, START, START, STOP, START, STOP, START, START, START, START, STOP, START, ...
We have highlighted the stop codons in bold font. Since translation ends when a stop codon is detected, an ORF can only occur between a start codon and the nearest stop codon following it. Thus, the loop highlighted as line (A) in the code described above, only has to look at the average number of elements occurring between two stop codons.
STOP AND THINK: How many codons occur, on average, between two stop codons in one of the frame-specific lists?
Let's think a little bit. Since 3 out of 4 codons are stop codons, we expect the size of the stretches between two stop codons are expected to be just 4/3 codons (or just over 1), including the orresponding stop codon. Thus, on average, the loop at line (A) in the code described above only has to loop over one codon, and is only executed about once every 3 elements. Thus, the ORF finding algorithm is a lot more efficient than could be inferred from a simple analysis of the pseudo-code.
The lists created by our ORF finding algorithm, thus, may look more like the following than the one I have shown above:
STOP, STOP, START, STOP, STOP, START, STOP, STOP, STOP, STOP, START, START, STOP,
EXERCISE: Implement a version of the ORF finding algorithm outlined above and apply it to several genome sequences. Do the lists look like the examples I've given you? Is there a difference in the size of the ORFs you detect for different genomes? Try Plasmodium falciparum (NCBI accession GCF_000002765.6), Escherichia coli (NCBI accession GCF_000005845.2), Streptomyces coelicolor (NCBI accession GCF_008931305.1), Carsonella rudii (NCBI accession GCF_000287275.1). How can you explain this difference?
Let's return for a second to one of the lists presumably created by our ORF finding algorithm:
START, START, START, STOP, START, STOP, START, START, START, START, STOP, START, ...
You should note that, in this list, there are multiple start codons for each stop codon. Are all proper start-stop pairs ORFS? (By proper, I mean a start paired with the nearest stop codon following it—a proper ORF can only have one stop codon at its end). The answer is yes: any proper pairing of a start and stop codon is a proper reading frame. An ORF can contain one or more start codons because the start codon ATG encodes for the amino-acid methionine (M). Any gene encoding for a protein that uses one or more methionine residues will, thus, contain one or more start codons.
Since we have multiple possible ORFs, how do we decide which ORFs represent actual genes? In ab initio gene finding, this is task is performed through machine learning. This is, actually, one of the first uses of machine learning in genomics, dating back to the 1990s. Each ORF, generated by the algorithm outlined above, is evaluated by a machine learning model that determines whether the ORF is likely to represent a gene. Originally, gene finding was performed with the help of Hidden Markov Models (HMMs), however conditional random fields and deep learning have also been used for this task.
In this chapter, we will not go into detail about how the different machine learning algorithms work, but it is important to spend a bit of time on how these models are being trained. Ab initio gene finding can be viewed as a form of binary classification—an ORF is a gene or it isn't. To train a machine learning model to perform binary classification, you need to provide it with examples to learn from, both positive examples (i.e., ORFs you know are genes) and negative examples (i.e., ORFs you know are not genes). To determine positive examples, one can use the sequence of previously characterized genes from the organism of interest. If insufficient such examples are available, you can use a statistical argument to determine which ORFs are likely to be genes. As we saw earlier, stop codons are quite frequent, occurring on average every 20 codons (there are 64 total codons and 3 of them are stop codons). Thus, the length of an average ORF should be around 20 codons. Any ORFs that are much longer than 20 codons are unusual, thus likely to be genes. You could, thus, come up with a heuristic to define which ORFs are long (e.g., by setting a length cutoff, or by choosing the longest X% of the ORFs), and use these as the positive examples. The negative examples can be generated by constructing random DNA sequences. To avoid constructing sequences that contain stop codons, the random construction process should be done at a codon level. That is, use a random number generator to select from among the 61 codons that are not stop codons as you build ORFs.
As you construct a training set for your machine learning model, you need to be careful to avoid overfitting. Make sure that both the training and the testing sets have similar characteristics: similar G/C content, and similarly sized ORFs. If, for example, all the negative examples are short and all the positive examples are long, the gene finder may use gene length as the primary feature for identifying genes, even though length is not a reliable feature from a biological point of view.
It is also important to try to match, as much as possible, the features of the organism you are analyzing. Different organisms have different codon usage patterns and G/C content, thus a gene finder developed for E. coli, for example, may not work well for Bacteroides fragilis.
STOP AND THINK: In gene finding, do you need to look at both strands of DNA?
The answer is yes—the genome sequences provided in public databases only represent one DNA strands, but genes could exist on both strands. Gene finding must, therefore, be applied to both sequence orientations.
EXERCISE: Construct a basic prokaryotic gene finder by combining an off-the-shelf machine learning tool and the ORF finder described earlier. Construct the positive examples from long ORFs, and negative examples by randomly generating ORF sequences. Apply it to a known genome and see how close you get to the actual annotation. What happens when you change the training data (e.g., by varying the length cutoff for "long" genes, or changing the random generation process for negative examples.
USEFUL TIP: The gene finding approach described above can be easily parallelized. Different segments of the genome can be processed independently of each other, and each ORF can be evaluated by a machine learning classifier independently of the other ORFs.
The information provided above should allow you to construct a decent prokaryotic gene finder, however any practical tool will need to handle some additional complexity. First, we implied above that we can choose between the different possible start codons simply by feeding the ORFs to a machine learning algorithm. This is not entirely accurate. Such a strategy may not be able to distinguish between possible start codons that occur close to each other (to the model, the different sequences will look similar enough). A separate process, called start site prediction, needs to be used to refine the gene "models" constructed by a gene finder.
A second issue relates to the fact that prokaryotic genes are transcribed together in operons. Knowing which genes are co-transcribed is important for understanding gene regulation, and gene finding alone cannot address this question. Operon prediction is, thus, another important task in structural annotation.
As it should be clear from the description of eukaryotic gene expression, the strategies used to find genes in prokaryotic genomes cannot be applied directly to eukaryotic genomes. The "clean" ORFs only occur in the mature mRNA, and it's sequence is not directly reflected in the genome. Finding an eukaryotic gene involves finding not just the start and stop codons, but also the location of all the splice sites within the gene. Thus, splice site prediction is an important part of ab initio eukaryotic gene finding. Further complications arise from the fact that exons can be very short, which means that they contain insufficient information for a machine learning model to distinguish them from DNA sequences that do not represent exons. For these reasons, eukaryotic gene finding is more frequently performed through comparative approaches.
In contrast with ab initio gene finding, which relies solely on the sequence being analyzed, comparative approaches leverage other data to determine where the genes are located. The simplest approach in prokaryotic genomes is a modification of the algorithm we discussed earlier for prokaryotic gene finding.
As a reminder, in a first step of the algorithm, we used an ORF finding algorithm to enumerate all the ORFs in the genome, then used machine learning to determine which ORFs are genes. We could replace the machine learning step with a comparative strategy as follows. Each ORF is converted into the corresponding amino-acid sequence, then the resulting sequences are searched against a protein database. If enough similarity is identified between an ORF's amino-acid sequence and a database sequence, then we can infer that the ORF represents a gene. Note that this process does not detect genes that do not encode for proteins, though one could search against gene sequence databases instead, capturing a broader collection of prokaryotic genes. In general, protein-level alignments are more effective at capturing distant evolutionary relationships, while genes that do not encode for proteins require DNA-level searches, which can only retrieve closely related sequences. Furthermore, this approach is not effective for organisms that are poorly studied (i.e., they are too distantly related to the sequences found in databases).
Another comparative strategy that leverages protein information involves the use of protein models—a type of generative machine learning that allows one to encode the structure of a particular type of protein. One of the paradigms used in this context is that of profile Hidden Markov Models (pHMMs) which are related to but distinct from the HMMs used in ab initio gene finding. A pHMM encodes the broad characteristics of proteins from a particular protein family, and can be used to assess whether a particular ORF could belong to that family, even if that ORF is not similar to any protein that was previously characterized. The HMMer package used to generate and use pHMMs also contains tools that can be used to find segments of a DNA sequence that "align" well to a given pHMM.
If multiple related genome sequences are available, evolutionary arguments can be used to determine gene regions. Since genes are important, they are less likely to have their sequence mutated (or more precisely, their sequence evolves slower) than genomic regions that are not genes. Thus, one can compare related genomes to each other and identify the segments that are similar to each other, which may represent segments that are conserved during evolution.
There are some caveats to using this approach. First, there are other genomic segments beyond genes that may be conserved during evolution. In other words, a comparative strategy may yield false positive (predicted genes that are not actually genes). Second, the evolutionary distance between the genomes compared to each other impacts which genes can be detected. At low evolutionary distances, almost the entire genome will appear to be "conserved", rendering the comparative approach uninformative, while at long evolutionary distances, genes may diverge sufficiently from each other leading to many false negatives (true genes missed by the comparative approach).
EXERCISE: Use either MUMmer or Minimap2 to align to each other related bacterial genome sequences. Do the shared regions correspond to genes? How do the results change as you align genomes at different evolutionary distances from each other?
While we are presenting this last, the use of gene expression data is perhaps the most direct technique for comparative gene finding. Technologies exist that can measure the RNA found in cells through DNA sequencing. Specifically, the RNA is converted to complementary DNA (cDNA) using an enzyme called a reverse transcriptase. The resulting cDNA can then be sequenced using one of the current sequencing technologies. Thus, we can look directly at the outcome of the transcription process. By mapping the sequencing reads to the genome, we should be able to see the location of the genes.
As a historical note, this strategy was used to characterize many human genes in the 1990s, before the human genome sequence was eventually reconstructed in 2001. The sequenced transcripts were called Expressed Sequence Tags (ESTs). Once modern sequencing technologies have started to be used, the process described above has started to be referred to as RNA sequencing (RNA-seq). There are some subtle differences between what used to be referred to as EST sequencing and modern RNA-seq, but it is important to acknowledge the historical legacy of modern technologies.
In its simplest form, the cDNA sequencing reads generated through RNA-seq can be aligned directly to the genome. Better results may be obtained by first assembling the reads together to reconstruct nearly complete transcript sequences, which are then aligned to the genome. The assembly process can resolve ambiguities and remove errors, thus yielding cleaner results. One caveat, however, is that errors in the transcript assembly process will then be transferred to the gene finding process. There's no free lunch.
This comparative strategy is particularly valuable for eukaryotic gene finding since, as described above, the boundaries between introns and exons are difficult to determine through ab initio methods. By focusing the RNA-seq process on the mature mRNA (usually performed by capturing the poly-A "tail" of the mature mRNA), the alignment to the genome sequence will reveal where the exons are located (the segments of the genome that match the transcript).
Importantly, most sequence alignment tools assume most of the sequence being aligned will match somewhere in the genome, and therefore are ill-suited for finding spliced alignments. To assist methods for comparative gene finding on the basis of RNA-seq data, specialized alignment tools have been developed. An example of a spliced alignment tool is TopHat.
As a final comment on comparative gene finding using gene expression data, it is important to recognize that not all genes (or all splice-forms of an eukaryotic gene) are expressed at all times in a cell. As an example, some genes are only ever expressed in the early stages of an embryo. Thus, the data resulting from a single RNA-seq experiment typically only reflect a fraction of the genes from an organism. To effectively detect most genes, it is critical to carefully design the sequencing experiment to capture the cells under a sufficient set of diverse conditions.
Note that this algorithm runs in time proportional to , where G is the length of the genome, since we are trying to find an ORF at each location in the genome, and finding the stop codon aligned with a given start codon may require looking through the entirety of the remaining part of the genome.
This process is still run time, but N refers to the number of elements in each of the frame-specific lists which is much shorter than the entirety of the genome.
EXERCISE: Use the HMMer package to find genes within a genome that align to one of the profiles from the protein families (PFAM) database: .