Taxonomic and functional annotation

Introduction

The first part of this chapter focused on the structural annotation of a genome sequence, i.e., finding important "features" along the genome sequence. Taxonomic and functional annotation refer to processes used to assign "meaning" to these features. The meaning assigned by taxonomic annotation is the name of the organism from which the sequence originates. The meaning assigned by functional annotation is the function of the particular sequence. Before we proceed, it is important to recognize that the words "name" and "function" are poorly defined in this context. We will return to this point later in the chapter. For now, however, we will simply assume that our goal is to assign a label to every feature detected by the structural annotation software, and we'll ignore the actual meaning of these labels. Thus, as a first approximation, we are looking at both taxonomic and functional annotation through the same lens.

Key computational approaches for annotation

Most often, when assigning labels to a gene, we are implicitly assuming a classification task. The label represents the group of genes that all have the same property. In machine learning terminology, we are performing multi-class classification, since a sequence could be assigned any one of many possible taxonomic or functional labels, in contrast to binary classification, which chooses between just two classes (e.g., friend versus foe). Thus, the different techniques for multi-class classification can be directly applied to biological sequences as well (the Wikipedia page for multi-class classification provides an extensive introduction).

Historically, the most widely used approach for classification (for both taxonomic and functional annotation) has been a nearest-neighbor strategy, built upon database search. In the broader classification field, a k-nearest-neighbor strategy identifies the k labeled objects that are closest to the object we are trying to classify (the query), then assigns to the query object the label assigned to the majority of its k neighbors. Frequently, in bioinformatics applications k=1, i.e., a sequence is assigned the label of the labeled sequence that most closely resembles it. This is the setting of a typical BLAST search: we often assume that the first BLAST "hit" can reveal the label of the query sequence. This strategy is very attractive since it doesn't require careful training; it is sufficient to have a database that contains labeled sequences. However, this approach is very sensitive to database errors (it suffices for the nearest neighbor to be incorrectly labeled for the results of the analysis to be wrong). For a broader discussion of the limitations of database searches for annotation, see this write-up.
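The nearest-neighbor idea can be sketched in a few lines of Python. This is only an illustration: it replaces the alignment score a tool like BLAST would compute with a much cruder similarity (the Jaccard index over shared 3-mers), and the sequences, labels, and function names are made up for the example.

```python
from collections import Counter

def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def knn_label(query, database, k=1, kmer_size=3):
    """Assign the majority label among the k most similar database entries.

    database is a list of (sequence, label) pairs.
    """
    q = kmers(query, kmer_size)
    ranked = sorted(
        database,
        key=lambda entry: jaccard(q, kmers(entry[0], kmer_size)),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy database: sequences and labels are invented for illustration.
database = [
    ("ATGGCTATCGACGAA", "recA"),
    ("ATGGCTATCGATGAA", "recA"),
    ("ATGAGTAAAGGAGAA", "gfp"),
]
query = "ATGGCTATCGACGAG"
print(knn_label(query, database, k=1))

# The same query against a database whose closest entry is mislabeled:
# with k=1 the error propagates directly to the query.
mislabeled = [("ATGGCTATCGACGAA", "gfp")] + database[1:]
print(knn_label(query, mislabeled, k=1))
```

The second call shows the sensitivity to database errors described above: a single wrong label on the closest entry is enough to change the answer when k=1.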

Another common approach for classification is the use of machine learning to learn the characteristics of sequences with a given label. The functional annotation of proteins is frequently performed using profile Hidden Markov Models (pHMMs). The InterPro database (https://www.ebi.ac.uk/interpro/) integrates multiple protein databases and contains curated pHMM models for many proteins and protein domains. These models can be queried using the HMMER (https://hmmer.org) package. Among the many tools included in this package are: hmmsearch—tool that searches a sequence database for sequences matching a given pHMM profile; hmmscan—tool that searches a sequence against a database of profiles; and hmmbuild—tool that can be used to construct pHMM profiles from a set of protein sequences (i.e., the tool used to train the classifier).

EXERCISE: Use the HMMER package to identify which ORF in an organism corresponds to a particular gene family, such as RecA. What HMMER tool would you use? Use the ORF finding tool you developed earlier when we discussed gene finding. Now pick one specific ORF and use the HMMER tool to find out which protein families most resemble that ORF. What HMMER tool would you use for this purpose?

EXERCISE: Build your own protein family model for a chosen protein family. Use it to characterize the ORFs from one or more organisms. How does it compare to the InterPro model that is the most related to the protein family you are studying? How would you decide which model is better?

Assessing confidence in classification

It is important to recognize that the output of search tools and classifiers comes with estimates of confidence and/or quality. You should never use a tool that does not. Commonly, this confidence is given as an E-value: the number of matches with an equal or higher score that one would expect to find by chance in a database of the same size. By score we refer either to the alignment score (for database searches) or to the "fit" with a protein family model (for pHMM searches). One can also consider how much of the sequence or protein model is covered by the alignment (information produced by some tools such as HMMER or BLAST), or look at the score itself. Thus, before assigning a label to a sequence, it is important to determine not just whether a near neighbor or profile match exists, but also whether the sequence matches this near neighbor or profile "well enough", i.e., whether it exceeds a particular cutoff in terms of score, E-value, or coverage.
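As a rough sketch of how such cutoffs might be applied, the Python snippet below filters hits in BLAST's default 12-column tabular output (`-outfmt 6`) by E-value and by the fraction of the query covered by the alignment. The query lengths are not part of that default output, so the sketch assumes they are supplied separately. The cutoff values, function name, and example hits are illustrative choices, not recommendations.

```python
import csv
import io

# Column names of BLAST's default tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]

def confident_hits(handle, query_lengths, max_evalue=1e-5, min_coverage=0.8):
    """Yield hits that pass both an E-value and a query-coverage cutoff.

    query_lengths maps each query id to its length (an assumption of this
    sketch, since the default tabular output does not include it).
    """
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        evalue = float(row["evalue"])
        aligned = abs(int(row["qend"]) - int(row["qstart"])) + 1
        coverage = aligned / query_lengths[row["qseqid"]]
        if evalue <= max_evalue and coverage >= min_coverage:
            yield row["qseqid"], row["sseqid"], evalue, coverage

# Made-up hits for illustration only.
example = io.StringIO(
    "orf1\tdbA\t98.0\t300\t6\t0\t1\t300\t5\t304\t1e-150\t540\n"
    "orf1\tdbB\t35.0\t60\t39\t0\t10\t69\t1\t60\t2e-3\t40.1\n"
    "orf2\tdbC\t90.0\t50\t5\t0\t1\t50\t1\t50\t1e-20\t95.0\n"
)
lengths = {"orf1": 300, "orf2": 400}
hits = list(confident_hits(example, lengths))
for hit in hits:
    print(hit)
```

In this toy input, the second hit is rejected because of its E-value, and the third because the alignment covers only a small part of the query even though its E-value is low.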

The score cutoffs themselves can be "learned", as done for example by MetaPhyler [1] in the context of taxonomic classification. The RGI tool for detecting antibiotic resistance genes [2] also carefully tunes the cutoffs necessary to determine whether a sequence corresponds to a certain antibiotic resistance gene. We can view this as a form of meta-learning: a classifier is trained to interpret the confidence values produced by another classifier.
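To make the idea of a learned cutoff concrete, here is a deliberately simple Python sketch. It is not the procedure used by MetaPhyler or RGI. It assumes we have scores for sequences whose true membership in a family is already known, and it picks the cutoff that classifies the most training examples correctly. The scores and labels are invented.

```python
def learn_cutoff(scores, is_member):
    """Pick the score cutoff that maximizes accuracy on labeled examples.

    scores: classifier scores for training sequences.
    is_member: matching booleans, True if the sequence truly has the label.
    A sequence is predicted to be a member when score >= cutoff.
    Returns (cutoff, training accuracy); ties keep the lowest cutoff.
    """
    best_cutoff, best_correct = None, -1
    for cutoff in sorted(set(scores)):
        correct = sum((s >= cutoff) == m for s, m in zip(scores, is_member))
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff, best_correct / len(scores)

# Invented training data: scores and whether each sequence is a true member.
scores = [12.0, 25.0, 31.0, 48.0, 55.0, 70.0]
is_member = [False, False, True, False, True, True]

cutoff, accuracy = learn_cutoff(scores, is_member)
print(cutoff, round(accuracy, 3))
```

A real tool would typically learn a separate cutoff per family or per taxonomic level and would evaluate it on held-out data rather than on the training set.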

Controlled vocabularies

So far we haven't discussed what the labels assigned to sequences represent. As we have described, the goal of annotation is to assign to a sequence an "is a type of" relationship to some pre-defined entity. While humans can handle ambiguity in definitions and language, computers cannot; it is therefore important to consistently define the labels that we want to assign to biological sequences. As an example, a human would be able to realize that the words "mountain lion", "cougar", and "puma" refer to the same animal. If, however, we want to train a computer to distinguish this animal from other big cats, we should label all mountain lion instances with the same label, such as the scientific name of the animal, Puma concolor. Thus, while free-form descriptions of biological entities are associated with database entries, modern databases also assign each entry a set of labels from well-defined controlled vocabularies.

There are two broad categories of such controlled vocabularies: taxonomies and ontologies. The former are a context-independent way of classifying objects, while the latter are labels that may vary depending on context. Most importantly, however, for both taxonomies and ontologies, one cannot make up new labels; rather, the labels are developed and controlled by relevant committees. Before we go into more details about these concepts, it is critical that you appreciate the fact that neither taxonomies nor ontologies represent "biological truth". They are simply convenient and consistent nomenclature that allows scientists (and software) to communicate unambiguously at a particular point in time. We are emphasizing "a particular point in time" because both taxonomies and ontologies change regularly, and, therefore, the labels they contain are consistent only with other labels from the same version of the controlled vocabulary. You cannot "mix and match" between different versions of a taxonomy or ontology, and we'll provide some examples below.

Biological taxonomies

In the biological realm, taxonomy is most commonly associated with the classification system for natural objects created by Linnaeus in the 1700s. In its modern version for living organisms and viruses, the taxonomy information is organized as a hierarchy. Thus, each organism is assigned several categories at different levels of resolution. At the lowest level of resolution, the NCBI taxonomy comprises several superkingdoms (Bacteria, Archaea, Eukaryota, and Viruses) corresponding to the main domains of life. These are further broken down into more and more specific taxonomic levels, including: kingdom, phylum, class, order, family, genus, and species. Note that we referred here to a specific taxonomy, since there are several competing approaches for classifying organisms. To repeat an important point: names are only consistent within a given taxonomy and version, and naming schemes should not be mixed with each other. Even though the different taxonomies are largely consistent with each other, key differences occur that may lead to analytic errors. One example is the organism Clostridioides difficile, an important human pathogen that is colloquially referred to as C. diff, and that causes chronic and life-threatening diarrhea, particularly in the elderly. This organism used to be called Clostridium difficile and later Peptoclostridium difficile, before the nomenclature settled on the current name. If provided with data labeled using different versions of the NCBI taxonomy, a computer program may incorrectly assume there are three distinct organisms, when in fact all the labels refer to the same bacterium.
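The following Python sketch illustrates the problem with mixed naming schemes and one possible remedy: mapping every known synonym to a single stable identifier before counting. The three names come from the paragraph above. The identifier 1496 is, to our knowledge, the NCBI TaxId for this species, but treat it as an assumption and verify it against the current taxonomy. The function name and sample data are hypothetical.

```python
# Synonyms named in the text, all mapped to one identifier.
# 1496 is assumed to be the NCBI TaxId of this species; verify before use.
SYNONYM_TO_TAXID = {
    "Clostridium difficile": 1496,
    "Peptoclostridium difficile": 1496,
    "Clostridioides difficile": 1496,
}

def count_organisms(labels, synonym_to_taxid=None):
    """Count distinct organisms, optionally collapsing known synonyms.

    Labels without a known synonym entry are kept as they are.
    """
    if synonym_to_taxid is None:
        return len(set(labels))
    return len({synonym_to_taxid.get(label, label) for label in labels})

# Labels as they might appear in data annotated with different versions.
sample_labels = [
    "Clostridium difficile",
    "Peptoclostridium difficile",
    "Clostridioides difficile",
    "Escherichia coli",
]

naive = count_organisms(sample_labels)
collapsed = count_organisms(sample_labels, SYNONYM_TO_TAXID)
print(naive, collapsed)
```

Without the synonym table, the program sees four organisms. With it, the three names for the same bacterium collapse into one, leaving two.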

The current naming conventions for taxonomic labels assign a binomial label to species, and single-word labels for all taxonomic levels above species. The binomial label includes both the name of the genus and species, e.g., Escherichia coli. The genus label, in this case, would be Escherichia, though it is not uncommon for the "genus" portion of a name to no longer be consistent with the actual genus that the organism belongs to, in cases where the taxonomy has been adjusted but the species labels have been retained for historical consistency. Different strains within a given species are distinguished by adding additional (semi-structured) information to the species name, e.g., Escherichia coli O157; Escherichia coli O157:H7 str. 06-3745.

To avoid confusion, the NCBI taxonomy also assigns each organism a unique taxonomy identifier (TaxId). The NCBI taxonomy is designed to be flexible, allowing for an arbitrary number of taxonomic levels to be used to describe a sequence. The full taxonomy tree is represented computationally in two files: names.dmp—linking TaxIds to taxonomic names; and nodes.dmp—linking taxonomic "nodes" to their taxonomic rank, the TaxId of their parent, and to other information.

Here are examples of the typical content of these files:

names.dmp:

562 | Escherichia coli | | scientific name |

The corresponding fields, separated by | and white space, are:

tax_id – the id of node associated with this name
name_txt – name itself
unique name – the unique variant of this name if name not unique
name class – (synonym, common name, ...) Note that the name class itself is a controlled vocabulary.
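As a small illustration of the field layout just described, the Python sketch below splits one names.dmp line into a dictionary. It assumes that names never contain the | character and that each line ends with a trailing |, as in the example above; the function and key names are choices made for this sketch.

```python
NAME_FIELDS = ["tax_id", "name_txt", "unique_name", "name_class"]

def split_dmp_line(line):
    """Split one taxonomy dump line into whitespace-stripped fields.

    Works for fields separated by | surrounded by spaces or tabs, and
    drops the trailing | that terminates each line.
    """
    line = line.strip()
    if line.endswith("|"):
        line = line[:-1]
    return [field.strip() for field in line.split("|")]

def parse_name_line(line):
    """Turn one names.dmp line into a dict keyed by field name."""
    record = dict(zip(NAME_FIELDS, split_dmp_line(line)))
    record["tax_id"] = int(record["tax_id"])
    return record

record = parse_name_line("562 | Escherichia coli | | scientific name |")
print(record)
```

Note that a single TaxId usually appears on several lines of names.dmp (one per name class), so a full parser would typically keep only the line whose name class is "scientific name" when a single display name is needed.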

nodes.dmp:

562 | 561 | species | EC | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |

The corresponding fields, separated by | and white space, are:

tax_id – node id in GenBank taxonomy database
parent tax_id – parent node id in GenBank taxonomy database
rank – rank of this node (superkingdom, kingdom, ...)
embl code – locus-name prefix; not unique
division id – see division.dmp file
inherited div flag (1 or 0) – 1 if node inherits division from parent
genetic code id – see gencode.dmp file
inherited GC flag (1 or 0) – 1 if node inherits genetic code from parent
mitochondrial genetic code id – see gencode.dmp file
inherited MGC flag (1 or 0) – 1 if node inherits mitochondrial gencode from parent
GenBank hidden flag (1 or 0) – 1 if name is suppressed in GenBank entry lineage
hidden subtree root flag (1 or 0) – 1 if this subtree has no sequence data yet
comments – free-text comments and citations
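The nodes.dmp layout can be handled in the same way. The Python sketch below maps the thirteen fields listed above onto one line of the file. The key names are our own shorthand for those fields, and any extra columns that a particular release of the file might contain are simply ignored.

```python
# Shorthand keys for the thirteen fields listed above, in file order.
NODE_FIELDS = [
    "tax_id", "parent_tax_id", "rank", "embl_code", "division_id",
    "inherited_div_flag", "genetic_code_id", "inherited_gc_flag",
    "mito_genetic_code_id", "inherited_mgc_flag", "genbank_hidden_flag",
    "hidden_subtree_root_flag", "comments",
]

def parse_node_line(line):
    """Turn one nodes.dmp line into a dict keyed by field name.

    Fields are split on | and stripped of surrounding white space; the
    trailing | at the end of the line is dropped first. zip() stops at
    the shorter list, so unexpected extra columns are ignored.
    """
    line = line.strip()
    if line.endswith("|"):
        line = line[:-1]
    values = [value.strip() for value in line.split("|")]
    record = dict(zip(NODE_FIELDS, values))
    for key in ("tax_id", "parent_tax_id"):
        record[key] = int(record[key])
    return record

node = parse_node_line(
    "562 | 561 | species | EC | 0 | 1 | 11 | 1 | 0 | 1 | 1 | 0 | |"
)
print(node["tax_id"], node["parent_tax_id"], node["rank"])
```

The parent_tax_id field is what links the nodes into a tree: following it repeatedly from any node leads toward the root of the taxonomy.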

EXERCISE: Write code that parses the names.dmp and nodes.dmp files downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) and outputs the full lineage of a given TaxId.

It should be clear from this brief description that parsing the NCBI taxonomy information is non-trivial. To simplify processing of taxonomy information for bacteria, sometimes the information is limited to the major 7 ranks: kingdom, phylum, class, order, family, genus, and species. Thus, the full taxonomy can be listed unambiguously as a string, with the different levels separated by semicolons:

Bacteria; Pseudomonadota; Gammaproteobacteria; Enterobacterales; Enterobacteriaceae; Escherichia-Shigella; Escherichia coli

Other approaches have also been proposed to avoid having to parse the tree information in the format used by NCBI. One example, proposed by the Greengenes database, is to prefix each name with its level, e.g., g_Escherichia; s_Escherichia coli, with g_ and s_ representing genus and species, respectively.
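Both flattened representations are easy to produce once the name at each of the seven ranks is known. The Python sketch below assumes the names are already available in a dictionary keyed by rank. The one-letter prefix with a single underscore follows the description above; check the documentation of the database you are using for its exact prefix convention. The function names are hypothetical.

```python
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def lineage_string(names_by_rank, prefixed=False):
    """Join the seven main ranks into a semicolon-separated string.

    With prefixed=True, each name is preceded by the first letter of its
    rank and an underscore (e.g., g_ for genus). Missing ranks are left
    empty so that every string has the same number of fields.
    """
    parts = []
    for rank in RANKS:
        name = names_by_rank.get(rank, "")
        parts.append(f"{rank[0]}_{name}" if prefixed else name)
    return "; ".join(parts)

def parse_lineage(text):
    """Recover a rank-to-name dict from an unprefixed lineage string."""
    names = [name.strip() for name in text.split(";")]
    return dict(zip(RANKS, names))

# The lineage used as an example in the text.
ecoli = {
    "kingdom": "Bacteria",
    "phylum": "Pseudomonadota",
    "class": "Gammaproteobacteria",
    "order": "Enterobacterales",
    "family": "Enterobacteriaceae",
    "genus": "Escherichia-Shigella",
    "species": "Escherichia coli",
}

plain = lineage_string(ecoli)
tagged = lineage_string(ecoli, prefixed=True)
print(plain)
print(tagged)
```

The prefixed form has the advantage that each field remains interpretable on its own, even if some ranks are missing or the string is truncated.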

EXERCISE: Write code that parses the names.dmp and nodes.dmp files downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) and outputs just the 7 main levels of the taxonomy for a given TaxId.

The Gene Ontology

As a prototypical ontology, we'll briefly describe the Gene Ontology, a major effort to characterize the function of genes. The "function" is defined in terms of 3 categories: Molecular function (e.g., protein kinase activity), Cellular component (e.g., membrane), and Biological process (e.g., DNA repair).

Just like the NCBI taxonomy, the Gene Ontology has a hierarchical structure that captures more or less general/specific properties of biological sequences. Unlike the taxonomy, however, the ontology allows each term to have more than one parent, i.e., the ontology is structured as a directed acyclic graph (DAG). Acyclic refers to the fact that there are no cycles in the graph, i.e., no term can be its own ancestor (see figure below). Each biological sequence may be assigned multiple ontology terms.

Representation of the ontology hierarchy for the hexose biosynthetic process. The figure shows a directed acyclic graph that highlights different properties at different levels of resolution.
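To show what "more than one parent" means computationally, the Python sketch below stores a small, hand-written fragment of such a graph as a dictionary from each term to its parents, then collects all ancestors of a term. The fragment is a simplified illustration loosely modeled on the hexose biosynthetic process example; it is not extracted from the Gene Ontology files and should not be taken as the authoritative set of relationships.

```python
# Hand-written toy fragment: each term maps to the list of its parents.
PARENTS = {
    "hexose biosynthetic process": [
        "hexose metabolic process",
        "monosaccharide biosynthetic process",
    ],
    "hexose metabolic process": ["monosaccharide metabolic process"],
    "monosaccharide biosynthetic process": [
        "monosaccharide metabolic process",
        "carbohydrate biosynthetic process",
    ],
    "monosaccharide metabolic process": ["carbohydrate metabolic process"],
    "carbohydrate biosynthetic process": ["carbohydrate metabolic process"],
    "carbohydrate metabolic process": [],
}

def ancestors(term, parents):
    """Collect every term reachable by following parent links upward.

    The 'seen' set ensures that terms reachable along several paths
    (which is normal in a DAG) are visited only once.
    """
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

result = ancestors("hexose biosynthetic process", PARENTS)
for name in sorted(result):
    print(name)
```

In a tree, each term would have exactly one path to the root. Here the starting term reaches "carbohydrate metabolic process" along several different paths, which is why a sequence annotated with a specific term implicitly carries all of the more general terms above it.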

Due to their higher complexity, ontologies cannot be easily simplified the way taxonomies are, so their structure has to be parsed computationally. There are two main formats, OBO (http://owlcollab.github.io/oboformat/doc/obo-syntax.html) and OWL (https://github.com/owlcs/owlapi), both of which are quite complex and not easily parsed. Instead, APIs exist that allow you to programmatically access ontology information.

EXERCISE: Follow links from https://geneontology.org/docs/download-ontology/ to find an appropriate API for processing ontologies, and write a small piece of code that can perform basic queries on gene ontology files.

Other types of annotation

The NCBI taxonomy and the Gene Ontology are just two of many different ways of annotating biological sequences. This chapter is not intended as a comprehensive summary of all annotation methods, but we want to briefly highlight a few of the key other approaches used in computational biology.

Enzymes (a type of protein that catalyzes biochemical reactions) are usually assigned an Enzyme Commission number, e.g., EC 3.4.11.4. The numbers separated by periods represent a hierarchy of function information going from more general (on the left) to more specific (on the right). For example, the number we just mentioned can be decoded as: 3 – hydrolase, 4 – acting on peptide bonds, 11 – that cleave off the amino-terminal amino acid of a peptide, and 4 – that specifically target tripeptides.
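Because the hierarchy is encoded directly in the number, it can be unpacked with simple string operations. The Python sketch below lists the successively more specific levels of an EC number and finds the most specific level shared by two numbers. The function names are our own, and the second EC number in the example is used only to illustrate the comparison.

```python
def ec_levels(ec_number):
    """Return the prefixes of an EC number, most general first.

    "EC 3.4.11.4" -> ["3", "3.4", "3.4.11", "3.4.11.4"]
    """
    digits = ec_number.replace("EC", "").strip().split(".")
    return [".".join(digits[:i]) for i in range(1, len(digits) + 1)]

def shared_level(ec_a, ec_b):
    """Return the most specific level two EC numbers have in common.

    Returns an empty string if they differ at the first position.
    """
    shared = ""
    for a, b in zip(ec_levels(ec_a), ec_levels(ec_b)):
        if a != b:
            break
        shared = a
    return shared

levels = ec_levels("EC 3.4.11.4")
common = shared_level("EC 3.4.11.4", "EC 3.4.21.4")
print(levels)
print(common)
```

Two enzymes that share only the prefix 3.4 are both hydrolases acting on peptide bonds but differ in the more specific aspects of their function, which gives a simple way to compare annotations at different levels of resolution.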

In some cases, the annotation of a sequence is intended to help build a mechanistic model of a biological system. A good example is metabolic modeling, where the goal is to create computational models of the chemical reactions within a cell that enable the simulation of cellular activity. To build such models, it is not sufficient to know what general function is performed by a sequence; one also needs precise information about the reactions involving the sequence. Examples of databases that contain such information are the Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.genome.jp/kegg/) and the US Department of Energy's KBase (https://www.kbase.us/).

In this chapter, we referred to "structural annotation" in terms of the annotation of segments of DNA within a genome, rather than the 3-dimensional structure of biological molecules. The annotation of the structural features of proteins is in itself a fascinating area of research. The most basic form involves the characterization of the secondary structure of proteins (determining where helices, turns, and beta-sheets occur within a protein sequence), but there are many other features that can be annotated, such as DNA binding domains or active sites.

As a final note, we also want to mention that taxonomy is inextricably linked to phylogenetic analysis, the study of the evolutionary relationship between biological sequences. An introduction to the key computational concepts in phylogenetic analysis is provided in a different chapter.

Is it OK to overfit?

Throughout this section on taxonomic and functional annotation, we have assumed that the goal of computational classification tools is to classify sequences according to some broad classes. In this context, it is important that the computational tools "learn" broad features of the class of sequences represented by a label in a way that can generalize to unseen sequences. Thus, we'd like to know that a sequence is Escherichia coli even if it doesn't exactly match any of the Escherichia coli genomes that we have sequenced to date. To achieve this goal, the machine learning community devotes careful attention to techniques that avoid overfitting, i.e., building a model that represents the features of the data in the training set but fails to capture more general features of the class. The concept of overfitting is tightly connected to the concept of confounders in statistics, and you may have heard of examples in other areas of machine learning. For example, a computer vision algorithm may fail to detect a car in a field simply because all the training data involved cars on roads (i.e., the model incorrectly learned that cars must be located on roads).

It is important to recognize, however, that overfitting is sometimes the desired feature. The police investigating a bank robbery are not hoping to find the general group of humans wearing dark clothes and masks, but to identify the specific humans who robbed the bank. In biological applications, a similar setting occurs in diagnostics: we care less that an organism is correctly classified as E. coli, and more about whether it is one of the few E. coli strains that pose a health risk. Thus, machine learning models used in diagnostics may intentionally overfit in order to avoid making overly general predictions.

Selected references

1. Liu B, Gibbons T, Ghodsi M, Treangen T, Pop M (2011) Accurate and fast estimation of taxonomic profiles from metagenomic shotgun sequences. BMC Genomics, 12 Suppl 2:S4. https://doi.org/10.1186/1471-2164-12-S2-S4

2. Jia B, Raphenya AR, Alcock B, Waglechner N, Guo P, Tsang KK, Lago BA, Dave BM, Pereira S, Sharma AN, Doshi S, Courtot M, Lo R, Williams LE, Frye JG, Elsayegh T, Sardar D, Westman EL, Pawlowski AC, Johnson TA, Brinkman FS, Wright GD, McArthur AG (2017) CARD 2017: expansion and model-centric curation of the comprehensive antibiotic resistance database. Nucleic Acids Res, 45(D1):D566–D573. https://doi.org/10.1093/nar/gkw1004
