Introduction to sequence annotation
Obtaining the DNA sequence of an organism is just the first step in the computational analysis of a genome. Sequence annotation begins to assign meaning to that sequence by linking segments of it to information that sheds light on the role these segments play in a cell. We typically think of annotation at two levels: structural and functional. Structural annotation finds the "interesting" segments of the genome, while functional annotation assigns a potential function to these segments (see figure below).
Genes, segments of the genome that encode the information needed to build proteins, are one example of interesting segments. When applied to genes, structural annotation is usually called gene finding. However, a genome contains many other interesting segments. For example, the decision to turn a gene on or off is made by the product of a special type of gene, a transcription factor, which binds the DNA at a particular location within the gene's promoter region. Finding the extent of promoter regions is therefore another form of structural annotation. So is finding the short stretches of DNA within a promoter region that are "recognized" by a particular transcription factor (transcription factor binding sites, or TFBS). There are many other types of interesting DNA segments (centromeres, telomeres, CRISPR cassettes, transposons, etc.). Discussing all of them is beyond the scope of this chapter, but the principles we describe here for finding them are broadly applicable.
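To make the idea of structural annotation concrete, here is a minimal sketch of a very naive gene finder. It assumes that a candidate gene is simply an open reading frame: an ATG start codon followed, in the same reading frame, by one of the three standard stop codons. It scans only the forward strand. The function name `find_orfs` and the `min_codons` parameter are illustrative choices, and real gene finders rely on far richer signals than this.

```python
from typing import List, Tuple

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq: str, min_codons: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) coordinates of forward-strand open reading frames.

    Coordinates are 0-based and end-exclusive, so seq[start:end] is the ORF
    including its start and stop codons. min_codons counts both of those
    codons. This is a toy heuristic, not a real gene finder.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                end = i + 3
                if (end - start) // 3 >= min_codons:
                    orfs.append((start, end))
                start = None  # look for the next ORF in this frame
    return sorted(orfs)


if __name__ == "__main__":
    # Toy sequence: ATG AAA TAG sits in the third reading frame.
    print(find_orfs("CCATGAAATAGGG"))  # prints [(2, 11)]
```

The output is purely structural: it says where a candidate segment lies, not what that segment does.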
CONFUSION ALERT: Here we use the term structural annotation to refer to the "structure" of the genome itself. The same term is also used for the annotation of structural features of DNA, RNA, or protein sequences (i.e., their 3-dimensional structure). That is a whole different (and very interesting) area of research that is beyond the scope of this chapter.
Once segments of the genome have been identified through structural annotation, functional annotation assigns each segment some form of meaning. For a gene, for example, functional annotation may determine whether the gene encodes a transcription factor or an enzyme. The annotation could go even deeper: it could determine which genes are influenced by the transcription factor, or which chemical reactions are catalyzed by the enzyme. For a non-gene segment, e.g., a binding site for transcription factors, the functional annotation may indicate which transcription factors recognize that specific site. A special type of functional annotation is taxonomic annotation: the process of assigning a taxonomic label to a sequence, i.e., determining which organism the sequence belongs to. To some extent this is not really a "function" but a representation of the evolutionary "history" of the sequence (more precisely, a representation of the evolutionary relatedness of the sequence to other sequences in databases).
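One way to picture how the two levels fit together is as a record in which structural annotation supplies the coordinates and type of a segment, and functional annotation adds labels on top. The sketch below is only an illustration: the `Segment` class, its field names, and the label values are all hypothetical and do not correspond to any particular annotation file format or database.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Segment:
    """A genome segment: structural fields plus free-form functional labels."""
    start: int                # structural: 0-based start coordinate
    end: int                  # structural: end coordinate (exclusive)
    kind: str                 # structural: e.g. "gene" or "TFBS"
    labels: Dict[str, str] = field(default_factory=dict)  # functional

    def add_label(self, category: str, value: str) -> None:
        """Attach one functional (or taxonomic) label to the segment."""
        self.labels[category] = value

    def summary(self) -> str:
        """One-line description; labels are listed in alphabetical order."""
        details = "; ".join(f"{k}={v}" for k, v in sorted(self.labels.items()))
        return f"{self.kind}[{self.start}:{self.end}] {details or 'unannotated'}"


# Hypothetical example: a gene found structurally, then labeled functionally.
gene = Segment(start=2, end=11, kind="gene")
gene.add_label("product", "enzyme")
gene.add_label("taxon", "example organism")
print(gene.summary())  # prints gene[2:11] product=enzyme; taxon=example organism
```

Keeping the labels separate from the coordinates mirrors the text: the same segment can gain deeper functional or taxonomic labels later without its structural definition changing.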