Bioinformatics lecture notes

Introduction to sequence annotation


Last updated 4 months ago

Introduction

Obtaining the DNA sequence of an organism (through genome assembly) is just the first step in the computational analysis of the genome. The process of sequence annotation starts to assign meaning to the DNA sequence by linking segments of the sequence to information that sheds light on the role these segments play in a cell. We typically think of annotation at two levels: structural and functional. Structural annotation involves finding the "interesting" segments of the genome, while functional annotation assigns a potential function to these segments (see figure below).

Genes, the segments of the genome that encode the information needed to build proteins, are one example of interesting segments. When applied to genes, structural annotation is usually referred to as gene finding. However, there are many other interesting segments in a genome. For example, the decision to turn a gene on or off is made by a special type of protein, a transcription factor, which binds the DNA at a particular location within the promoter region of the gene. Finding the extent of promoter regions is thus another form of structural annotation. Similarly, finding the small stretches of DNA within a promoter region that are "recognized" by a particular transcription factor (transcription factor binding sites, or TFBS) is a form of structural annotation that is usually performed through a process called motif finding. There are many other types of interesting DNA segments (origins of replication, centromeres, telomeres, CRISPR cassettes, transposons, etc.), and discussing all of them is beyond the scope of this chapter. However, the principles we describe here for finding them are broadly applicable.

CONFUSION ALERT: Here we use the term structural annotation to refer to the "structure" of the genome itself. The same term is also used for the annotation of structural features of DNA, RNA, or protein sequences (i.e., their 3-dimensional structure). That is a whole different (and very interesting) area of research that is beyond the scope of this chapter.

Once certain segments of the genome are identified through structural annotation, the process of functional annotation assigns each segment some form of meaning. For example, for a gene, functional annotation may determine whether that gene encodes a transcription factor or an enzyme. The annotation could go even deeper: it could determine which genes are influenced by the transcription factor, or what chemical reactions are catalyzed by the enzyme. For a non-gene segment, e.g., a binding site for transcription factors, the functional annotation may indicate which transcription factors recognize that specific site. A special type of functional annotation is taxonomic annotation: the process of assigning a taxonomic label to a sequence, i.e., determining which organism the sequence belongs to. To some extent this is not really a "function" but a representation of the evolutionary "history" of the sequence (more precisely, a representation of the evolutionary relatedness of the sequence to other sequences in databases).
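To make the two levels of annotation concrete, here is a minimal Python sketch of how an annotated segment might be recorded. The `Feature` record, its field names, the toy genome, and the function labels are all assumptions made for illustration; they are not taken from any particular annotation format or tool. The coordinates and feature type capture the structural annotation, while the optional `function` and `taxon` fields hold the functional (and taxonomic) annotation, which may be filled in later or not at all.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feature:
    """One annotated genome segment (hypothetical record layout)."""
    # Structural annotation: where the segment is and what kind of element it is.
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive
    kind: str    # e.g. "gene", "promoter", "TFBS"
    # Functional annotation: what the segment does, if known.
    function: Optional[str] = None
    # Taxonomic annotation: which organism the segment is assigned to, if known.
    taxon: Optional[str] = None

    def sequence(self, genome: str) -> str:
        """Return the stretch of the genome covered by this feature."""
        return genome[self.start:self.end]


def unannotated(features):
    """Features that were found structurally but have no function assigned yet."""
    return [f for f in features if f.function is None]


# Made-up toy sequence and labels, for illustration only.
genome = "TTGACATATAATGCATGGCTAAATAG"
features = [
    Feature(0, 12, "promoter"),
    Feature(0, 6, "TFBS", function="bound by a hypothetical transcription factor"),
    Feature(14, 26, "gene", function="hypothetical enzyme"),
]

for f in features:
    print(f.kind, f.sequence(genome), f.function or "(no function assigned)")
```

In this sketch the promoter has only a structural annotation, so `unannotated(features)` returns it alone; the gene and the binding site carry both levels.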

Figure: Visual description of functional annotation (top) and structural annotation (bottom). The figure shows a DNA segment with several structural elements annotated as genes, transcription factor binding sites, and promoter regions. The top part of the figure shows what functions these elements have, such as binding a specific binding site or performing an enzymatic function.