Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  1. String indexing

Introduction to string indexing

Introduction

In this chapter we'll focus on different approaches used to index strings or collections of strings such that subsequent queries can be performed more efficiently. When discussing string matching in the context of the Z and KMP algorithms, the Z and sp values were a form of indexing, but the indices were constructed for each pattern separately. More specifically, assume you have a text TTT of length nnn and a pattern P1P_1P1​ of length m1m_1m1​. The run time required to find P1P_1P1​ inside TTT is O(m1+n)O(m_1 + n)O(m1​+n) for either the Z algorithm or KMP (if this doesn't make sense, please read the chapter again).

Now, let's assume we have k different patterns: P1,...,PkP_1,..., P_kP1​,...,Pk​. If we try to search them against TTT, we incur a cost of O(mi+n)O(m_i + n)O(mi​+n) for each pattern iii, leading to a total runtime of O(nk+∑imi)O(nk + \sum_i m_i)O(nk+∑i​mi​) for all kkk patterns. We can't easily avoid the second part of this cost (∑imi\sum_i m_i∑i​mi​) as we have to look at every character in each pattern. But can we avoid going through the text kkk times? The indexing approaches described in this chapter will help us do so, and also provide some additional useful functionality.

PreviousMotif findingNextIntroduction to suffix trees

Last updated 4 months ago