Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Leaderboard search
  • Spectral convolution
  • Computing linear and cyclic spectra efficiently
  • Exercises
  1. Proteomics data analysis

Dealing with errors in experimental spectra

Leaderboard search

Everything we've discussed so far assumes that the experimental spectrum is perfect. Our "compatibility" definition requires every mass in the theoretical spectrum to exist in the experimental spectrum. How can we handle errors, such as masses that are missing or that are slightly different from the theoretical guess? Clearly we have to be able to tolerate some level of mismatch between the spectrum of a peptide or sub-peptide and the experimental spectrum. To do so, we will define the score of a theoretical spectrum T against an experimental spectrum S, to be the number of masses in T that have a match in S. For example, given the experimental spectrum S = 87, 129, 147, 216, 234, 260, 329, 347, 363, 389, 476 and theoretical spectrum T = 87, 113, 129, 216, 242, 329, SCORE(S, T) = 4 since masses 87, 129, 216, and 329 from T are also found in S.

Note that this score must take the multiplicity of masses into account. If T=87, 113, 129, 129, 216, 242, 329; SCORE(S, T) is still equal to 4 since only one of the masses equal to 129 is found in S.

This score is well defined for both linear and cyclic spectra, and can be easily computed using a commonly used programming paradigm for merging two sorted lists (one of the steps in the merge sort algorithm).

Now that we have a score, we note that we run into a different problem. Before, the efficiency of our branch and bound algorithm was predicated on our ability to prune paths through the tree when the corresponding peptides were not compatible with the experimental spectrum. Now, however, every peptide is compatible with the spectrum, just with a different score, thus, it appears that we may have to explore the full search tree, i.e., look again through all possible peptides, something that we already decided is too expensive.

To regain efficiency, we will use a leaderboard technique. Specifically, at each iteration of the cyclopeptide sequencing algorithm we will prune the search space by eliminating the lowest scoring branches. Conversely, for some user-defined value N, we will retain only the top N candidate peptides, where "top" is defined by the score of the match between their theoretical spectra and the experimental spectrum.

The new algorithm looks something like this:

LeaderboardCyclopeptideSequencing(S)  # S is an experimental spectrum
candidate_peptides = []
bestPeptide = []

while True
  candidate_peptides = expand(candidate_peptides) # add each potential 									  # letter to end of each 									  # peptide
  for cp in candidate_peptides
     compute score(linear_spectrum(cp), S)

  sort candidate_peptides based on score
  retain to N values in candidate_peptides

  for cp in candidate_peptides
     if max(theoretical_spectrum(cp))==max(S) # total size matches
       if score(theoretical_spectrum(cp)) >   
          score(theoretical_spectrum(bestPeptide))
            bestPeptide = cp # found a new best peptide
       remove cp from candidate_peptides
     if max(theoretical_spectrum(cp)) > max(S) # cp too large
       remove cp from candidate_peptides

  if candidate_peptides is empty
     return bestPeptide

end while

Note that we now no longer output all compatible peptides (compare to the algorithm BBCyclopeptideSequencing above), rather we retain the peptide that best matches the experimental spectrum. It is also important to note, that once the experimental spectrum contains errors, we can no longer guarantee to find the correct peptide. For example, it's possible that none of the candidate peptides have a total mass that equals the largest mass in the experimental spectrum.

This is, unfortunately, the situation in the real world, and practical implementations of this algorithm require quite a bit of heuristic to account for the many errors that could occur in practice.

Spectral convolution

Above, we have assumed that we knew the masses of all the amino acids. What if we do not? Even if we know a lot about biology, there are still many things we don't yet know, including amino acids or amino-acid like molecules we haven't seen before. What can we do in this situation?

Let's think a bit about the masses in the spectrum. Take, for example, the masses in the spectrum for SELF: 87, 113, 129, 147, 216, 234, 242, 260, 329, 347, 363, 389, 476. Let's take all relative differences between these masses, resulting in the following table.

87

113

129

147

216

234

242

260

329

347

363

389

476

87

26

42

60

129

147

155

173

242

260

276

302

389

113

16

34

103

121

129

147

216

234

250

276

363

129

18

87

105

113

131

200

218

234

260

347

147

69

87

95

113

182

200

216

242

329

216

18

26

44

113

131

147

173

260

234

8

26

95

113

129

155

242

242

18

87

105

121

147

234

260

69

87

103

129

216

329

18

34

60

147

347

16

42

129

363

26

113

389

87

You will note, that several of the masses in the spectrum appear as differences, because the difference between two sub-peptides is exactly another sub-peptide or amino-acid. If we count the number of times each mass occurs in the original spectrum, and in the difference table, we get the following copy numbers: 87 (6), 113 (6), 129 (6), 147 (6), 216 (4), 234 (4), 242 (4), 260 (4), 329 (2), 347 (2), 363 (2), 389 (2), 476 (1). You can easily see that the most frequent values in this set of differences, also called the convolution of the experimental spectrum, are exactly the masses of the amino acids forming the peptide SELF (highlighted in bold). The table also highlights with underlines where the other masses in the spectrum are found in the convolution.

Thus, even if you don't know any masses, you can build the convolution of the spectrum, and assume that the individual amino-acids are represented by the most frequent masses within the convolution.

Why does this work?

Because the most common difference between two sub-peptides is exactly one amino acid long.

Computing linear and cyclic spectra efficiently

By now you must have already thought about how you may write the code that computes the spectrum (linear or cyclic) of a peptide. Most likely you came up with something like:

for i in 0, len(peptide)
  for j in i + 1, len(peptide)
    spectrum.append(mass(peptide[i..j]) # linear spectrum
    spectrum.append(mass(peptide[0..i]) 

	+ mass(peptide[j..len(peptide)]) #cyclic spectrum only

The last two lines only need to be included if we compute the cyclic spectrum, and represent the portion of the peptide that is not included in the range i, j, or the sub-peptide that wraps around the end of the cyclic peptide.

The problem with this algorithm is that, at each iteration you make a call to the method mass(peptide) which simply adds up the masses of the amino-acids in the peptide, taking time proportional to the length of the peptide. This leads to an overall run time of O(n3)O(n^3)O(n3), which is quite expensive.

You can see a possible improvement within the inner loop. Once you've computed mass(peptide[i..j]), you can compute in O(1)O(1)O(1) mass(peptide[i..j+1]) as mass(peptide[i..j]) + mass(peptide[j+1]) . Thus, the inner loop can be computed in O(n)O(n)O(n) time, at least for linear spectra. To handle cyclic spectra, you can note that the mass wrapping around the end of the peptide is simply the total mass of the peptide minus the linear segment, so you can also quickly compute that value, leading to an overall run time of O(n2)O(n^2)O(n2). You cannot beat this time because the spectrum itself contains O(n2)O(n^2)O(n2) values.

This new algorithm looks something like:

M = mass(peptide) # O(n)
for i in 0, len(peptide)
  mp = mass(peptide[i])
  spectrum.append(mp) #linear
  spectrum.append(M – mp) # wrap around
  for j in i + 1, len(peptide)
    mp += mass(peptide[j]
    spectrum.append(mp) # linear spectrum
    spectrum.append(M - mp)#cyclic spectrum only, wrap-around

A further improvement can be made by constructing a special "prefix-sum" array, that stores the masses of prefixes of the full peptide (i.e., mass(peptide[0..i]) for all i). This array can be constructed with the code:

ps[0] = mass(peptide[0]
for i in 1, len(peptide)
  ps[i] = mass(peptide[i]) + ps[i – 1]

How is this at all useful? First of all, there exist a number of parallel algorithms able to compute the prefix sum in parallel, thus this step can be accelerated a lot using modern hardware.

Second, note that mass(peptide[i..j]) = ps[j] – ps[i], thus we can replace all computations off a mass in the algorithm described above with a subtraction of the values in the ps array. Putting it all together:

ps[0] = mass(peptide[0]
for i in 1, len(peptide)
  ps[i] = mass(peptide[i]) + ps[i – 1]

M = mass(peptide) 
for i in 0, len(peptide)
  for j in i + 1, len(peptide)
    spectrum.append(ps[j] - ps[i]) # linear spectrum
    spectrum.append(M – (ps[j] - ps[i]))#cyclic spectrum only, wrap-around

IMPORTANT: The code, as presented above, won't necessarily work correctly as you have to figure out the boundaries of the various loops. Do you end at len(peptide) or len(peptide) – 1? Is it ps[j] – ps[i] or ps[j+1] – ps[i] or even ps[j] – ps[i-1]? This is for you to figure out.

Exercises

  1. Write efficient code that determines whether two mass spectra match each other.

  2. Describe a dynamic programming algorithm to compute the number of peptides that have a mass equal to a target mass M.

  3. Assume that you cannot simply add up the masses of amino-acids to compute the theoretical mass of a sub-peptide, and hence the theoretical spectrum of a peptide. Can you still use the BBCyclopeptideSequencing algorithm to find the peptide(s) compatible with an experimental spectrum? Justify your answer and indicate how you would modify the BBCyclopeptideSequencing algorithm in case you think you can still use it.

  4. What are the masses in the linear spectrum of peptide: GRAM?

  5. What are the masses in the cyclical spectrum of peptide: GRAM?

  6. Which of the following peptides is compatible with the experimental spectrum: 0 129 131 147 163 276 278 292 294 407 423 439 441 570

    1. 129-131-147-163

    2. 163-276-129-147

    3. 129-147-131-163

    4. 131-147-163-276

  7. Assume you have a collection of N objects of different weights (represented as an array of integers W) and a backpack with total capacity C (total weight of objects it can carry). Write in pseudo-code, or a programming language of your choice, an algorithm that optimally packs the backpack, i.e., fits a set of objects with a total weight as close to the capacity C as possible, of course without exceeding C.

    As an example, for W[] = [3, 5, 9, 12, 13, 15, 20, 22, 24], C = 82, a possible answer is [15, 20, 22, 24], with a total weight of 81.

  8. The Clarice Smith Center has 6 performances scheduled in one day. You do not know when the performances start, but you know all pairwise time-differences between the start times of the different performances: 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 8,10. One of your friends has told you the start times of 3 performances as shown below. t1 = 0 . . . t5 = 8 . . . t6 = 10 Describe an algorithm for figuring out the start times of the other performances. What would the next step be in this algorithm?

PreviousCyclopeptide sequencingNextIntroduction to data clustering

Last updated 4 months ago