Dealing with errors in experimental spectra
Leaderboard search
Everything we've discussed so far assumes that the experimental spectrum is perfect. Our "compatibility" definition requires every mass in the theoretical spectrum to exist in the experimental spectrum. How can we handle errors, such as masses that are missing or that are slightly different from the theoretical guess? Clearly we have to be able to tolerate some level of mismatch between the spectrum of a peptide or sub-peptide and the experimental spectrum. To do so, we will define the score of a theoretical spectrum T against an experimental spectrum S, to be the number of masses in T that have a match in S. For example, given the experimental spectrum S = 87, 129, 147, 216, 234, 260, 329, 347, 363, 389, 476 and theoretical spectrum T = 87, 113, 129, 216, 242, 329, SCORE(S, T) = 4 since masses 87, 129, 216, and 329 from T are also found in S.
Note that this score must take the multiplicity of masses into account. If T=87, 113, 129, 129, 216, 242, 329; SCORE(S, T) is still equal to 4 since only one of the masses equal to 129 is found in S.
This score is well defined for both linear and cyclic spectra, and can be easily computed using a commonly used programming paradigm for merging two sorted lists (one of the steps in the merge sort algorithm).
Now that we have a score, we note that we run into a different problem. Before, the efficiency of our branch and bound algorithm was predicated on our ability to prune paths through the tree when the corresponding peptides were not compatible with the experimental spectrum. Now, however, every peptide is compatible with the spectrum, just with a different score, thus, it appears that we may have to explore the full search tree, i.e., look again through all possible peptides, something that we already decided is too expensive.
To regain efficiency, we will use a leaderboard technique. Specifically, at each iteration of the cyclopeptide sequencing algorithm we will prune the search space by eliminating the lowest scoring branches. Conversely, for some user-defined value N, we will retain only the top N candidate peptides, where "top" is defined by the score of the match between their theoretical spectra and the experimental spectrum.
The new algorithm looks something like this:
Note that we now no longer output all compatible peptides (compare to the algorithm BBCyclopeptideSequencing above), rather we retain the peptide that best matches the experimental spectrum. It is also important to note, that once the experimental spectrum contains errors, we can no longer guarantee to find the correct peptide. For example, it's possible that none of the candidate peptides have a total mass that equals the largest mass in the experimental spectrum.
This is, unfortunately, the situation in the real world, and practical implementations of this algorithm require quite a bit of heuristic to account for the many errors that could occur in practice.
Spectral convolution
Above, we have assumed that we knew the masses of all the amino acids. What if we do not? Even if we know a lot about biology, there are still many things we don't yet know, including amino acids or amino-acid like molecules we haven't seen before. What can we do in this situation?
Let's think a bit about the masses in the spectrum. Take, for example, the masses in the spectrum for SELF: 87, 113, 129, 147, 216, 234, 242, 260, 329, 347, 363, 389, 476. Let's take all relative differences between these masses, resulting in the following table.
87
113
129
147
216
234
242
260
329
347
363
389
476
87
26
42
60
129
147
155
173
242
260
276
302
389
113
16
34
103
121
129
147
216
234
250
276
363
129
18
87
105
113
131
200
218
234
260
347
147
69
87
95
113
182
200
216
242
329
216
18
26
44
113
131
147
173
260
234
8
26
95
113
129
155
242
242
18
87
105
121
147
234
260
69
87
103
129
216
329
18
34
60
147
347
16
42
129
363
26
113
389
87
You will note, that several of the masses in the spectrum appear as differences, because the difference between two sub-peptides is exactly another sub-peptide or amino-acid. If we count the number of times each mass occurs in the original spectrum, and in the difference table, we get the following copy numbers: 87 (6), 113 (6), 129 (6), 147 (6), 216 (4), 234 (4), 242 (4), 260 (4), 329 (2), 347 (2), 363 (2), 389 (2), 476 (1). You can easily see that the most frequent values in this set of differences, also called the convolution of the experimental spectrum, are exactly the masses of the amino acids forming the peptide SELF (highlighted in bold). The table also highlights with underlines where the other masses in the spectrum are found in the convolution.
Thus, even if you don't know any masses, you can build the convolution of the spectrum, and assume that the individual amino-acids are represented by the most frequent masses within the convolution.
Why does this work?
Because the most common difference between two sub-peptides is exactly one amino acid long.
Computing linear and cyclic spectra efficiently
By now you must have already thought about how you may write the code that computes the spectrum (linear or cyclic) of a peptide. Most likely you came up with something like:
The last two lines only need to be included if we compute the cyclic spectrum, and represent the portion of the peptide that is not included in the range i, j, or the sub-peptide that wraps around the end of the cyclic peptide.
This new algorithm looks something like:
A further improvement can be made by constructing a special "prefix-sum" array, that stores the masses of prefixes of the full peptide (i.e., mass(peptide[0..i]) for all i). This array can be constructed with the code:
How is this at all useful? First of all, there exist a number of parallel algorithms able to compute the prefix sum in parallel, thus this step can be accelerated a lot using modern hardware.
Second, note that mass(peptide[i..j]) = ps[j] – ps[i]
, thus we can replace all computations off a mass in the algorithm described above with a subtraction of the values in the ps
array. Putting it all together:
IMPORTANT: The code, as presented above, won't necessarily work correctly as you have to figure out the boundaries of the various loops. Do you end at len(peptide)
or len(peptide) – 1
? Is it ps[j] – ps[i]
or ps[j+1] – ps[i]
or even ps[j] – ps[i-1]
? This is for you to figure out.
Exercises
Write efficient code that determines whether two mass spectra match each other.
Describe a dynamic programming algorithm to compute the number of peptides that have a mass equal to a target mass M.
Assume that you cannot simply add up the masses of amino-acids to compute the theoretical mass of a sub-peptide, and hence the theoretical spectrum of a peptide. Can you still use the BBCyclopeptideSequencing algorithm to find the peptide(s) compatible with an experimental spectrum? Justify your answer and indicate how you would modify the BBCyclopeptideSequencing algorithm in case you think you can still use it.
What are the masses in the linear spectrum of peptide: GRAM?
What are the masses in the cyclical spectrum of peptide: GRAM?
Which of the following peptides is compatible with the experimental spectrum: 0 129 131 147 163 276 278 292 294 407 423 439 441 570
129-131-147-163
163-276-129-147
129-147-131-163
131-147-163-276
Assume you have a collection of N objects of different weights (represented as an array of integers W) and a backpack with total capacity C (total weight of objects it can carry). Write in pseudo-code, or a programming language of your choice, an algorithm that optimally packs the backpack, i.e., fits a set of objects with a total weight as close to the capacity C as possible, of course without exceeding C.
As an example, for W[] = [3, 5, 9, 12, 13, 15, 20, 22, 24], C = 82, a possible answer is [15, 20, 22, 24], with a total weight of 81.
The Clarice Smith Center has 6 performances scheduled in one day. You do not know when the performances start, but you know all pairwise time-differences between the start times of the different performances: 1, 2, 2, 2, 3, 3, 3, 4, 5, 5, 5, 6, 7, 8,10. One of your friends has told you the start times of 3 performances as shown below. t1 = 0 . . . t5 = 8 . . . t6 = 10 Describe an algorithm for figuring out the start times of the other performances. What would the next step be in this algorithm?
Last updated