Cyclopeptide sequencing
Last updated
Last updated
Now that we know how to estimate the theoretical spectrum of a peptide, we can finally describe the goal of this chapter. Assume you are a scientist who has managed to purify a cyclic peptide but who doesn't know the sequence of amino-acids forming it. You "measure" the peptide with a mass spectrometer, using the process described above which first breaks the peptide up into small pieces and obtain an experimental spectrum for the peptide. Does the experimental spectrum contain enough information to allow you to figure out the sequence of letters forming your peptide? Specifically, you are trying to solve the following problem:
Cyclopeptide sequencing: Given an experimental spectrum S (a sorted list of masses and their relative abundance as measured by an instrument), identify a cyclic peptide P such that spectrum(P) = S.
First, we need to decide what it means for two spectra to match. As we saw in the , the mass spectrum contains both a set of masses, and how frequent they are in the spectrum. We can glean the same information from the theoretical spectrum. The spectrum for peptide SELF shown in the previous page contains each mass exactly once, thus, we'd expect the experimental spectrum to have a series of peaks all of approximately the same height.
In contrast, let's take peptide HCFI, and it's theoretical spectrum:
H
C
F
I
HC
CF
FI
IH
HCF
CFI
FIH
IHC
HCFI
137
103
147
113
137-103
103-147
147-113
113-137
137-103-147
103-147-113
147-113-137
113-137-103
137-103-147-113
137
103
147
113
240
250
260
250
387
363
397
353
500
The sorted list of "peaks" is: 103, 113, 137, 147, 240, 250, 250, 260, 353, 363, 387, 397, 500; indicating that the peak corresponding to mass 250 is twice as high as the other ones since there are two fragments (CF and IH) that both have the same mass.
We will say that the theoretical and experimental spectra match each other if, and only if, both the set of masses and their multiplicity match.
At this point we can propose a solution to the cyclopeptide sequencing problem.
How expensive is this algorithm? Note that for a given length L, the total number of peptides that have that length is as you can view each peptide as an L-length word formed from an alphabet with 20 letters. Of course, not all possible peptides can match the spectrum, and we can already constrain the set of possible peptides by noting that the mass of the peptide we are looking for is the largest mass in the experimental spectrum S. This allows us to constrain the length of the peptide, between and , i.e., by dividing the target mass by the mass of the largest amino-acid tryptophan (W) and by that of the smallest, glycine (G).
Note that rather than picking a particular estimate of length, we can explicitly calculate the number of peptides with a mass equal to a given target mass. We'll leave a dynamic programming solution to this problem as an exercise.
To avoid having to search the full set of possible peptides, we'll propose a different solution. What if we could enumerate partial solutions, then eliminate the ones we know cannot be the right answer. For example, let us return to the peptide SELF. We note that the sub-peptide EF is not part of the spectrum of this peptide, thus we can simply eliminate from consideration any putative peptide that contains the string EF.
You can take a moment to figure out how you may modify the brute force algorithm described above to take into account such short-cuts, but you'll find out that it's not quite that simple to wrap your mind around it.
Instead, I propose a simpler solution. What if we could "grow" a set of potential peptide fragments and check to see if they are compatible with the experimental spectrum. Once a peptide fragment is no longer compatible with the experimental spectrum, we know that none of the peptides containing it are compatible either, thus we can stop this expansion.
What do we mean by compatible? Earlier in this section, we noted that sub-peptide EF was not part of the spectrum for SELF, thus we eliminated it as a choice. More formally, we say that:
A sub-peptide P is incompatible with an experimental spectrum S, if there are masses in spectrum(P) that are not in S.
Note that "masses in spectrum(P) that are not in S" also implies situations where spectrum(P) contains more copies of a certain mass than the experimental spectrum S.
So, how can we use this idea? At a very high level, we will start with a list of possible sub-peptides, and check if any of them can be discarded because they are incompatible with the experimental spectrum. We will then add an additional amino-acid to each of the remaining ones, and again check whether the spectrum is compatible with S, and repeat until we find the answer.
For the example with the peptide SELF, we would start with a list containing each individual amino-acid: A, C, D, ..., Y. We would notice that only amino-acids S, E, L, and F are compatible with the spectrum, then attempt to "expand" them by adding an additional amino-acid to each. Note that we only need to consider one of the four amino-acids, because we already know the other ones are not compatible with the experimental spectrum. Thus, we create a new list: SS, SE, SL, SF, ES, EE, EL, EF, ..., FF. Within this list, we'll find out that only SE, EL, LF, FS are compatible with the experimental spectrum of SELF, and we again expand them and repeat the process until we identify four possible solutions: SELF, ELFS, LFSE, FSEL; or the four circular rotations of the original cyclic peptide.
More formally, the algorithm is as follows:
The implementation of this algorithm relies on several helper functions:
expand – takes each peptide in the current candidate list and "expands" it by adding each possible amino-acid to it, thus creating multiple new candidates
is_compatible – checks if the spectrum of a peptide is compatible with a given experimental spectrum
spectrum – computes the spectrum of a peptide
Also note that this algorithm produces multiple equivalent answers because our peptide is circular, and we are "rediscovering" it after starting at different places in the spectrum.
It is also important to observe that the algorithm can distinguish between "mis-spellings" of the peptide we are looking for (SELF in this case). Take for example the incorrect peptide SLEF. It has all the same amino-acids as SELF, as well as the same total mass, yet one of its sub-peptides, EF, is not found in the spectrum of SELF, thus telling us that SLEF is an incorrect guess. This last point indicates the power of this experimental design—if we fragment the peptide and measure it's spectrum with a mass spectrometer, we do get a lot of information about its sequence.
If you re-read the previous section more carefully, and examine the algorithm more closely, you will realize that the algorithm appears to be wrong. Let's assume we have the experimental spectrum for the peptide SELF: S, E, L, F, SE, EL, LF, FS, SEL, ELF, LFS, FSE, SELF; and that one of the partial peptides we have created is peptide SEL. Let's construct its spectrum:
spectrum(SEL) = S, E, L, SE, EL, LS, SEL
Notice a problem? The spectrum(SEL) contains sub-peptide LS which doesn't exist in experimental_spectrum(SELF). In other words, SEL is not compatible with SELF, which doesn't make sense.
Can you spot the problem and propose a solution?
Take a moment to do so before reading on.
The issue is that, so far, we only computed circular spectra, assuming the peptide we are analyzing is a cyclic peptide. When dealing with fragments of peptides, we are dealing with linear segments, thus the spectrum cannot contain any of the masses that "wrap around" the end of the fragment. Thus, we need to define a new function: linear_spectrum(P)
;
For SEL, linear_spectrum(SEL) = S, E, L, SE, EL, SEL
. Now all the masses are found in the experimental spectrum of SELF, thus the partial peptide SEL is no longer eliminated from consideration.
To summarize, we can adjust the algorithm as:
Note the two lines that compute a spectrum in the code above. In the first one, we check if the linear spectrum of a sub-peptide is compatible with the experimental spectrum. In the second one, we check whether the circular theoretical spectrum of the peptide matches exactly the experimental spectrum, indicating we found an answer.
As a final comment, you can gain some further speed improvement in the algorithm above, by performing a simple check: if the total mass of any sub-peptide exceeds the largest value in S, then we can eliminate that sub-peptide from consideration without computing any spectra. Adding more amino acids will only increase the size further away from the value we have observed in the experimental spectrum. We also, don't have to perform the second check (if the theoretical spectrum matches S) unless the largest mass in the two spectra are identical.
Just a quick aside to explain why we refer to the algorithm above as "branch and bound". You can view its execution as traversing a large tree. The "expand" function, simply adds another level to this tree, and the branches represent the different letters that extend each candidate peptide. In essence, we are building a trie that contains all possible peptides in it, and trim it off, bounding its growth, whenever a candidate peptide is no longer viable. Hence the name Branch and Bound (see the tree in the figure below).
To make things more concrete, let's take the mass of peptide HCFI, which is 500. When looking for HCFI, we have to look for peptides of sizes between L_{min} = 500/186 = 3$ amino-acids, and amino acids and amino acids. Let's assume an average size of 5 amino acids. There are different possible peptides, or 3,200,000 possible choices. As the length of the possible peptides increases, the number of possible choices grows quite fast. For example, at , we already have 1,2800,000,00 possible peptides. In other words, this algorithm rapidly becomes impractical.