Bioinformatics lecture notes
  • Introduction
  • Introduction to biology (for computer scientists)
  • Ethical considerations
  • Finding patterns in DNA
    • Introduction to pattern discovery
    • Looking for frequent k-mers
    • Leveraging biology
    • Finding genes
  • Exact string matching
    • Introduction to exact string matching
    • Semi-numerical matching
    • The Z algorithm
    • The KMP algorithm
  • Multiple sequence alignment
    • Introduction to multiple sequence alignment
    • Motif finding
  • String indexing
    • Introduction to string indexing
    • Introduction to suffix trees
    • Suffix trees: beyond the basics
    • Suffix arrays
    • The Burrows-Wheeler transform and the FM-index
  • Inexact alignment
    • Introduction to inexact alignment
    • Inexact alignment calculation with dynamic programming
    • Example: filling the dynamic programming table
    • Modeling alignment as a graph
    • Backtracking through the dynamic programming table
    • From edit distance to alignment scores
    • Local alignment
    • Exercises
  • Advanced inexact alignment
    • Gap penalties
    • Sequence alignment in linear space
    • Sequence alignment with bounded error
  • Proteomics data analysis
    • Introduction to proteomic data analysis
    • From peptides to theoretical spectra
    • Cyclopeptide sequencing
    • Dealing with errors in experimental spectra
  • Data clustering
    • Introduction to data clustering
    • K-means clustering
    • Hierarchical clustering
  • Phylogenetic analysis
    • Introduction to phylogenetic inference
    • Distance-based phylogenetic analysis
    • Trait-based phylogenetic inference
  • Sequence assembly
    • Introduction to sequence assembly
    • Graph formulations of sequence assembly
    • Finding Eulerian tours
  • Gene finding and annotation
    • Introduction to sequence annotation
    • Gene finding
    • Introduction to Hidden Markov Models
    • Taxonomic and functional annotation
Powered by GitBook
On this page
  • Scoring matches between spectra
  • Brute force solution to cyclopeptide sequencing
  • Branch and bound solution to cyclopeptide sequencing
  • Cyclical versus linear spectra
  • Branch and bound
  1. Proteomics data analysis

Cyclopeptide sequencing

PreviousFrom peptides to theoretical spectraNextDealing with errors in experimental spectra

Last updated 4 months ago

Now that we know how to estimate the theoretical spectrum of a peptide, we can finally describe the goal of this chapter. Assume you are a scientist who has managed to purify a cyclic peptide but who doesn't know the sequence of amino-acids forming it. You "measure" the peptide with a mass spectrometer, using the process described above which first breaks the peptide up into small pieces and obtain an experimental spectrum for the peptide. Does the experimental spectrum contain enough information to allow you to figure out the sequence of letters forming your peptide? Specifically, you are trying to solve the following problem:

Cyclopeptide sequencing: Given an experimental spectrum S (a sorted list of masses and their relative abundance as measured by an instrument), identify a cyclic peptide P such that spectrum(P) = S.

Scoring matches between spectra

First, we need to decide what it means for two spectra to match. As we saw in the , the mass spectrum contains both a set of masses, and how frequent they are in the spectrum. We can glean the same information from the theoretical spectrum. The spectrum for peptide SELF shown in the previous page contains each mass exactly once, thus, we'd expect the experimental spectrum to have a series of peaks all of approximately the same height.

In contrast, let's take peptide HCFI, and it's theoretical spectrum:

H

C

F

I

HC

CF

FI

IH

HCF

CFI

FIH

IHC

HCFI

137

103

147

113

137-103

103-147

147-113

113-137

137-103-147

103-147-113

147-113-137

113-137-103

137-103-147-113

137

103

147

113

240

250

260

250

387

363

397

353

500

The sorted list of "peaks" is: 103, 113, 137, 147, 240, 250, 250, 260, 353, 363, 387, 397, 500; indicating that the peak corresponding to mass 250 is twice as high as the other ones since there are two fragments (CF and IH) that both have the same mass.

We will say that the theoretical and experimental spectra match each other if, and only if, both the set of masses and their multiplicity match.

Brute force solution to cyclopeptide sequencing

At this point we can propose a solution to the cyclopeptide sequencing problem.

given S – an experimental spectrum
for P in all possible peptides:
  check if theoretical_spectrum(P) == S

How expensive is this algorithm? Note that for a given length L, the total number of peptides that have that length is 20L20^L20L as you can view each peptide as an L-length word formed from an alphabet with 20 letters. Of course, not all possible peptides can match the spectrum, and we can already constrain the set of possible peptides by noting that the mass of the peptide we are looking for is the largest mass in the experimental spectrum S. This allows us to constrain the length of the peptide, between Lmin=max⁡(S)/186L_{min} = \max(S)/186Lmin​=max(S)/186 and Lmax=max⁡(S)/57L_{max} = \max(S)/57Lmax​=max(S)/57, i.e., by dividing the target mass by the mass of the largest amino-acid tryptophan (W) and by that of the smallest, glycine (G).

Note that rather than picking a particular estimate of length, we can explicitly calculate the number of peptides with a mass equal to a given target mass. We'll leave a dynamic programming solution to this problem as an exercise.

Branch and bound solution to cyclopeptide sequencing

To avoid having to search the full set of possible peptides, we'll propose a different solution. What if we could enumerate partial solutions, then eliminate the ones we know cannot be the right answer. For example, let us return to the peptide SELF. We note that the sub-peptide EF is not part of the spectrum of this peptide, thus we can simply eliminate from consideration any putative peptide that contains the string EF.

You can take a moment to figure out how you may modify the brute force algorithm described above to take into account such short-cuts, but you'll find out that it's not quite that simple to wrap your mind around it.

Instead, I propose a simpler solution. What if we could "grow" a set of potential peptide fragments and check to see if they are compatible with the experimental spectrum. Once a peptide fragment is no longer compatible with the experimental spectrum, we know that none of the peptides containing it are compatible either, thus we can stop this expansion.

What do we mean by compatible? Earlier in this section, we noted that sub-peptide EF was not part of the spectrum for SELF, thus we eliminated it as a choice. More formally, we say that:

A sub-peptide P is incompatible with an experimental spectrum S, if there are masses in spectrum(P) that are not in S.

Note that "masses in spectrum(P) that are not in S" also implies situations where spectrum(P) contains more copies of a certain mass than the experimental spectrum S.

So, how can we use this idea? At a very high level, we will start with a list of possible sub-peptides, and check if any of them can be discarded because they are incompatible with the experimental spectrum. We will then add an additional amino-acid to each of the remaining ones, and again check whether the spectrum is compatible with S, and repeat until we find the answer.

For the example with the peptide SELF, we would start with a list containing each individual amino-acid: A, C, D, ..., Y. We would notice that only amino-acids S, E, L, and F are compatible with the spectrum, then attempt to "expand" them by adding an additional amino-acid to each. Note that we only need to consider one of the four amino-acids, because we already know the other ones are not compatible with the experimental spectrum. Thus, we create a new list: SS, SE, SL, SF, ES, EE, EL, EF, ..., FF. Within this list, we'll find out that only SE, EL, LF, FS are compatible with the experimental spectrum of SELF, and we again expand them and repeat the process until we identify four possible solutions: SELF, ELFS, LFSE, FSEL; or the four circular rotations of the original cyclic peptide.

More formally, the algorithm is as follows:

BBCyclopeptideSequencing(S)  # S is an experimental spectrum
candidate_peptides = []
while True
  candidate_peptides = expand(candidate_peptides) # add each potential 									  # letter to end of each 									  # peptide
  for cp in candidate_peptides
     if cp is incompatible with S
        remove cp from candidate_peptides

  for cp in candidate_peptides
     if theoretical_spectrum(cp) == S
        output cp
        remove cp from candidate_peptides

  if candidate_peptides is empty
     exit

end while

The implementation of this algorithm relies on several helper functions:

  • expand – takes each peptide in the current candidate list and "expands" it by adding each possible amino-acid to it, thus creating multiple new candidates

  • is_compatible – checks if the spectrum of a peptide is compatible with a given experimental spectrum

  • spectrum – computes the spectrum of a peptide

Also note that this algorithm produces multiple equivalent answers because our peptide is circular, and we are "rediscovering" it after starting at different places in the spectrum.

It is also important to observe that the algorithm can distinguish between "mis-spellings" of the peptide we are looking for (SELF in this case). Take for example the incorrect peptide SLEF. It has all the same amino-acids as SELF, as well as the same total mass, yet one of its sub-peptides, EF, is not found in the spectrum of SELF, thus telling us that SLEF is an incorrect guess. This last point indicates the power of this experimental design—if we fragment the peptide and measure it's spectrum with a mass spectrometer, we do get a lot of information about its sequence.

Cyclical versus linear spectra

If you re-read the previous section more carefully, and examine the algorithm more closely, you will realize that the algorithm appears to be wrong. Let's assume we have the experimental spectrum for the peptide SELF: S, E, L, F, SE, EL, LF, FS, SEL, ELF, LFS, FSE, SELF; and that one of the partial peptides we have created is peptide SEL. Let's construct its spectrum:

spectrum(SEL) = S, E, L, SE, EL, LS, SEL

Notice a problem? The spectrum(SEL) contains sub-peptide LS which doesn't exist in experimental_spectrum(SELF). In other words, SEL is not compatible with SELF, which doesn't make sense.

Can you spot the problem and propose a solution?

Take a moment to do so before reading on.

The issue is that, so far, we only computed circular spectra, assuming the peptide we are analyzing is a cyclic peptide. When dealing with fragments of peptides, we are dealing with linear segments, thus the spectrum cannot contain any of the masses that "wrap around" the end of the fragment. Thus, we need to define a new function: linear_spectrum(P);

For SEL, linear_spectrum(SEL) = S, E, L, SE, EL, SEL. Now all the masses are found in the experimental spectrum of SELF, thus the partial peptide SEL is no longer eliminated from consideration.

To summarize, we can adjust the algorithm as:

BBCyclopeptideSequencing(S)  # S is an experimental spectrum
candidate_peptides = []
while True
  candidate_peptides = expand(candidate_peptides) # add each potential 									  # letter to end of each 									  # peptide
  for cp in candidate_peptides
     if linear_spectrum(cp) is incompatible with S
        remove cp from candidate_peptides

  for cp in candidate_peptides
     if theoretical_spectrum(cp) == S
        output cp
        remove cp from candidate_peptides

  if candidate_peptides is empty
     exit

end while

Note the two lines that compute a spectrum in the code above. In the first one, we check if the linear spectrum of a sub-peptide is compatible with the experimental spectrum. In the second one, we check whether the circular theoretical spectrum of the peptide matches exactly the experimental spectrum, indicating we found an answer.

As a final comment, you can gain some further speed improvement in the algorithm above, by performing a simple check: if the total mass of any sub-peptide exceeds the largest value in S, then we can eliminate that sub-peptide from consideration without computing any spectra. Adding more amino acids will only increase the size further away from the value we have observed in the experimental spectrum. We also, don't have to perform the second check (if the theoretical spectrum matches S) unless the largest mass in the two spectra are identical.

Branch and bound

Just a quick aside to explain why we refer to the algorithm above as "branch and bound". You can view its execution as traversing a large tree. The "expand" function, simply adds another level to this tree, and the branches represent the different letters that extend each candidate peptide. In essence, we are building a trie that contains all possible peptides in it, and trim it off, bounding its growth, whenever a candidate peptide is no longer viable. Hence the name Branch and Bound (see the tree in the figure below).

To make things more concrete, let's take the mass of peptide HCFI, which is 500. When looking for HCFI, we have to look for peptides of sizes between Lmin=500/186=3L_{min} = 500/186 = 3Lmin​=500/186=3L_{min} = 500/186 = 3$ amino-acids, and amino acids and Lmax=500/57=8L_{max} = 500/57 = 8Lmax​=500/57=8 amino acids. Let's assume an average size of 5 amino acids. There are 20520^5205 different possible peptides, or 3,200,000 possible choices. As the length of the possible peptides increases, the number of possible choices grows quite fast. For example, at L=7L=7L=7, we already have 1,2800,000,00 possible peptides. In other words, this algorithm rapidly becomes impractical.

introduction
Branch and bound execution of the cyclopeptide sequencing algorithm for peptide SELF. Only the left-most path (starting with S) is detailed. The letters highlighted with a box are eliminated from further consideration because the corresponding sub-peptides are incompatible with the spectrum of SELF. Out of the 84 possible sub-peptides starting at S, the algorithm only explores 20.
Search tree representation of the peptide branch and bound approach. Boxes around letters indicate which paths are terminated because the spectra do not match the experimental spectrum.