Bioinformatics lecture notes

From peptides to theoretical spectra


Last updated 4 months ago

Introduction

Before trying to figure out how to go from an experimental signal to the sequence of a peptide, we need to make sure we can model the process that yields that signal. For simplicity, everything we discuss in this chapter rests on the assumption that the mass-to-charge ratios of peptide fragments are additive. For example, we assume that the mass-to-charge ratio of the peptide AG equals the sum of the mass-to-charge ratios of the two amino acids A and G. This is not necessarily true in the real world, but it is a reasonable approximation that allows us to come up with relatively simple algorithms.
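The additivity assumption can be sketched in a few lines of Python. The function name `peptide_mass` is hypothetical, and the two masses are taken from the amino acid mass table given later on this page.

```python
# Integer masses for the two residues in the example (from the mass table on this page).
MASS = {"A": 71, "G": 57}

def peptide_mass(peptide):
    """Sum the residue masses, as the additivity assumption prescribes."""
    return sum(MASS[aa] for aa in peptide)

print(peptide_mass("AG"))  # 71 + 57 = 128
print(peptide_mass("GA"))  # same mass: the order of residues does not matter
```

Note that under this model the mass alone cannot tell AG apart from GA, which is one reason the total mass is not enough to recover a sequence.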

To use a mass spectrometer to figure out the sequence of a protein or peptide, we need to "blast" the peptide to pieces; otherwise, all we would find out is its total mass-to-charge ratio. That total does provide some information: it constrains the possible length of the peptide. For a total mass M, the shortest peptide with that mass would be a string of tryptophan (W) residues, as tryptophan is the "heaviest" amino acid with a mass-to-charge ratio of 186. In other words, $L_{min} = M/186$ (rounded up when M is not a multiple of 186). The longest peptide with mass M would be a string of glycine (G), the "lightest" amino acid with a mass-to-charge ratio of 57: $L_{max} = M/57$ (rounded down when M is not a multiple of 57).
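As a minimal sketch of these bounds, assuming integer masses and rounding to whole residues, the hypothetical function `length_bounds` below computes the shortest and longest possible peptide lengths. The example mass 476 is the mass of the peptide SELF used later on this page.

```python
HEAVIEST = 186  # tryptophan (W)
LIGHTEST = 57   # glycine (G)

def length_bounds(total_mass):
    """Return (shortest, longest) possible peptide length for an integer total mass."""
    shortest = (total_mass + HEAVIEST - 1) // HEAVIEST  # M / 186, rounded up
    longest = total_mass // LIGHTEST                    # M / 57, rounded down
    return shortest, longest

print(length_bounds(476))  # (3, 8)
```

A peptide of mass 476 must therefore have between 3 and 8 residues, which is consistent with the 4 residues of SELF.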

However, for a given mass M, many different sequences of amino acids may add up to M; that is, knowing only the mass M of the peptide, we cannot determine its sequence. We argue that, by measuring the masses of the fragments produced by breaking the peptide into random pieces, we can constrain the set of sequences that are consistent with the mass M, in many cases resolving the sequence of the peptide of interest.
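To get a feel for how many sequences share a given mass, the sketch below counts them with a simple dynamic program. It is an illustration only: it counts linear (not cyclic) sequences, uses the integer masses from the table on this page, and the name `count_linear_sequences` is hypothetical. The masses 113 and 128 appear twice because two different amino acids carry each of them.

```python
# One entry per amino acid, in the order G A S P V T C I L N D K Q E M H F R Y W.
MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
          115, 128, 128, 129, 131, 137, 147, 156, 163, 186]

def count_linear_sequences(total_mass):
    """Count ordered amino acid sequences whose masses add up to total_mass."""
    counts = [0] * (total_mass + 1)
    counts[0] = 1  # the empty sequence
    for m in range(1, total_mass + 1):
        # A sequence of mass m ends in some residue of mass w,
        # preceded by any sequence of mass m - w.
        counts[m] = sum(counts[m - w] for w in MASSES if w <= m)
    return counts[total_mass]

print(count_linear_sequences(128))  # 4: K, Q, AG, GA
print(count_linear_sequences(476))  # the count for the mass of SELF
```

Even a mass as small as 128 is already ambiguous, and the count grows quickly with the mass.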

Modeling peptide fragmentation

First, some assumptions. We assume we are looking for the sequence of a peptide P of mass M. Our experiment starts with many copies of P, which are fragmented at random; the mass-to-charge ratios of the fragments are then measured with a mass spectrometer. We will also assume that each copy of the peptide is fragmented into at most two pieces. This is not true in the real world, but this assumption simplifies the algorithms we will develop.

It's time for an example. Let's assume we have a circular peptide KLFPWFNQYV, shown in the figure below.

We assume that the fragmentation process either leaves a copy of the peptide intact or breaks it into exactly two pieces. The process is random, so different copies of the peptide may break at different locations, as shown below.
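The breakage model can be sketched as follows: a cyclic peptide is cut at two bonds, which yields two linear fragments. The function names `break_cyclic` and `random_break` are hypothetical, and choosing the two cut positions uniformly at random is an assumption made here for illustration.

```python
import random

def break_cyclic(peptide, i, j):
    """Cut a cyclic peptide just before positions i and j (0 <= i < j < len(peptide)).

    Returns two linear fragments; the second one wraps around the end of the string.
    """
    n = len(peptide)
    if not 0 <= i < j < n:
        raise ValueError("need 0 <= i < j < len(peptide)")
    first = peptide[i:j]
    second = peptide[j:] + peptide[:i]
    return first, second

def random_break(peptide, rng=random):
    """Break the peptide at two distinct positions chosen uniformly at random."""
    i, j = sorted(rng.sample(range(len(peptide)), 2))
    return break_cyclic(peptide, i, j)

print(break_cyclic("KLFPWFNQYV", 0, 3))  # ('KLF', 'PWFNQYV')
print(break_cyclic("KLFPWFNQYV", 3, 4))  # ('P', 'WFNQYVKLF')
print(random_break("KLFPWFNQYV"))        # a different pair on each run
```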

To make it easier to work through examples, we'll switch to a shorter peptide: SELF. Fragmenting this cyclic peptide results in the following set of fragments: S, E, L, F, SE, EL, LF, FS, SEL, ELF, LFS, FSE, SELF (assuming that multiple copies of this peptide have been broken up at all possible locations that yield two pieces each). Note that we considered fragments that "wrap around" the end of the string since the peptide is circular. We will call the set of fragments generated (or more precisely, their masses), the theoretical spectrum of the peptide. We call this spectrum "theoretical" because it represents what we expect to see in the output of the instrument, rather than the actual signal we have measured. Throughout the following, we will interchangeably refer to the spectrum as the set of sub-peptides, or the set of masses.
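The fragment list above can be reproduced with a short sketch. Doubling the string is one simple way to handle fragments that wrap around the end; the function name `cyclic_subpeptides` is hypothetical.

```python
def cyclic_subpeptides(peptide):
    """List all fragments of a cyclic peptide, shortest first, plus the intact peptide."""
    n = len(peptide)
    doubled = peptide + peptide  # lets a slice run past the end and wrap around
    fragments = []
    for length in range(1, n):
        for start in range(n):
            fragments.append(doubled[start:start + length])
    fragments.append(peptide)  # the unbroken peptide
    return fragments

print(cyclic_subpeptides("SELF"))
# ['S', 'E', 'L', 'F', 'SE', 'EL', 'LF', 'FS', 'SEL', 'ELF', 'LFS', 'FSE', 'SELF']
```

A cyclic peptide of length n yields n(n - 1) + 1 fragments under this model, which gives 13 for SELF.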

The following table includes the masses (or mass-to-charge ratios) of all amino acids.

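| Amino acid | G | A | S | P | V | T | C | I | L | N | D | K | Q | E | M | H | F | R | Y | W |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mass | 57 | 71 | 87 | 97 | 99 | 101 | 103 | 113 | 113 | 114 | 115 | 128 | 128 | 129 | 131 | 137 | 147 | 156 | 163 | 186 |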

Adding up these values, we can compute the masses of the theoretical spectrum of peptide SELF.

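| Fragment | S | E | L | F | SE | EL | LF | FS | SEL | ELF | LFS | FSE | SELF |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Residue masses | 87 | 129 | 113 | 147 | 87+129 | 129+113 | 113+147 | 147+87 | 87+129+113 | 129+113+147 | 113+147+87 | 147+87+129 | 87+129+113+147 |
| Total mass | 87 | 129 | 113 | 147 | 216 | 242 | 260 | 234 | 329 | 389 | 347 | 363 | 476 |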
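The same computation can be sketched in Python, assuming the integer masses from the table of amino acid masses and the fragmentation model described above. The function name `theoretical_spectrum` is hypothetical.

```python
# Integer amino acid masses, copied from the mass table on this page.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def theoretical_spectrum(peptide):
    """Return (fragment, mass) pairs for every fragment of a cyclic peptide."""
    n = len(peptide)
    doubled = peptide + peptide  # handles fragments that wrap around the end
    spectrum = []
    for length in range(1, n):
        for start in range(n):
            fragment = doubled[start:start + length]
            spectrum.append((fragment, sum(MASS[aa] for aa in fragment)))
    # The intact peptide contributes its total mass.
    spectrum.append((peptide, sum(MASS[aa] for aa in peptide)))
    return spectrum

for fragment, mass in theoretical_spectrum("SELF"):
    print(fragment, mass)

print(sorted(mass for _, mass in theoretical_spectrum("SELF")))
# [87, 113, 129, 147, 216, 234, 242, 260, 329, 347, 363, 389, 476]
```

Sorting the masses gives the form in which a spectrum is usually compared against measurements, since the instrument reports masses rather than fragments.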

If you look carefully at the table of amino acid masses, you will see that I (isoleucine) and L (leucine) have the same mass, as do K (lysine) and Q (glutamine). That means that, using these masses, we will never be able to determine the exact sequence of a peptide that contains any of these amino acids: a mass spectrometer cannot tell the two members of each pair apart. Nonetheless, you will see that we can learn a whole lot about peptides from their spectra, even if some ambiguity remains unresolved.
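The ambiguous pairs can be found mechanically by grouping amino acids by mass. This is a small sketch over the integer mass table from this page; the function name `equal_mass_groups` is hypothetical.

```python
from collections import defaultdict

# Integer amino acid masses, copied from the mass table on this page.
MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}

def equal_mass_groups(mass_table):
    """Map each mass shared by more than one amino acid to those amino acids."""
    by_mass = defaultdict(list)
    for amino_acid, mass in mass_table.items():
        by_mass[mass].append(amino_acid)
    return {mass: group for mass, group in by_mass.items() if len(group) > 1}

print(equal_mass_groups(MASS))  # {113: ['I', 'L'], 128: ['K', 'Q']}
```

As a consequence, the peptides SELF and SEIF, for example, have identical theoretical spectra under this model.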

Figure: Cyclic peptide. Letters organized in a circle representing a cyclic peptide.
Figure: Different ways to fragment a cyclic peptide. The peptide may remain intact (top left), or break in exactly two pieces: KLF and PWFNQYV, VKLFP and YQNFW, and P and WFNQYVKLF.