
PROTEOMICS SEARCH ENGINES

Updated: Mar 16

This blog post is a high-level introduction to what a search engine is in mass spectrometry-based proteomics (MS-proteomics).


For those not familiar with proteomics jargon, the abbreviations used throughout the text are:

Data-dependent acquisition (DDA)

Data-independent acquisition (DIA)

Mass spectrometry-based proteomics (MS-proteomics), AKA shotgun proteomics

Peptide-spectrum match (PSM)

Target-decoy strategy (TDS)

Trans-Proteomic Pipeline (TPP)


 

SOME NOTES BEFORE WE START

In MS-proteomics, two data collection approaches have been developed so far:

  • Data-dependent acquisition (DDA)

  • Data-independent acquisition (DIA)


In this and the following blog posts, I will focus on the DDA approach and, for convenience, will refer to it as shotgun proteomics (Figure 1).


Below is a figure with an oversimplified explanation of a typical bottom-up shotgun proteomics workflow. I have explained what bottom-up proteomics is in my previous blog post: “The Pre-Proteomics Era”.


I will use the terms search engine and proteomics platform throughout this blog post.


A search engine is an algorithm used to identify peptide sequences from uninterpreted mass spectrometry datasets. These datasets contain, among other things, precursor and fragment ion masses, along with their corresponding relative abundances.
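To make the definition concrete, below is a toy calculation of the precursor m/z a search engine would predict for a hypothetical tryptic peptide. This is my own illustrative sketch (the peptide and function names are invented); the residue masses are standard monoisotopic values.

```python
# Toy example: predicting the precursor m/z of a hypothetical peptide.
# Standard monoisotopic residue masses in daltons (Da).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water per intact peptide
PROTON = 1.00728   # proton mass

def precursor_mz(peptide: str, charge: int) -> float:
    """Monoisotopic m/z of a peptide ion at a given charge state."""
    neutral_mass = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER
    return (neutral_mass + charge * PROTON) / charge

print(round(precursor_mz("PEPTIDEK", 2), 4))  # hypothetical tryptic peptide
```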


A proteomics platform is a suite of algorithms that can include one or more search engines, often interfaced with statistical analysis tools and visualization algorithms.


The most popular proteomics platforms and their corresponding search engines are:

  • Proteome Discoverer™ — SEQUEST™ HT and Mascot®.

  • Crux — Comet and X!Tandem.

  • MaxQuant — Andromeda.

  • FragPipe — MSFragger.

  • Trans-Proteomic Pipeline — Comet, SEQUEST, Mascot®, and X!Tandem.


Figure 1. Bottom-up shotgun proteomics with data collected in DDA mode. The bottom-up approach is the most common shotgun proteomics workflow, and for it to happen, proteins must be digested with a site-specific protease (typically trypsin). Shown in this cartoon is a simplified illustration of a capillary column with an electrospray ionization tip. In DDA mode, the mass spectrometer scans the ionized peptides (precursor ions) entering the instrument and selects the most abundant ones for fragmentation in a higher-energy collisional dissociation (HCD) cell filled with neutral gas molecules (e.g., argon); other dissociation strategies can also be used. The precursor ions selected for fragmentation are isolated while passing through a quadrupole with a narrow isolation window (0.5-1.5 Da). The precursor ion isotopic cluster and the fragment ion masses are determined by a mass analyzer, in this case an Orbitrap. The peptide sequence cartoons in the bottom left of the figure are hypothetical examples of tryptic peptides, one of them bearing an alkylated cysteine. The red line in the precursor ion isotopic cluster inset indicates the monoisotopic ion isolated by the mass spectrometer for downstream HCD fragmentation.

 

 

SEQUEST

The SEQUEST algorithm, published in 1994, was the first computational tool developed for the automated analysis of shotgun proteomics data.


The concept was simple and elegant. The algorithm compared the fragment ion masses predicted from theoretical peptides in a protein database to the ones measured in a shotgun proteomics experiment. The best peptide-spectrum matches (PSMs) were shortlisted based on a significance score (the cross-correlation score), and the corresponding peptide sequences were used to infer protein identifications. Figure 2 below is a simplified explanation of the PSM workflow.


At the time of its publication, SEQUEST represented a major advancement in computational shotgun proteomics. Conceptually, the database-dependent PSM approach paved the way for the many search engines to come.


Figure 2. Simplified description of the PSM workflow. The PSM concept pioneered by SEQUEST can be explained in seven steps, performed iteratively for all the experimental precursor ions (MS1) and their associated fragmentation patterns (MS2) in an input raw data file. Step 1 - MS1 mass and charge determination. The mass and charge values of each MS1 spectrum extracted from the input dataset are calculated. Step 2 - Theoretical MS1 prediction. The proteins in the proteome database are digested in silico assuming trypsin specificity, and the mass and charge values of the resulting peptides are calculated (a different specificity is used if a protease other than trypsin was used in the experiment). Step 3 - Set of theoretical MS1 masses per experimental MS1. Theoretical peptides with mass values sufficiently close to those of each experimental MS1 are selected for downstream MS2 fragmentation prediction. Step 4 - MS2 prediction. The amino acid sequences associated with the theoretical MS1 masses selected in Step 3 undergo in silico fragmentation as per the chemical principles assumed by the Mobile Proton Hypothesis (“The Mobile Proton Hypothesis in Fragmentation of Protonated Peptides: A Perspective,” 2010). Step 5 - PSM. Each experimental MS2 spectrum is cross-correlated with the set of in silico MS2 spectra obtained from Steps 3 and 4. Step 6 - Peptide sequence assignment. The amino acid sequence associated with the best PSM from Step 5 is assigned to the corresponding experimental MS1. Step 7 - Protein inference. The peptide sequences obtained are used to infer protein identities. Conceptually, the protein inference problem is nontrivial due to the presence of protein isoforms and homologues, which share overlapping peptide sequences.

In the figure, the amino acids K/R are highlighted in red to indicate that the peptide sequences are tryptic. In the left panel, the middle inset labeled PSM is composed of red, black, and blue lines. The black and blue lines correspond to experimental and predicted spectra, respectively. The red lines indicate ion masses that matched in the PSM step. The right panel shows a hypothetical example of the "protein inference problem". The experimental and predicted MS2 spectra are compared in mirror image for convenience.
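For readers who prefer code, below is a minimal, self-contained sketch of Steps 2-6. It is my own toy illustration, not SEQUEST: in particular, SEQUEST's cross-correlation score is approximated here by a crude shared-peak count, and all names and tolerances are invented.

```python
# Toy sketch of the PSM workflow (Steps 2-6 of Figure 2); not SEQUEST itself.
import re

PROTON, WATER = 1.00728, 18.01056  # proton and water monoisotopic masses (Da)
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

def tryptic_digest(protein: str) -> list[str]:
    """Step 2: cleave after K/R, but not before P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def by_ions(peptide: str) -> list[float]:
    """Step 4: singly charged b- and y-ion m/z values."""
    m = [RESIDUE_MASS[aa] for aa in peptide]
    ions = []
    for i in range(1, len(m)):                     # one cut per backbone bond
        ions.append(sum(m[:i]) + PROTON)           # b ion (N-terminal piece)
        ions.append(sum(m[i:]) + WATER + PROTON)   # y ion (C-terminal piece)
    return ions

def shared_peaks(observed, theoretical, tol=0.02) -> int:
    """Step 5, crudely: count fragment masses that agree within tol (Da)."""
    return sum(any(abs(o - t) <= tol for t in theoretical) for o in observed)

def best_psm(precursor_mass, observed_peaks, protein, prec_tol=0.01):
    """Steps 3 and 6: filter candidates by precursor mass, keep the best."""
    candidates = [
        p for p in tryptic_digest(protein)
        if abs(sum(RESIDUE_MASS[aa] for aa in p) + WATER - precursor_mass) <= prec_tol
    ]
    return max(candidates, default=None,
               key=lambda p: shared_peaks(observed_peaks, by_ions(p)))
```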


Despite its innovative value, the first version of SEQUEST had shortcomings, the most important being:

  • The lack of a scoring function to estimate the statistical probability that a given PSM was obtained by chance

  • Long PSM computation times


Many improvements to the SEQUEST algorithm were made in the following years. By 2007, SEQUEST was significantly faster than the prototype, it could be parallelized, and its PSM scoring model had been refined.


These improvements were implemented by the private sector and academia alike, resulting in two widely used versions of SEQUEST:

  • SEQUEST™ HT, which is the default search engine in Proteome Discoverer™, a proteomics platform distributed by Thermo Fisher Scientific.


  • Comet, an optimized version of SEQUEST, which runs under two different platforms: Crux and the Trans-Proteomic Pipeline (TPP).


A summary of SEQUEST’s evolution is nicely described in a dedicated review article.


It is important to mention that when the SEQUEST algorithm was published in 1994, fully sequenced genomes were not available. GenPept, the database used in the publication, had a limited number of curated gene sequences from Saccharomyces cerevisiae and Escherichia coli.


This all changed when the full genomes of these two microbes were completed years later (1996 and 1997, respectively). The completion of larger eukaryotic genomes had to wait until around 2000, and the human genome, in its revised version, until 2003.


 

PROBABILISTIC SEARCH ENGINES

The “probabilistic generation” of search engines started in 1999 with the publication of Mascot®, developed by Matrix Science.


Mascot® inherited the cross-correlational PSM concept from SEQUEST. What changed was the implementation of a probability score to estimate the statistical significance of the PSM peptide sequence assignments. This and other probability scores developed thereafter estimate the probability that a PSM occurred by chance.
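Because Mascot's exact model was never disclosed (more on this below), the idea can only be illustrated generically. In the toy model below, which is my own and not Mascot's algorithm, each of n observed fragment ions matches a random candidate peptide's fragments with probability p, so a binomial survival function gives the chance of seeing k or more matches; scores of the familiar form -10·log10(P) follow directly.

```python
# Generic toy model of a "PSM by chance" probability; not Mascot's algorithm.
from math import comb, log10

def chance_pvalue(k: int, n: int, p: float) -> float:
    """P(at least k of n observed fragments match by chance), binomial model."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

pval = chance_pvalue(k=12, n=20, p=0.05)         # hypothetical numbers
print(f"P = {pval:.2e}, score = {-10 * log10(pval):.1f}")  # Mascot-style score
```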


Mascot® quickly became popular due to its probabilistic score, because it could be parallelized with cluster computing, and because it had user-friendly visualization tools.


The problem with Mascot® was that the details of its probabilistic model were not disclosed, impeding an objective evaluation of the algorithm’s performance.


In response to Mascot’s undisclosed algorithm, several open-source probabilistic search engines appeared during the early 2000s.


Table 1 below summarizes the most popular search engines, based on the cross-correlational concept pioneered by SEQUEST.


 


ERROR RATE CONTROL

Once Mascot® became popular, and as efforts to refine the statistical scoring model embedded in SEQUEST proceeded, it became clear that much more needed to be done to better control peptide- and protein-level error rates.


Important improvements to search engine workflows were developed during 2002-2007.


It should be noted that the TPP was the first proteomics platform to centralize the use of machine learning algorithms such as PeptideProphet, ProteinProphet, and Percolator, along with many other tools for statistical data exploration and validation. The platform is still in use and is compatible with many search engines, including Comet, SEQUEST, Mascot®, and X!Tandem (http://www.tppms.org/).
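The target-decoy strategy (TDS) underpins much of this error rate control, and its core arithmetic fits in a few lines. The sketch below is my own minimal version, not any platform's implementation: decoy sequences (typically reversed proteins) are searched alongside the targets, the FDR at any score threshold is estimated as the number of decoy hits divided by the number of target hits above it, and q-values make that estimate monotonic.

```python
# Minimal sketch of target-decoy FDR estimation; not a platform's exact code.
def target_decoy_qvalues(psms: list[tuple[float, bool]]) -> list[float]:
    """psms: (score, is_decoy) pairs. Returns q-values in descending-score order."""
    ranked = sorted(psms, key=lambda x: x[0], reverse=True)
    targets = decoys = 0
    fdr = []
    for _, is_decoy in ranked:
        decoys += is_decoy
        targets += not is_decoy
        fdr.append(decoys / max(targets, 1))   # decoy hits / target hits so far
    # q-value: the smallest FDR at which each PSM would still be accepted
    for i in range(len(fdr) - 2, -1, -1):
        fdr[i] = min(fdr[i], fdr[i + 1])
    return fdr

# e.g., accept PSMs with q <= 0.01 for an estimated 1% FDR
```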


 

 

MAXQUANT/ANDROMEDA

The proteomics platform MaxQuant was developed in 2008 to take full advantage of the high mass accuracy and resolution raw data that started to become available after the commercialization of the Orbitrap mass analyzer in 2005.


Initially, MaxQuant was bundled with Mascot® for PSM searches. The MaxQuant/Mascot® pipeline was used to analyze the first comprehensively measured proteome, that of Saccharomyces cerevisiae.


Bundling Mascot® with MaxQuant was nontrivial, limiting its use to a handful of reference laboratories.


MaxQuant's popularity took off in 2011 with the development of Andromeda, a search engine with unique features. The MaxQuant/Andromeda pipeline quickly became widely used in the field, given its performance and user-friendly interface. It was later used to analyze one of the Human Proteome Drafts published in 2014.


 

FRAGPIPE/MSFRAGGER

The proteomics platform FragPipe and its search engine MSFragger were published in 2017. FragPipe has been continuously improved and is currently among the few traditional search engines featuring deep learning algorithms (https://fragpipe.nesvilab.org/).


FragPipe can be considered a hybrid of the Trans-Proteomic Pipeline and MaxQuant, primarily for three reasons:

  • It takes in raw data in open XML file formats (e.g., mzML and mzXML)

  • It uses the machine learning algorithms PeptideProphet, ProteinProphet, and Percolator to refine the peptide-spectrum match score

  • It takes advantage of isotopic resolution to extract precursor ion features and recalibrate masses (sketched below)
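The recalibration idea in the last point can be sketched in a few lines. This is my own hypothetical illustration of the general idea, not FragPipe's actual algorithm: fit the systematic part of the mass error of confidently identified precursors as a function of m/z, then subtract it from every measurement.

```python
# Hypothetical sketch of precursor mass recalibration; not FragPipe's code.
import numpy as np

def recalibrate(psm_mz: np.ndarray, psm_ppm_error: np.ndarray,
                all_mz: np.ndarray) -> np.ndarray:
    """Fit the ppm error of confident PSMs vs. m/z, then correct all masses."""
    slope, intercept = np.polyfit(psm_mz, psm_ppm_error, 1)  # linear trend
    predicted_ppm = slope * all_mz + intercept
    return all_mz * (1.0 - predicted_ppm * 1e-6)             # corrected m/z
```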


The FragPipe/MSFragger pipeline has two unique attributes, both illustrated in the sketch after this list:

  • Ultrafast PSM searches by means of fragment ion indexing

  • Open searches, which can discover unexpected peptide modifications
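Both attributes can be illustrated with a toy fragment-ion index. The sketch below captures the published concept, not MSFragger's actual implementation (the bin width and names are invented): every theoretical fragment mass goes into a bin keyed by its integerized mass, so each observed peak retrieves candidate peptides directly, and because candidates are gathered from fragments rather than from a narrow precursor window, the same index serves wide-tolerance "open" searches.

```python
# Toy fragment-ion index; a sketch of the concept, not MSFragger's code.
from collections import defaultdict

BIN = 0.02  # bin width in Da (invented for illustration)

def build_index(peptide_fragments: dict[str, list[float]]):
    """Map integerized fragment mass -> set of peptides containing it."""
    index = defaultdict(set)
    for peptide, frags in peptide_fragments.items():
        for mz in frags:
            index[round(mz / BIN)].add(peptide)
    return index

def query(index, observed_peaks: list[float]) -> dict[str, int]:
    """Count matched fragments per candidate peptide. No precursor filter is
    applied, which is what makes wide-window "open" searches cheap."""
    hits = defaultdict(int)
    for mz in observed_peaks:
        key = round(mz / BIN)
        matched = set()
        for k in (key - 1, key, key + 1):   # tolerate bin-edge effects
            matched |= index.get(k, set())
        for peptide in matched:
            hits[peptide] += 1
    return hits
```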


 

THE PROTEOME’S MANY IMAGINARY FRIENDS

When stress-tested with unusually large datasets or in “open search” mode, Proteome Discoverer™, MaxQuant, and FragPipe have, each in a different way, opened a Pandora's box of unanticipated issues with false discovery rate (FDR) control.


MaxQuant and Proteome Discoverer™. 

The completion of the Human Proteome Draft by two independent teams in 2014, one using MaxQuant and the other Proteome Discoverer™, revealed weaknesses in false discovery rate calculations, especially at the protein level, when analyzing very large datasets.


FragPipe. 

The publication describing MSFragger revealed contradictory peptide identification results when “open” and “closed” PSM workflows were compared. These results suggested weaknesses in the target-decoy search concept for FDR control, which had been widely adopted since its proposal in 2007.


The above has prompted a reevaluation of false discovery rate estimation methods, including alternatives to the decoy strategy currently in use.


Recent preprints on bioRxiv address this issue.

 

This is a complex and very important topic; addressing it is nontrivial and deserves to be discussed in a separate blog post.


Stay tuned!

GPR

 

 

 

 

 
