
THE PEPTIDE-SPECTRUM MATCH

Updated: Mar 16

In my previous blogpost “PROTEOMICS SEARCH ENGINES” I wrote about the current state-of-the-art of the algorithms used to analyze shotgun proteomics data, better known as search engines. I focused my discussion on the search engines used to analyze shotgun proteomics spectra collected in data-dependent acquisition (DDA) mode.


In this second installment, I will continue the discussion, focusing on the statistical approaches leveraged for error rate control in peptide sequence assignment from spectra collected in DDA mode.


For those not familiar with proteomics jargon, the abbreviations used throughout the text are:

Data-dependent acquisition (DDA)

False discovery rate (FDR)

Fragment ion masses (MS2)

Mass spectrometry-based proteomics (MS-proteomics), AKA shotgun proteomics

Peptide-spectrum match (PSM)

Post-translational modification (PTM)

Precursor ion (MS1)

Target-decoy competition (TDC)

Target-decoy strategy (TDS)


 

THE ORBITRAP ERA OF SHOTGUN PROTEOMICS

Orbitrap mass analyzers, which appeared in 2005, were a game changer, kick-starting the era of high mass accuracy and high resolution in MS-proteomics data collection.


Orbitrap instruments drove an increase in proteome coverage and protein sequence depth, and provided the level of resolution and mass accuracy required for quantitative proteomics applications.


The field also benefited from the development of MaxQuant (2008) and its in-built search engine Andromeda (2011), the first proteomics platform dedicated to high mass accuracy, high-resolution datasets. MaxQuant also provided, for the first time, a platform that could preprocess large volumes of spectra with isotopic resolution.


The era of high mass accuracy and high resolution in shotgun proteomics soon led to two major breakthroughs.


1. The first publications reporting ~10,000 proteins in cell lines using shotgun proteomics.


2. The publication of two drafts of the human proteome, derived from the analysis of thousands of shotgun proteomics datasets collected from different tissues.


The sudden surge of large-scale datasets raised an alarm among some experts in the proteomics community.


It was suggested that the very nature of high-throughput proteomics, and the impracticality of manually verifying the results, could lead to an accumulation of false identifications in published datasets.


The tipping point came with the two human proteome drafts mentioned above. A reanalysis of the data in these two publications revealed weaknesses in the target-decoy strategy (TDS) implemented for false discovery rate (FDR) estimation, primarily at the protein level.


But science is, in most cases, self-corrective.


Innovative adjustments to the traditional TDS were soon proposed, and the subject continues to be a focus of attention.


Before discussing this matter any further, we must first understand three concepts, which I will address below.

  • Peptide-spectrum match (PSM)

  • Target-decoy strategy (TDS)

  • False discovery rate (FDR)


 

THE PEPTIDE-SPECTRUM MATCH

Central to the analysis of shotgun proteomics data are the PSM and the TDS.


A shotgun proteomics dataset is composed of the mass/charge (m/z) values of the peptide precursor ions (MS1) and their corresponding fragment ions (MS2), along with, among other things, their abundance intensities.
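To make this more concrete, below is a minimal sketch (in Python; the field names are purely illustrative and not taken from any specific software) of how one such spectrum record could be represented:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Ms2Spectrum:
    """One fragmentation spectrum, as described above (illustrative field names)."""
    precursor_mz: float              # MS1 mass/charge of the selected peptide ion
    precursor_charge: int            # charge state inferred from the isotopic envelope
    fragment_mz: List[float]         # MS2 fragment ion m/z values
    fragment_intensity: List[float]  # abundance intensity of each fragment
```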


During the PSM workflow, search engines extract MS1 and MS2 features from the dataset, and compare them to a repertoire of in silico counterparts, which are predicted from a pre-defined reference proteome database.


The reference proteome database is composed of protein sequences (targets) from the proteome of interest. These are concatenated with a collection of in silico reversed or reshuffled versions (decoys) of each target protein sequence.
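As a minimal sketch, assuming a simple reversed-decoy convention (the DECOY_ prefix and the function name are illustrative, not tied to any particular tool), the concatenated target-decoy database could be built like this:

```python
def build_target_decoy(targets: dict) -> dict:
    """Concatenate target protein sequences with their reversed decoys.

    `targets` maps protein accession -> amino acid sequence.
    The returned dictionary holds both targets and decoys, the latter
    flagged with an illustrative 'DECOY_' prefix.
    """
    database = dict(targets)
    for accession, sequence in targets.items():
        database[f"DECOY_{accession}"] = sequence[::-1]  # reversed sequence used as decoy
    return database
```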


The PSM search is prone to erroneous peptide sequence assignments, even when stringent probability scores are used to assess whether a PSM is significant.


Ideally, the error rate would be corrected by manual inspection of the results, but this task is infeasible in shotgun proteomics, considering the volume of information generated.


Instead, search engines make use of the TDS, which is an indirect way of estimating the FDR. While not perfect, this approach is considered by many to be the best available in shotgun proteomics for calculating the PSM error rate.


The TDS is based on the following assumptions:

  • PSMs to target and decoy sequences have an equal probability of occurrence

  • PSMs to decoy sequences are infrequent random events

  • Decoy sequences can be thought of as surrogates of false positive hits


Based on these assumptions, it is possible to estimate the FDR by dividing the number of decoy PSMs by the number of target PSMs.

This is not the only way one can estimate the FDR, though it is the most common.
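As an illustration of this most common variant, and assuming each PSM carries a score and a target/decoy flag (the function and variable names are mine, not those of any particular search engine), the FDR estimate and the corresponding score cutoff could be computed roughly as follows:

```python
def fdr_score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score cutoff at which the estimated FDR stays at or below `max_fdr`.

    `psms` is a list of (score, is_decoy) pairs; higher scores are assumed to be better.
    The FDR at a given cutoff is estimated as (#decoy PSMs) / (#target PSMs) above it.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best_cutoff = float("inf")  # accept nothing if no cutoff satisfies the FDR
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best_cutoff = score  # this cutoff still satisfies the desired FDR
    return best_cutoff
```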


Figure 1 below provides a schematic explanation of how the PSM workflow, when coupled to the TDS, helps estimate the FDR.

Figure 1. The PSM workflow. The search engine extracts theoretical peptides from the target-decoy database and predicts their fragmentation patterns based on the specified protease specificity, mass shifts induced by amino acid modifications, and collisional fragmentation rules. I explain the PSM workflow as having three steps. Step 1 - Theoretical mass selection. Theoretical peptides are chosen for the PSM workflow if their masses match the one calculated for an MS1 spectrum. A narrow mass tolerance window (5-10 ppm) is used here to ensure specificity. Step 2 - PSM prediction. Theoretical MS2 spectra are predicted from the selected peptides and compared to the experimental MS2 spectrum. Step 3 - The target-decoy competition (TDC). PSMs to target and decoy sequences receive a probabilistic score, and the highest-scoring one is chosen for peptide sequence assignment. To estimate the FDR, the number of decoy matches is divided by the number of target ones. The estimated FDR is used to establish a probability score threshold in the target-decoy competition workflow. If, for example, a 1% FDR is desired, the PSM score cutoff is set so that accepted decoy PSMs amount to no more than 1% of accepted target PSMs.
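The narrow mass tolerance window used in Step 1 boils down to a simple relative-error calculation in parts per million; a minimal sketch (illustrative function name, with the default tolerance chosen only as an example):

```python
def within_ppm(observed_mass, theoretical_mass, tolerance_ppm=10.0):
    """True if the relative mass error is within the given ppm window."""
    error_ppm = abs(observed_mass - theoretical_mass) / theoretical_mass * 1e6
    return error_ppm <= tolerance_ppm
```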


 

A CHAIN OF CHANCE

It is important to point out that, while error rate control is purely a computational procedure, a lot can be done during sample preparation, mass spectrometry data collection, and pre-processing feature extraction to help minimize the occurrence of spurious PSMs during data analysis.


Table 1 below summarizes the many factors that may —alone or in combination— be responsible for an erroneous PSM.

Shotgun proteomics workflows are littered with erroneous feature detection events during the data analysis process. Amongst these, wrong precursor ion charge calculation is responsible for a large number of erroneous PSMs (false positive identifications).


Three extreme examples are when: i. the precursor ion charge is wrongly calculated; ii. the mass of an abundant and high-scoring modified peptide happens to be the same as that of an unmodified theoretical peptide in the target proteome database; or iii. a chimeric MS2 spectrum confounds the PSM.
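The first case is easy to appreciate with numbers: the neutral peptide mass is back-calculated from the precursor m/z and the assigned charge, so a wrong charge state shifts the calculated mass by hundreds of Da. A minimal sketch (the m/z value below is made up for illustration):

```python
PROTON_MASS = 1.007276  # Da

def neutral_mass(precursor_mz, charge):
    """Back-calculate the neutral peptide mass from precursor m/z and charge."""
    return (precursor_mz - PROTON_MASS) * charge

mz = 600.30
print(neutral_mass(mz, 2))  # ~1198.59 Da with the correct charge
print(neutral_mass(mz, 3))  # ~1797.88 Da if the charge is miscalculated
```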


As I mentioned above, PSM errors (false positive identifications) are infrequent random events. However, in an unlucky scenario, one random event can lead to another, and another, and another. Now imagine many such random events happening near-simultaneously.


It can therefore be said that large datasets, especially those produced from a poorly prepared sample (low-quality spectra, an unusually high number of chemical modifications), are something like a chain of chance.


Figure 2 below depicts the PSM workflow in the context of the issues mentioned above in this section.


Figure 2. The PSM workflow in the context of infrequent random events.

This figure depicts many of the circumstances in which the PSM workflow becomes error-prone. The left panel shows three cartoons. The top one is an illustration of the features extracted from a precursor ion isotopic envelope (monoisotopic ion and charge determination). The other two cartoons depict common peptide sequence modifications in eukaryotic and bacterial proteomes. The right panel depicts the PSM workflow in the context of a chimeric MS2, shown at the bottom of the figure (red and black vertical lines represent fragment masses from a different MS2). To illustrate the cross-correlational PSM, the mirror comparison of the experimental and predicted MS2 fragment masses is shown in the middle; blue and red lines indicate the matching masses. The experimental MS2 is inverted.


 

There are many opportunities for deep learning models to overcome the caveats in shotgun proteomics data analysis I have mentioned above. Before addressing this subject in a future blogpost, I must talk to you about data loss and ways to control it in shotgun proteomics.


For now I hope this post is clear enough.

Stay tuned!

GPR


Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but we do not use AI to generate text or images.



