DATA LOSS AND WAYS TO CONTROL IT
- Genaro Pimienta
- Aug 1, 2024
- 8 min read
This blogpost is a postscript to the previous one, “THE PEPTIDE-SPECTRUM MATCH”, in which I wrote about the peptide-spectrum match (PSM) concept for peptide sequence assignment and the target-decoy strategy (TDS) for false discovery rate (FDR) estimation. I also discussed the inadvertent contribution of spectra from modified peptides to false positive identifications.
In this blogpost I again write about spectra from modified peptides, but in the context of false negative identifications and data loss in proteomics data analysis.
I also present below a brief outline of the so-called "error-tolerant" and "open search" algorithms, which have been developed to overcome the computational bottlenecks in traditional search engines when set to search for unsuspected peptide chemical and post-translational modifications (PTMs).
For those not familiar with proteomics jargon, the abbreviations used throughout the text are:
Fragment ion spectra (MS2)
Peptide-spectrum match (PSM)
Post-translational modification (PTM)
Target-decoy competition (TDC)
Target-decoy strategy (TDS)
I will still define these terms throughout the text.
DATA LOSS
Data loss in shotgun proteomics refers to the portion of fragment ion spectra (MS2) in a dataset left unassigned after data analysis is completed.
More than 70% of the MS2 spectra in a dataset remain unassigned, and a large portion of them are false negative identifications (hits).
Key features in false negative hits are:
Spectra with an informative fragmentation pattern
A correctly identified precursor ion charge and mass
A significant PSM score
Common sources of false negative hits are:
Unspecific protease cleavage sites
Unsuspected peptide modifications (chemical or post-translational)
Amino acid substitutions (mutations or natural variants)
Chimeric spectra from co-eluting precursor ions
SIDE CHAIN PEPTIDE MODIFICATIONS
Side chain peptide modifications can be chemical or post-translational (Figure 1).
Chemical modifications are spontaneous and reflect the reactivity of the chemical moieties on specific amino acid side chains during sample preparation.
Furthermore, chemical modifications can also occur in a biological context when, for example, secreted proteins are exposed to an imbalanced REDOX potential in plasma.
Post-translational modifications (PTMs), on the other hand, are enzyme-mediated and have a regulatory function. PTMs are often transient and have low stoichiometry.
The cartoons in Figure 1 below provide a summary of the most common peptide modifications found in eukaryotic and bacterial proteomes. This figure is an adaptation of Figure 2 in my previous blogpost, “PSM: a Chain of Chance”.

Figure 1. Most common peptide modifications.
THE PTM IDENTIFICATION PROBLEM
To obtain a complete picture of the modification landscape in a proteome, a search engine must consider every possible PTM on all the putative modification sites in the collection of theoretical peptide sequences derived from the proteome under investigation.
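To give a feel for how quickly this search space grows, here is a minimal sketch in Python. The modification table is a made-up illustration (an assumption, not a curated PTM list or any search engine's configuration), and the function name is hypothetical. It simply counts how many modified forms a single peptide sequence can take when every listed modification is treated as variable.

```python
from math import prod

# Hypothetical table: residue -> number of variable modifications allowed on it.
# These entries are illustrative assumptions, not a curated PTM list.
VARIABLE_MODS = {
    "M": 1,  # e.g. oxidation
    "S": 1,  # e.g. phosphorylation
    "T": 1,  # e.g. phosphorylation
    "Y": 1,  # e.g. phosphorylation
    "K": 2,  # e.g. acetylation, methylation
}


def count_peptide_forms(sequence: str, mods: dict[str, int] = VARIABLE_MODS) -> int:
    """Count the distinct modified forms of one peptide sequence.

    Each residue is either unmodified or carries exactly one of its allowed
    modifications, so a residue with k options contributes a factor (k + 1).
    """
    return prod(mods.get(residue, 0) + 1 for residue in sequence)
```

With this toy table, `count_peptide_forms("MSTYK")` gives 48 forms for a single five-residue peptide. Every one of those forms needs its own predicted MS2 spectrum, which is why search engines usually cap the number of variable modifications allowed per peptide.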
When using traditional search engines, an unrestrictive search for all possible PTMs in a proteome is unfeasible, primarily for two reasons:
Prohibitive data analysis completion times
Loss of peptide identification sensitivity due to an imbalance in the FDR estimation strategy
In an unrestricted search for PTMs, the number of predicted MS2 spectra considered is so large that it quickly results in a substantial increase in decoy hits (false positive hits).
To keep the FDR at the indicated value (typically 1%), the target-decoy strategy (TDS) increases the score threshold for a PSM to be considered significant (loss in sensitivity).
By doing this, many significant PSMs go unassigned (false negative hits), and end up in the data loss bin.
Figure 2 below provides a simplified explanation of the limitations mentioned in the above paragraph.

Figure 2. Shown in this figure are the basic components of the PSM in traditional search engines. The target-decoy competition (TDC) in the PSM workflow determines if the best-scoring PSM is a target or a decoy assignment. PTMs have low stoichiometry, meaning that there is always going to be a proportion of unmodified peptide. When multiple PTMs are searched as "variable" (only a portion of the peptide is modified) in the analysis, there is an increase in the number of predicted MS2 possibilities considered in the PSM workflow. When this happens, the chance that a "random" PSM is assigned to a decoy peptide by the TDC step increases. To control the FDR, the TDS raises the score required for a PSM to be considered significant. At a higher score threshold, there is an increase in peptide sequence assignment specificity. PSM stringency, however, has a cost, and many true positive hits are left unassigned.
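The threshold effect described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of any particular search engine: it assumes each spectrum has already been through the target-decoy competition, it estimates the FDR at a cutoff as decoys divided by targets (a common simple estimator), and the function name and input layout are hypothetical.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score cutoff whose estimated FDR stays within max_fdr.

    psms: iterable of (score, is_decoy) pairs, one per spectrum, after the
    target-decoy competition. The FDR at a cutoff is estimated as
    (#decoys with score >= cutoff) / (#targets with score >= cutoff).
    Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = 0
    best = None
    for i, (score, is_decoy) in enumerate(ranked):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Only evaluate a cutoff once all PSMs sharing this score are counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best
```

If the same target PSMs are scored against a larger search space, more decoys land among the high scores and the returned cutoff moves up. Target PSMs that passed before now fall below the threshold, which is the loss of sensitivity discussed in this section.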
THE CURRENT SOLUTION TO THE PTM IDENTIFICATION PROBLEM
A great number of innovative search engines have been developed for the unrestricted identification of PTMs.
When referring to unrestricted search engines in the literature, there is ambiguity in the use of the terms “error-tolerant”, “mass tolerant”, “blind”, and “open”. Discussing this matter is beyond the scope of this blogpost, so I will not elaborate any further.
What makes each of these search engines unique is the type of PSM algorithm used:
Peptide sequence tag
Open search
De novo
Spectral alignment
In the following two sections, I provide a brief description of the peptide sequence tag and open search approaches in the context of ultrafast search engines.
Table 1 below provides a list of most search engines documented so far, with a specification of the type of PSM algorithm implemented.

Table 1 is a list of the most relevant search engines developed so far for the unrestricted identification of unsuspected peptide modifications. The first two columns indicate the algorithm name and its publication date. These are followed by four columns, which indicate the type of PSM strategy: cross-correlation, peptide sequence tag, de novo and spectral alignment. The last column indicates which algorithms include a fragment ion index method.
THE PEPTIDE SEQUENCE TAG APPROACH
If you have read my first blogpost “The Pre-Proteomics Era”, you may recall the algorithms SEQUEST and PeptideSearch. These two algorithms pioneered the analysis of uninterpreted shotgun proteomics spectra in an automated manner.
SEQUEST pioneered the cross-correlation concept, subsequently adapted by most search engines developed so far.
PeptideSearch, on the other hand, introduced the peptide sequence tag concept, which can be operated in error-tolerant mode for the identification of unsuspected PTMs.
The search engine GutenTag, published in 2003, was the first search algorithm to implement the peptide sequence tag approach in a fully automated manner.
In the following years, many more search engines incorporated the peptide sequence tag approach to enable the unrestricted search of unsuspected PTMs.
Roughly half of the search algorithms listed in Table 1 above use the peptide sequence tag concept.
Unrestricted identification of PTMs is possible with Open-pFind, which is an ultrafast search engine based on the peptide sequence tag.
This search engine interfaces the peptide sequence tag method to machine learning algorithms, which boost the PSM workflow in various ways. It also makes use of the fragment ion index concept to speed up the unrestricted PTM identification task.
The power of Open-pFind is underscored by its use in the Chromosome-Centric Human Proteome Project (C-HPP) to help identify recalcitrant “missing proteins”.

Figure 3. Peptide sequence tag approach. The peptide sequence tag concept first implemented in PeptideSearch can be explained as having four steps. Step 1 - Peptide sequence tag extraction. An initial PSM identifies a tag, a string of four or more sequential amino acids in the experimental MS2 spectrum (labeled "tag" in cartoon). Step 2 - Mass value calculation of the N- and C-terminus regions flanking the tag. Region 1 (peptide N-terminus). Region 3 (peptide C-terminus). Step 3 - Mass-restricted PSM of the N- or C-terminus. Either the N- or C-terminus is chosen for mass-restricted PSM and sequence assignment. Mass-restricted means that the fragment masses must add up to the region's (N- or C-terminus) mass value. Step 4 - Error-tolerant PSM of the remaining region (N- or C-terminus). The sequence of the unassigned region is assigned using an error-tolerant / mass-tolerant PSM approach, so that unsuspected PTMs can be identified. An explanation of the peptide sequence tag concept can be found in the publication that described the pioneering algorithm, PeptideSearch, cited in the above paragraph.
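The four steps in Figure 3 can be illustrated with a rough Python sketch. It is a simplification under stated assumptions, not the PeptideSearch or Open-pFind implementation: the tag and the two flanking masses are taken as already extracted from the spectrum, flank masses are plain sums of monoisotopic residue masses (terminal water and proton adjustments are ignored), and the function names and the 0.02 Da tolerance are hypothetical.

```python
# Standard monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_sum(sequence):
    """Sum of residue masses for a stretch of sequence (no terminal groups)."""
    return sum(RESIDUE_MASS[aa] for aa in sequence)


def tag_matches(candidate, tag, n_mass, c_mass, tol=0.02):
    """Match a sequence tag against one theoretical peptide.

    n_mass / c_mass are the measured masses of the regions flanking the tag
    (Step 2). A hit is kept when at least one flank agrees with the candidate
    within tol (Step 3, mass-restricted). The mass difference on the other
    flank is reported as a putative modification (Step 4, error-tolerant).
    Returns a list of (tag_start, n_delta, c_delta) tuples.
    """
    hits = []
    start = candidate.find(tag)
    while start != -1:
        n_delta = n_mass - residue_sum(candidate[:start])
        c_delta = c_mass - residue_sum(candidate[start + len(tag):])
        if abs(n_delta) <= tol or abs(c_delta) <= tol:
            hits.append((start, round(n_delta, 4), round(c_delta, 4)))
        start = candidate.find(tag, start + 1)
    return hits
```

For instance, if the tag "TIDE" is found in the candidate "SAMPLETIDEK", the C-terminal flank matches exactly and the N-terminal flank is about 15.99 Da heavier than expected, the hit is kept and the mass shift points to an unsuspected modification (such as an oxidation) somewhere in the N-terminal region.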
ULTRAFAST OPEN SEARCH ALGORITHMS
The open search approach refers to the use of a wide or unrestrictive peptide mass tolerance window (typically 500 Da) for the selection of theoretical peptides in the cross-correlation PSM workflow. This enables an agnostic survey of every possible PTM on a peptide.
The cartoon in Figure 4 provides a brief description of the open search concept.
Traditional search engines, on the other hand, implement a “closed search” approach, in which a narrow mass tolerance window (5-10 ppm) is used to select theoretical peptides with a predicted mass close to the one calculated for each experimental precursor ion.
SEQUEST HT™ was the first search engine to implement the open search concept, in a 2015 publication aimed at reducing the false negative identification rate (data loss). This study showed that a large proportion of unassigned spectra (lost data) consists of modified peptides.
A next generation of ultrafast open search algorithms has been pioneered by MSFragger, which, like Open-pFind discussed in the previous section, is powered by a fragment ion indexing method to speed up the PSM workflow.
MSFragger is embedded in the computational platform FragPipe, where it is interfaced to various machine learning algorithms that boost the PSM workflow, and to advanced FDR estimation strategies such as the picked FDR approach.

Figure 4. During the PSM workflow, theoretical peptides are selected if their mass matches the one calculated for an experimental precursor ion present in the raw data file. The target-decoy competition (TDC) in the PSM workflow determines if the best-scoring PSM is a target or a decoy assignment. In a closed search, a narrow mass tolerance window, typically 5-10 ppm for data collected on high mass accuracy and resolution instruments (e.g., Orbitrap mass analyzers), keeps the number of theoretical peptides with a mass close to the experimental one at a minimum. This provides sensitivity and specificity to the PSM workflow. In an open search, the mass tolerance window is widened up to about 500 Da. This enables an agnostic survey of theoretical peptides, so that unsuspected PTMs can be identified.
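The difference between the two search modes comes down to one filtering step, sketched below in Python. This is an illustration under assumptions rather than the code of SEQUEST or MSFragger: the function name is hypothetical, the theoretical peptides are passed in as a plain name-to-mass dictionary, and the window can be given either in ppm (closed search) or in Da (open search).

```python
def select_candidates(precursor_mass, theoretical, tolerance, unit="ppm"):
    """Select theoretical peptides whose mass falls inside the tolerance window.

    precursor_mass: neutral mass of the experimental precursor ion, in Da.
    theoretical: dict mapping peptide identifier -> theoretical mass in Da.
    tolerance: half-width of the window, in ppm or in Da depending on unit.
    Returns a list of (peptide, delta_mass) pairs, where delta_mass is the
    experimental mass minus the theoretical mass.
    """
    if unit == "ppm":
        window = precursor_mass * tolerance * 1e-6
    elif unit == "Da":
        window = tolerance
    else:
        raise ValueError("unit must be 'ppm' or 'Da'")
    return [
        (peptide, precursor_mass - mass)
        for peptide, mass in theoretical.items()
        if abs(precursor_mass - mass) <= window
    ]
```

With a narrow ppm window, a peptide carrying an unsuspected modification finds no candidate at all, because its unmodified theoretical mass lies tens of Da away. With a wide Da window the unmodified sequence becomes a candidate again, and the reported delta mass (for example, roughly +80 Da for a phosphorylation) hints at the modification. The price is a far longer candidate list per spectrum, which is what fragment ion indexing is meant to offset.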
Chimeric MS2 spectra derive from co-eluting peptides in complex samples (e.g., plasma proteome), and are an important source of false positive and negative hits. Figure 5 below is an adaptation of Figure 2 in my previous post “PSM: a Chain of Chance”.

Figure 5. This figure depicts the PSM workflow in the context of chimeric MS2. The experimental MS2 has black and red mass ions, which indicate that they derive from a different precursor ion (MS1). Theoretical and experimental MS2 are shown in mirror image with respect to each other.
The chimeric spectra problem is a recalcitrant one in shotgun proteomics, and its incidence increases proportionally to sample complexity, because it derives from co-eluting peptides.
Cutting-edge ultrafast mass spectrometers like the Astral from Thermo Fisher Scientific alleviate this interference but do not eliminate it completely.
A number of machine learning and deep learning methods have been developed to tackle the chimeric spectra problem and other problems in shotgun proteomics.
Deep learning methods in shotgun proteomics will be the topic of my next blogpost.
Stay tuned!
GPR
Disclosure: At BioTech Writing and Consulting we believe in the use of AI in Data Science, but do not use AI to generate text or images.