Quality control in discovery proteomics
Discovery liquid chromatography-mass spectrometry (LC-MS) is a largely automated process that relies on continuous data acquisition from patient samples prepared using proteomics approaches. Current instrumentation regularly produces up to 24 GB of data per day, including raw spectral information, prior to data analysis and knowledge mining. Proteomics workflows chain complex processes, from biospecimen collection through sample handling and preparation to data acquisition and bioinformatics analysis. Each step introduces inherent and operator-dependent variability (e.g. preanalytical variables, instrument drift, pipetting error). Stepwise quantitation of errors, aggregated into global error rates, provides an estimate of variability and the thresholds essential to differential analysis. When implemented in a comprehensive quality control (QC) framework, such estimates provide a means to monitor process-specific discrepancies and offer decision support to guide resolution (e.g. data normalization, re-runs). While several QC strategies have been devised, until now no ad hoc methods to compute and track quality metrics in biomarker discovery sequences have been proposed.
The focus of this project was to implement a robust approach capable of qualifying discovery protein/peptide matches from sequential LC-MS runs and of providing on-the-fly estimates of run quality during data acquisition. We tiered high-, medium-, and low-abundance proteins in a high-quality dataset of thirty-one technical re-injections of E. coli, a common QC standard used to bracket discovery runs. Abundance was determined by overall frequency of occurrence and validated against peptide frequencies. The coefficient of variation (%CV) was then computed for each peptide or protein spectral match. From low to high abundance, average %CV ranged from 11% to 34%. Using our tiered panel of quality control markers, we computed Pearson's correlation coefficient between bracketed QC runs from retrospective study datasets acquired over multiple consecutive weeks. Each of these had previously passed gold-standard quality control. With our method, medium- and low-abundance markers highlighted minute changes in data quality early on, while more drastic changes appeared in our high-abundance marker panel in later quality control brackets. These changes were linked to a progressive drop in instrument sensitivity that had not been picked up by common QC approaches. We implemented our method as a collection of R scripts, which will be integrated into a quality control framework currently under development at the Center for Proteomics to provide early assessment of overall proteomics data quality.
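The two statistics named above, %CV per marker and Pearson's correlation between QC runs, can be illustrated with a minimal sketch. This is not the project's R implementation: it is written in Python for illustration, the marker names and spectral counts are made up, and the helper qc_correlation (which compares one QC run against the mean profile of the reference injections) is a hypothetical choice of how runs might be compared.

```python
from math import sqrt
from statistics import mean, stdev


def percent_cv(values):
    """Coefficient of variation in percent: 100 * sample SD / mean."""
    m = mean(values)
    if m == 0:
        raise ValueError("mean is zero; %CV is undefined")
    return 100.0 * stdev(values) / m


def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sequence has no defined correlation")
    return sxy / sqrt(sxx * syy)


def qc_correlation(reference, qc_run):
    """Correlate one QC run with the mean reference profile.

    Hypothetical helper: `reference` maps marker -> counts across reference
    injections, `qc_run` maps marker -> count in the run under evaluation.
    Only markers present in both are compared.
    """
    markers = sorted(set(reference) & set(qc_run))
    ref_profile = [mean(reference[m]) for m in markers]
    run_profile = [qc_run[m] for m in markers]
    return pearson(ref_profile, run_profile)


# Made-up spectral counts for three illustrative markers (not real data).
reference = {
    "marker_hi": [200, 210, 190],
    "marker_mid": [50, 55, 45],
    "marker_lo": [5, 6, 4],
}
qc_run = {"marker_hi": 400, "marker_mid": 100, "marker_lo": 10}

cv_by_marker = {m: percent_cv(counts) for m, counts in reference.items()}
r = qc_correlation(reference, qc_run)
```

In a tiered setup, the same correlation could be computed separately for each abundance tier, so that a fall in the low- or medium-abundance tier is visible before the high-abundance tier is affected.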