A deep learning approach to protein profiling from mass spectral data
Mass spectrometry is currently the industry standard for protein profiling: mass spectra are analyzed to determine the constituent peptides in a sample. However, as samples are processed through the mass spectrometer, 80% of the raw data is filtered out due to mismatches (e.g., mutations, chemical variations) with the proteomics dataset. In addition, spectra are analyzed using predetermined feature extractions.
Deep learning, by contrast, extracts features from the spectra automatically. To apply it, each spectrum was discretized into a one-dimensional vector of 2048 buckets representing peak data, and each input was paired with a labeled output of amino acid data. These data were fed into a convolutional neural network (CNN), a feed-forward neural network that automatically learns filters for spatial features in images or vectors. The CNN was built with the Keras deep learning library and GPU acceleration.
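The discretization step can be illustrated with a minimal sketch. The abstract specifies only the vector length (2048); the m/z range, the rule of keeping the strongest peak per bucket, the normalization, and the function name `discretize_spectrum` are assumptions made for illustration, not the authors' procedure.

```python
def discretize_spectrum(peaks, n_buckets=2048, mz_min=0.0, mz_max=2048.0):
    """Bin (m/z, intensity) peaks into a fixed-length vector.

    Assumptions (not from the source): the m/z range, keeping the
    maximum intensity per bucket, and scaling to the base peak.
    """
    vector = [0.0] * n_buckets
    width = (mz_max - mz_min) / n_buckets
    for mz, intensity in peaks:
        if not (mz_min <= mz < mz_max):
            continue  # drop peaks outside the assumed m/z window
        idx = int((mz - mz_min) / width)
        vector[idx] = max(vector[idx], intensity)
    top = max(vector)
    if top > 0:
        vector = [v / top for v in vector]  # scale so the base peak is 1.0
    return vector


# Made-up peak list: two peaks share a bucket, one lies out of range.
peaks = [(175.119, 300.0), (175.6, 120.0), (1046.5, 600.0), (2500.0, 50.0)]
vec = discretize_spectrum(peaks)
```

A vector of this shape could then be passed to a one-dimensional convolutional layer. Finer buckets preserve more mass resolution at the cost of a longer, sparser input.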
For preliminary validation, a basic CNN was trained to identify spectra whose sequences end in either the R or K amino acid. Once it reached 97% accuracy, the input data were discretized further and more samples were added. Two methods were then implemented to create an amino acid classifier. The individual-model method created a two-class classifier for each amino acid with a binary output (presence/absence). The combined-model method used a single 220-class amino acid classifier with a categorical output (one specific class). The models reached ~97% and ~93% accuracy, respectively.
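The difference between the binary and categorical outputs can be sketched in terms of how training labels might be encoded. This is an illustrative sketch only: the helper names are hypothetical, and because the abstract does not say how the 220 classes of the combined model are defined, the one-hot helper takes the class count as a parameter.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ends_with_r_or_k(peptide):
    """Binary label for the preliminary task: 1 if the C-terminal residue is R or K."""
    return 1 if peptide[-1] in "RK" else 0


def presence_labels(peptide):
    """One binary label per amino acid (individual-model style):
    1 if the residue occurs anywhere in the peptide, else 0."""
    return {aa: int(aa in peptide) for aa in AMINO_ACIDS}


def one_hot(index, n_classes):
    """Categorical label (combined-model style): exactly one class is set."""
    if not 0 <= index < n_classes:
        raise ValueError("class index out of range")
    return [1 if i == index else 0 for i in range(n_classes)]


labels = presence_labels("PEPTIDEK")
```

With per-residue binary labels, each classifier answers an independent yes/no question. A categorical label forces the network to commit to exactly one class per spectrum.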
Additionally, length, diversity, and frequency models were created (~92% accuracy each). The peptides were then further preprocessed into subsequences for training (~96%). Subsequences were determined by each amino acid’s charge, water affinity, and chemical makeup. All of these models can potentially be integrated to determine the complete peptide sequence from a spectrum, thereby improving the yield of identifiable protein sequences from mass spectrometry analysis.
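One way to picture the subsequence preprocessing is to split a peptide into runs of residues that share a property. The sketch below uses side-chain charge only and treats K, R, and H as positive and D and E as negative, which is a common textbook simplification (histidine's charge depends on pH). The grouping rule and the function names are assumptions; the abstract does not describe the actual procedure.

```python
from itertools import groupby

# Simplified side-chain charge classes (assumption; see lead-in).
CHARGE_CLASS = {"K": "+", "R": "+", "H": "+", "D": "-", "E": "-"}


def charge_class(residue):
    """Return '+', '-', or '0' (neutral) for a one-letter residue code."""
    return CHARGE_CLASS.get(residue, "0")


def split_by_property(peptide, classify=charge_class):
    """Split a peptide into maximal runs of residues with the same class."""
    return [(key, "".join(group)) for key, group in groupby(peptide, key=classify)]


segments = split_by_property("PEPTIDEK")
```

Passing a different `classify` function (for example, one based on hydrophobicity) would give an alternative segmentation of the same peptide.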