Applying Deep Learning to PepSeq: Library Optimization and MHC II Binding Prediction
PepSeq is a platform for the high-throughput analysis of peptide-protein binding. It has been used in a number of applications, including the analysis of Major Histocompatibility Complex (MHC) class II binding to large sets of peptides. MHC class II molecules bind foreign peptides and present them at the cell surface so that cells of the immune system can detect them; understanding their binding behavior is therefore immunologically important. The large amount of data produced by PepSeq makes it a strong candidate for machine learning, whose accuracy typically improves with dataset size. Here we present two deep learning applications for PepSeq-generated data. First, we applied deep learning to normalize peptide abundance across the PepSeq library. The peptides used to evaluate binding are generated via in vitro transcription and translation. Because multiple codons code for the same amino acid, many different nucleotide sequences encode the same peptide. The amount of a peptide that is produced, or its abundance, varies with the nucleotide sequence chosen to encode it. As a result, abundance fluctuates substantially between peptides in the existing PepSeq library, which reduces the efficiency of the experiment. In this project, we trained a deep learning model to predict the abundance of a peptide from its nucleotide sequence. On previously unseen data, the model's predictions correlated with the measured abundance values with a Pearson correlation coefficient of 0.35. We then wrote a script that generates a number of random nucleotide sequences for a given peptide and uses the model to select those whose predicted abundance falls within a narrow range relative to the existing library. With this script, scientists can automate nucleotide sequence design, saving time and improving experimental results.
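The sequence-selection step described above can be illustrated with a minimal sketch. It is not the script used in this work: the function names, the target range, and the candidate count are hypothetical, and `gc_fraction_placeholder` is a stand-in scoring function (GC fraction) used only so the example runs. In practice, the trained abundance model would be passed in its place.

```python
import random

# Standard genetic code, sense codons only (DNA alphabet).
SYNONYMOUS_CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"],
    "D": ["GAT", "GAC"],
    "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
    "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
}
CODON_TO_AA = {
    codon: aa for aa, codons in SYNONYMOUS_CODONS.items() for codon in codons
}


def random_encoding(peptide, rng):
    """Pick one synonymous codon uniformly at random for each residue."""
    return "".join(rng.choice(SYNONYMOUS_CODONS[aa]) for aa in peptide)


def translate(nucleotides):
    """Translate a nucleotide sequence back to its peptide (sanity check)."""
    return "".join(
        CODON_TO_AA[nucleotides[i:i + 3]] for i in range(0, len(nucleotides), 3)
    )


def gc_fraction_placeholder(nucleotides):
    """Placeholder score (GC fraction); NOT the trained abundance model."""
    return sum(base in "GC" for base in nucleotides) / len(nucleotides)


def select_encodings(peptide, predict_abundance, low, high,
                     n_candidates=200, seed=0):
    """Sample random encodings and keep those predicted to fall in [low, high].

    Results are sorted by distance of the prediction from the range midpoint.
    """
    rng = random.Random(seed)
    # A set removes duplicate encodings; sorting keeps the output deterministic.
    candidates = sorted({random_encoding(peptide, rng) for _ in range(n_candidates)})
    midpoint = (low + high) / 2
    scored = [(seq, predict_abundance(seq)) for seq in candidates]
    kept = [(seq, pred) for seq, pred in scored if low <= pred <= high]
    return sorted(kept, key=lambda pair: abs(pair[1] - midpoint))


if __name__ == "__main__":
    # Hypothetical peptide and target range, for illustration only.
    hits = select_encodings("MKTAYIAKQR", gc_fraction_placeholder, 0.45, 0.55)
    for seq, score in hits[:3]:
        print(seq, round(score, 3))
```

Passing the predictor as an argument keeps the sampling and filtering logic independent of any particular model, so the same routine could be reused if the abundance model is retrained.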
Second, we are applying deep learning to PepSeq data to predict MHC class II binding for peptides whose binding is unknown. Existing tools predict MHC class II binding; however, their performance leaves room for improvement. Leveraging the large datasets produced by PepSeq in a deep learning model may improve on that performance. We are currently developing a binary classifier that sorts peptides as likely or unlikely to bind MHC class II.