Detection of contaminants and automation of classification in multiple myeloma tumor samples
Multiple myeloma (MM) is a cancer of plasma cells, white blood cells that each produce a single, unique immunoglobulin (IG). Because myeloma arises through clonal expansion of tumor cells, a pure myeloma sample should in theory express only one IG RNA sequence, and additional IG RNA indicates decreased sample purity. Expression of genes that are not expressed in plasma cells can likewise indicate contamination by non-B lineage cells. Using these principles, we developed the multiple myeloma purity checker, a tool to estimate the purity of samples from the Multiple Myeloma Research Foundation CoMMpass study. The purity checker is a Python-based program that takes an alignment file containing RNA sequencing reads from a myeloma sample and uses the Subread tool featureCounts to obtain RNA read counts for IG genes and known contaminants. The program processes this output, generates a set of graphs for easy visualization of the results, calculates the percentage of non-B lineage contaminants, and returns a tab-delimited output file that includes an assessment of sample clonality. Of 986 samples tested, the program classified 41.7%; the remaining samples required manual review. Among the classified samples, more than 99% were classified correctly, and 99.2% were non-polyclonal. The mean non-B contamination for the cohort was 1.68%, with a standard deviation of 1.39%, indicating that about 95% of samples had a contamination level below 4.46% (the mean plus two standard deviations). Among samples known to be monoclonal, 99.7% had contamination below 2.13%. Although this tool allowed us to confirm the overall purity of samples in the CoMMpass data set, future work will aim to increase the percentage of samples the tool can classify, eliminating the need for manual review.
The primary challenge will be to accomplish this without sacrificing the current accuracy achieved through the stringent requirements of the classifier.
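
As an illustration of the contamination calculation described above, the following minimal Python sketch parses a featureCounts-style table and reports the share of reads assigned to contaminant genes. It is not the purity checker itself. The function names, the toy input, and the use of HBB and CD3E as contaminant markers are hypothetical placeholders, and the definition of the percentage (contaminant reads divided by all reads counted across the gene panel) is an assumption, since the exact formula is not stated here. The last helper simply reproduces the mean-plus-two-standard-deviations threshold.

```python
import csv
import io


def read_feature_counts(handle):
    """Parse a featureCounts-style table into {gene_id: count}.

    featureCounts writes a leading comment line starting with '#', then a
    tab-delimited header (Geneid, Chr, Start, End, Strand, Length) followed
    by one count column per input alignment file. Only the first count
    column is read here.
    """
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    sample_col = reader.fieldnames[6]  # first column after the six annotation columns
    return {row["Geneid"]: int(row[sample_col]) for row in reader}


def non_b_contamination_percent(counts, contaminant_genes):
    """Assumed definition: contaminant reads as a percentage of all counted reads."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads were counted")
    contaminant = sum(counts.get(gene, 0) for gene in contaminant_genes)
    return 100.0 * contaminant / total


def upper_threshold(mean, sd, k=2.0):
    """Mean plus k standard deviations, as used for the 4.46% figure."""
    return mean + k * sd


# Toy input; gene choices and counts are illustrative only.
EXAMPLE = (
    "# Program:featureCounts (toy example)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
    "IGKC\tchr2\t1\t100\t-\t100\t9000\n"
    "IGLC2\tchr22\t1\t100\t+\t100\t850\n"
    "HBB\tchr11\t1\t100\t-\t100\t100\n"
    "CD3E\tchr11\t1\t100\t+\t100\t50\n"
)

counts = read_feature_counts(io.StringIO(EXAMPLE))
contamination = non_b_contamination_percent(counts, {"HBB", "CD3E"})
threshold = upper_threshold(1.68, 1.39)

if __name__ == "__main__":
    print(f"non-B contamination: {contamination:.2f}%")
    print(f"mean + 2 SD threshold: {threshold:.2f}%")
```

In the toy table, 150 of 10,000 counted reads fall on the placeholder contaminant genes, giving 1.5%. A real run would substitute the tool's own contaminant gene list and its actual normalization.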