A significant improvement in preprocessing efficiency in somatic variant calling
Somatic variant calling is the process of identifying variants in somatic, or non-reproductive, cells. These alterations in DNA can cause cells to malfunction, which in some cases leads to unregulated cell division, or cancer. Calling (identifying) these variants is therefore crucial to understanding the nature of cancer on a patient-by-patient basis. We focus on a variant caller known as “LumosVar,” which suffers from much longer execution times than the other variant callers in TGen’s main genomics pipeline (MuTect, Seurat, and Strelka). The main culprit is a preprocessing step with several inefficiencies that we believed could be sped up considerably. We present a rewrite of this step that significantly reduces its execution time by replacing a slow Perl mpileup parser with a direct C implementation built on HTSlib (High-Throughput Sequencing library). While the old implementation could take upwards of eight hours to run, the new implementation completes in around 20 minutes. This speedup brings LumosVar’s total execution time much closer to that of the other three callers, making it a more attractive addition to the pipeline. This matters because LumosVar classifies variants using data that the other callers do not use, which may allow it to avoid false positives that those callers generate.