You’ve Got Mail: Automating the Pipeline Quality Control Alert System
Errors are inevitable in genome sequencing, but they often go unnoticed until suspicious inconsistencies appear after analysis. In fact, a published study estimates that mislabelling alone affects 4% of the samples in a given dataset. Such errors cost time and resources; quality control measures in the data analysis pipeline can improve productivity by identifying issues early. Although statistics indicative of data quality, such as the level of contamination and oxidation artifacts, are currently generated in phoenix, most users do not actively check them. The Python package and command-line tool sendqc, which can be found in the TGen GitHub repository, collects and interprets these quality control statistics and generates warning messages that can be emailed to users. Sendqc is designed with extensibility in mind to cover a wide range of metric tools, including Picard, SAMtools, and the newest addition, peddy, which performs a sex and ethnicity check on viable samples. These tools run in parallel with the analysis process and thus do not increase wait time. Because little data has passed through the system so far, the exact benefits of reporting quality control metrics to the user are currently unclear and can only be quantified after long-term use.
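The collect, interpret, and warn flow described above can be illustrated with a minimal Python sketch. This is not sendqc's actual code or API: the function names, metric names, and threshold values below are hypothetical and chosen only for illustration. The sketch builds an email message but does not send it.

```python
from email.message import EmailMessage

# Hypothetical metric names and cutoffs, for illustration only;
# real limits depend on the assay and on the lab's standards.
THRESHOLDS = {
    "contamination_fraction": 0.02,
    "oxidation_error_rate": 0.01,
}


def collect_warnings(sample, metrics, thresholds=THRESHOLDS):
    """Return one warning string per metric that is missing or over its limit."""
    warnings = []
    for name, limit in thresholds.items():
        value = metrics.get(name)
        if value is None:
            warnings.append(f"{sample}: {name} is missing")
        elif value > limit:
            warnings.append(f"{sample}: {name} = {value} exceeds limit {limit}")
    return warnings


def build_email(recipient, warnings):
    """Assemble (but do not send) an alert email listing the warnings."""
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"QC alert: {len(warnings)} warning(s)"
    msg.set_content("\n".join(warnings) or "No quality control warnings.")
    return msg


if __name__ == "__main__":
    # Made-up example values for a single sample.
    found = collect_warnings(
        "sample_01",
        {"contamination_fraction": 0.05, "oxidation_error_rate": 0.001},
    )
    print(build_email("user@example.org", found))
```

Keeping the thresholds in a plain dictionary means that supporting a new metric tool only requires adding entries, which is one simple way to achieve the kind of extensibility the abstract mentions.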