stitchConCorD: A Bioinformatics Tool for Validating Identity of Low-Coverage Sequencing Data
Pairwise relatedness metrics typically rely on correlating genotype calls, so they are not applicable to data in which genotyping is unreliable. Low-coverage (0.1x to 1.0x) sequencing data can be used for screening techniques, but validating the identity of such samples is difficult with tools that require high-quality genotype data. We developed stitchConCorD, a bioinformatics tool that measures sample identity given known genotypes and low-coverage sequencing data. stitchConCorD can also identify contamination in the sequencing data produced for a research study. The tool selects reliable homozygous alternate positions in the known genotypes, examines those positions in the low-coverage read data, and generates a ratio that reflects the concordance between the two data sets. We tested stitchConCorD on 64 comparisons of whole-exome and cell-free DNA samples, which served as the high-coverage and low-coverage data, respectively. The tool successfully identified every sample mismatch in our test data set. We observed a bimodal distribution in the resulting scores for the mismatched data. This finding was attributed to patient ethnicity: mismatched samples from patients of different ethnicities had lower concordance ratios than mismatched samples from patients of the same ethnicity. stitchConCorD acts as a failsafe against misidentification of patient data, which increases the reliability of the studies to which it is applied.
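
The site selection and concordance ratio described above can be illustrated with a minimal Python sketch. This is not the stitchConCorD implementation: the abstract does not give the exact formula, so the sketch assumes the ratio is the fraction of low-coverage reads carrying the alternate allele at the selected homozygous alternate sites. The names KnownCall, select_hom_alt_sites, and concordance_ratio, the genotype-quality threshold of 30, and the in-memory dictionaries standing in for VCF and read data are all hypothetical.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple

Site = Tuple[str, int]  # (chromosome, position)


class KnownCall(NamedTuple):
    """One call from the known (high-coverage) genotypes. Hypothetical layout."""
    genotype: str   # e.g. "0/0", "0/1", "1/1"
    alt: str        # alternate allele base
    quality: float  # genotype quality score


def select_hom_alt_sites(
    calls: Dict[Site, KnownCall], min_quality: float = 30.0
) -> Dict[Site, str]:
    """Keep homozygous alternate calls at or above an assumed quality threshold."""
    return {
        site: call.alt
        for site, call in calls.items()
        if call.genotype in ("1/1", "1|1") and call.quality >= min_quality
    }


def concordance_ratio(
    hom_alt_sites: Dict[Site, str], observed_bases: Dict[Site, List[str]]
) -> Optional[float]:
    """Fraction of low-coverage reads at the selected sites that carry the
    expected alternate allele. Returns None when no read covers any site."""
    alt_reads = 0
    total_reads = 0
    for site, alt in hom_alt_sites.items():
        bases = observed_bases.get(site, [])
        total_reads += len(bases)
        alt_reads += sum(1 for base in bases if base == alt)
    if total_reads == 0:
        return None
    return alt_reads / total_reads


# Toy data, invented purely for illustration.
known = {
    ("chr1", 100): KnownCall("1/1", "T", 60.0),
    ("chr1", 200): KnownCall("0/1", "G", 60.0),  # heterozygous: not selected
    ("chr2", 50): KnownCall("1/1", "A", 10.0),   # low quality: not selected
    ("chr2", 75): KnownCall("1/1", "C", 45.0),
}
matched_reads = {("chr1", 100): ["T", "T"], ("chr2", 75): ["C"]}
mismatched_reads = {("chr1", 100): ["A", "T"], ("chr2", 75): ["G", "G"]}

sites = select_hom_alt_sites(known)
matched_score = concordance_ratio(sites, matched_reads)        # 3 of 3 reads
mismatched_score = concordance_ratio(sites, mismatched_reads)  # 1 of 4 reads
```

Under these assumptions, a matched pair should score close to 1, because the true sample carries the alternate allele on both chromosomes at every selected site, while a mismatched pair should score lower. Pooling reads across sites, rather than calling a genotype at each site, is what lets a ratio of this kind work at 0.1x to 1.0x coverage, where most sites are covered by at most one read.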