An Efficient Approach to Identify Optical Duplicates
Correctly marking duplicates, or copies of a single DNA molecule, is imperative to maintaining statistical integrity and reducing bias in DNA sequencing analysis. It is also necessary to distinguish two types of duplicates: optical/platform duplicates, which are identified based on physical distance on the sequencing flowcell, and PCR duplicates, which are created by PCR amplification steps during library preparation. The distinction matters because laboratory optimizations to minimize each type of duplicate require independent procedural changes. Although tools such as Picard/GATK MarkDuplicates differentiate between optical and PCR duplicates, Picard/GATK is computationally intensive, requiring approximately 12 hours of computational time for a standard human genome. This motivates a significantly faster tool that maintains equivalent performance and is easy to use. The result is FindOpDups, a tool that, when added to the TGen Phoenix workflow*, is 4.5 times faster than the leading implementation and delivers similar results.
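To make the optical versus PCR distinction concrete, the following Python sketch labels reads within a single, already identified duplicate set. It is an illustration only and is not the FindOpDups implementation. It assumes Illumina-style read names of the form instrument:run:flowcell:lane:tile:x:y, treats the first read in the set as the retained original, and calls a duplicate optical when it lies on the same flowcell, lane, and tile as an earlier read and within a per-axis pixel threshold. The function names and the default threshold of 100 pixels are assumptions chosen for the example.

```python
from typing import Dict, List, NamedTuple


class ReadLocation(NamedTuple):
    """Physical origin of a read on the flowcell."""
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int


def parse_location(read_name: str) -> ReadLocation:
    """Parse an Illumina-style name: instrument:run:flowcell:lane:tile:x:y."""
    fields = read_name.split(":")
    if len(fields) < 7:
        raise ValueError(f"unexpected read name format: {read_name!r}")
    flowcell, lane, tile, x, y = fields[2:7]
    return ReadLocation(flowcell, int(lane), int(tile), int(x), int(y))


def is_optical_pair(a: ReadLocation, b: ReadLocation, max_distance: int) -> bool:
    """True when both reads share a tile and are close on both axes."""
    same_tile = (a.flowcell, a.lane, a.tile) == (b.flowcell, b.lane, b.tile)
    return (
        same_tile
        and abs(a.x - b.x) <= max_distance
        and abs(a.y - b.y) <= max_distance
    )


def classify_duplicate_set(
    read_names: List[str], max_distance: int = 100
) -> Dict[str, str]:
    """Label every read after the first as 'optical' or 'pcr'.

    Assumes read_names all belong to one duplicate set (for example,
    identical alignment coordinates) and that the first is the original.
    """
    locations = [parse_location(name) for name in read_names]
    labels: Dict[str, str] = {}
    for i in range(1, len(locations)):
        near_earlier_read = any(
            is_optical_pair(locations[i], locations[j], max_distance)
            for j in range(i)
        )
        labels[read_names[i]] = "optical" if near_earlier_read else "pcr"
    return labels


# Hypothetical read names for one duplicate set.
example_set = [
    "M1:7:FC1:1:1101:1000:2000",  # treated as the original
    "M1:7:FC1:1:1101:1050:2040",  # same tile, 50 and 40 pixels away
    "M1:7:FC1:1:2205:9000:500",   # different tile
]
example_labels = classify_duplicate_set(example_set)
```

The appropriate distance threshold depends on the sequencing platform, so it is exposed as a parameter rather than fixed. The pairwise comparison is quadratic in the size of a duplicate set, which is acceptable for a sketch but is the kind of cost a faster tool would need to avoid.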