To trim or not to trim? Optimizing the detection of fusion transcripts
In cancer patients, structural DNA rearrangements that change gene orientations can lead to transcription of hybrid RNA molecules (fusion transcripts) and production of novel proteins that promote tumor development. Both whole-genome DNA sequencing and RNA sequencing can detect structural events, but RNA sequencing is much less expensive, so accurate fusion transcript detection is important for cancer researchers. Various open-source programs identify fusions in RNA sequencing libraries, but most have high false positive (FP) rates. Two potential sources of false positives are low-quality sequencing reads and untrimmed adapters. Typical RNA libraries contain inserts (fragmented RNA) that are shorter than the uniform read length, so the remainder of the read is filled with synthetic adapter sequence. These adapter sequences could cause false positives if they map incorrectly to the DNA reference and prompt tools to report unsupported fusions. To test these possible FP sources, we ran two fusion-finding tools (TopHat-Fusion, STAR-Fusion) on two input read types (untrimmed, trimmed) and determined whether trimming the RNA input lowered each tool's FP rate and thus improved the accuracy of fusion detection.
The untrimmed reads were obtained from 10 cell lines, and we expected some to contain synthetic adapter sequences and low-quality bases introduced during sequencing. Trimming, that is, removing those adapters and bases with a separate program (Trimmomatic), consumed additional computation time (6 CPU-hours per 1.8 billion paired-end reads) and resulted in 27.7–36.3% data loss. We validated results by comparing the fusions identified by each tool to the corresponding whole-genome sequencing data, and calculated FP rates for untrimmed and trimmed inputs.
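The validation step can be illustrated with a minimal sketch. It assumes, since the abstract does not spell this out, that the FP rate is the fraction of called fusions with no matching event in the whole-genome data, and that a fusion is matched by its gene pair regardless of order. The function name and the example gene pairs are hypothetical and are not data from this study.

```python
def false_positive_rate(called, validated):
    """Return the fraction of called fusions absent from the validated set.

    Assumption: a fusion is identified by an unordered pair of gene names,
    and a call counts as a false positive when whole-genome sequencing
    shows no rearrangement joining that pair.
    """
    called_pairs = {frozenset(pair) for pair in called}
    validated_pairs = {frozenset(pair) for pair in validated}
    if not called_pairs:
        raise ValueError("no fusion calls to evaluate")
    unsupported = called_pairs - validated_pairs
    return len(unsupported) / len(called_pairs)


# Hypothetical example: four calls, two supported by whole-genome data.
called = [("BCR", "ABL1"), ("EML4", "ALK"), ("GENE_A", "GENE_B"), ("GENE_C", "GENE_D")]
validated = [("ABL1", "BCR"), ("EML4", "ALK")]
fp_rate = false_positive_rate(called, validated)
print(f"FP rate: {fp_rate:.1%}")
```

A stricter comparison could also require matching breakpoint coordinates or gene order; the unordered gene-pair match is only the simplest choice for illustration.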
For TopHat-Fusion, the FP rate was 62.0% for untrimmed reads and 61.9% for trimmed reads (p=0.74), while for STAR-Fusion the rates were 79.7% and 85.8%, respectively (p<0.001). Thus, trimming made no significant difference to the FP rate with TopHat-Fusion and significantly increased the FP rate with STAR-Fusion. Although the effect of trimming on FP rate varies between fusion-finding tools, spending the computation time to trim did not improve the accuracy of fusion identification with either tool. Because TGen’s analysis pipeline uses TopHat-Fusion without trimming steps, our results suggest that no protocol changes regarding RNA read trimming are necessary for fusion transcript detection.
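The abstract does not name the statistical test behind the reported p-values. As one plausible illustration only, the sketch below compares two FP rates with a pooled two-proportion z-test, which treats the untrimmed and trimmed call sets as independent samples. The call counts are hypothetical, chosen to mirror the reported percentages, and the function name is an assumption.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided pooled z-test for the difference between two proportions.

    x1, x2: number of false positive calls in each condition
    n1, n2: total number of fusion calls in each condition
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; proportions are degenerate")
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts (NOT from the study): 1000 calls per condition.
z_tophat, p_tophat = two_proportion_z_test(620, 1000, 619, 1000)
z_star, p_star = two_proportion_z_test(797, 1000, 858, 1000)
print(f"TopHat-Fusion: z={z_tophat:.2f}, p={p_tophat:.3g}")
print(f"STAR-Fusion:   z={z_star:.2f}, p={p_star:.3g}")
```

Because both input types derive from the same 10 cell lines, a paired test on the calls shared between the two runs (for example, McNemar's test) may be more appropriate; the independent-sample form is shown only because it is the simplest to follow.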