Computational approach to population-specific structural variation and associated disease biology
The current human reference genome, derived mostly from a single male donor, comprises 57% European, 37% African, and 6% East Asian ancestry. Consequently, it is not representative of global genetic diversity and will likely impede efforts to bring precision medicine to all populations. In this study, we focused on identifying genetic variation missing from the current reference genome that may help us learn more about diseases with higher incidence rates in underrepresented populations (e.g., renal cell carcinoma in Native American communities). In a previous study, Almarri et al. analyzed 911 samples from the Human Genome Diversity Panel representing 54 populations. One of these populations, the Pima Indian community from Northwestern Mexico, was the focus of our work. We examined Variant Call Format (VCF) files for four types of structural variants (deletions, insertions, inversions, and duplications) to find variants that are either enriched or private in the Pima Indian community. Initial analyses focused on deletions and used PLINK2 to calculate the variant allele frequency of each of the 68,098 deletion variants in each population. The resulting frequencies were filtered computationally to find private variants, defined as those with a frequency greater than 0.1 in Pima Indians and 0 in all other populations. Similarly, enriched variants were defined as those with a frequency greater than 0.1 in Pima Indians and a ratio of Pima allele frequency to the maximum allele frequency among the remaining 53 populations greater than 3. The filtered variants were mapped to genes using Ensembl’s BioMart, and the protein-coding genes identified were investigated for their disease biology using GeneCards and NCBI. These analyses identified 36 private and 21 enriched variants, which mapped to 21 protein-coding genes in addition to lncRNAs and processed pseudogenes.
While the protein-coding genes were associated with a broad spectrum of conditions, our results showed that TSPAN8 and SLC30A8 contain population-specific variation, suggesting a potential role in diabetes and/or cancer. Our results confirm the presence of clinically useful population-specific variation, warranting further study, including population-specific sequencing.
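The private and enriched filters described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the input layout (a mapping from variant ID to per-population allele frequencies), the population labels `PopA` and `PopB`, and the toy frequencies are all hypothetical. Only the two thresholds (frequency greater than 0.1, ratio greater than 3) come from the text. The sketch also assumes that the two categories are mutually exclusive, so a variant absent from every other population is labeled private rather than enriched.

```python
from typing import Dict, Optional

MIN_PIMA_AF = 0.1       # frequency threshold stated in the abstract
MIN_ENRICH_RATIO = 3.0  # Pima / max(other) ratio threshold stated in the abstract


def classify_variant(pima_af: float, other_afs: Dict[str, float]) -> Optional[str]:
    """Return 'private', 'enriched', or None for a single variant.

    Assumption (not stated in the source): a variant with frequency 0 in
    every other population is reported as private only, never as enriched.
    """
    if pima_af <= MIN_PIMA_AF:
        return None
    max_other = max(other_afs.values())
    if max_other == 0:
        return "private"
    if pima_af / max_other > MIN_ENRICH_RATIO:
        return "enriched"
    return None


def classify_all(freqs: Dict[str, Dict[str, float]], target: str = "Pima") -> Dict[str, str]:
    """Label every variant that passes one of the two filters."""
    labels = {}
    for variant_id, by_pop in freqs.items():
        others = {pop: af for pop, af in by_pop.items() if pop != target}
        label = classify_variant(by_pop[target], others)
        if label is not None:
            labels[variant_id] = label
    return labels


# Toy frequencies, invented purely for illustration (not study data).
toy = {
    "DEL_1": {"Pima": 0.25, "PopA": 0.00, "PopB": 0.00},  # absent elsewhere
    "DEL_2": {"Pima": 0.40, "PopA": 0.10, "PopB": 0.05},  # ratio of 4
    "DEL_3": {"Pima": 0.20, "PopA": 0.10, "PopB": 0.00},  # ratio of 2
    "DEL_4": {"Pima": 0.05, "PopA": 0.00, "PopB": 0.00},  # below 0.1 in Pima
}
result = classify_all(toy)
print(result)
```

In practice the per-population frequencies would come from the PLINK2 output rather than a hand-written dictionary. Dividing by the maximum frequency among the other populations makes the enrichment test conservative, since a variant must exceed threefold enrichment against its most common comparison population.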