CERENKOV: Computational Elucidation of the Regulatory Noncoding Variome

Yao, Yao; Liu, Zheng; Singh, Satpreet; Wei, Qi; Ramsey, Stephen A.

doi:10.1145/3107411.3107414

We describe a novel computational approach, CERENKOV (Computational Elucidation of the REgulatory NonKOding Variome), for discriminating regulatory single nucleotide polymorphisms (rSNPs) from non-regulatory SNPs within noncoding genetic loci. CERENKOV is specifically designed for recognizing rSNPs in the context of a post-analysis of a genome-wide association study (GWAS); it includes a novel accuracy scoring metric (which we call average rank, or AVGRANK) and a novel cross-validation strategy (locus-based sampling) that both correctly account for the “sparse positive bag” nature of the GWAS post-analysis rSNP recognition problem. We trained and validated CERENKOV using a reference set of 15,331 SNPs (the OSU17 SNP set) whose composition is based on selection criteria (linkage disequilibrium and minor allele frequency) that we designed to ensure relevance to GWAS post-analysis. CERENKOV is based on a machine-learning algorithm (gradient boosted decision trees) incorporating 246 SNP annotation features that we extracted from genomic, epigenomic, phylogenetic, and chromatin datasets. CERENKOV includes novel features based on replication timing and DNA shape. We found that tuning a classifier for AUPVR performance does not guarantee optimality for AVGRANK. We compared the validation performance of CERENKOV to nine other methods for rSNP recognition (including GWAVA, RSVP, DeltaSVM, DeepSEA, Eigen, and DANQ), and found that CERENKOV’s validation performance is the strongest out of all of the classifiers that we tested, by both traditional global rank-based measures (⟨AUPVR⟩ = 0.506; ⟨AUROC⟩ = 0.855) and AVGRANK (⟨AVGRANK⟩ = 3.877). The source code for CERENKOV is available on GitHub and the SNP feature data files are available for download via the CERENKOV website.
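The abstract names AVGRANK and locus-based sampling without defining them, so the following Python sketch is only one plausible reading and not the authors' implementation. It assumes that AVGRANK is the mean within-locus rank of the true rSNPs when the SNPs of each locus are sorted by descending classifier score (lower is better), and that locus-based sampling means assigning whole loci, never individual SNPs, to cross-validation folds. The function names `avgrank` and `locus_folds`, the round-robin fold assignment, and the toy data are all hypothetical.

```python
from collections import defaultdict


def avgrank(loci, labels, scores):
    """Mean within-locus rank of the positive SNPs (assumed definition).

    Rank 1 is the highest-scoring SNP in its locus, so lower is better.
    Tied scores keep their input order; a real metric may handle ties differently.
    """
    by_locus = defaultdict(list)
    for locus, label, score in zip(loci, labels, scores):
        by_locus[locus].append((score, label))

    ranks = []
    for members in by_locus.values():
        ordered = sorted(members, key=lambda m: m[0], reverse=True)
        for rank, (_, label) in enumerate(ordered, start=1):
            if label == 1:
                ranks.append(rank)
    return sum(ranks) / len(ranks)


def locus_folds(loci, n_folds):
    """Assign each SNP a fold index such that a locus is never split across folds.

    Loci are dealt to folds round-robin in sorted order (an illustrative choice).
    """
    unique = sorted(set(loci))
    fold_of = {locus: i % n_folds for i, locus in enumerate(unique)}
    return [fold_of[locus] for locus in loci]


# Toy data: two loci, one true rSNP (label 1) in each.
loci = ["A", "A", "A", "B", "B"]
labels = [0, 1, 0, 1, 0]
scores = [0.9, 0.5, 0.1, 0.8, 0.3]

toy_avgrank = avgrank(loci, labels, scores)
toy_folds = locus_folds(loci, n_folds=2)
```

In the toy data the rSNP in locus A is ranked second and the one in locus B is ranked first. Under this reading, a classifier can score well on a global measure such as AUPVR while still ranking the rSNP poorly inside individual loci, which is consistent with the abstract's remark that tuning for AUPVR does not guarantee optimality for AVGRANK.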
