Evaluation of machine learning models that predict lncRNA subcellular localization

Miller, Jason R. (ORCID:0000000269122925); Yi, Weijun; Adjeroh, Donald A.

doi:10.1093/nargab/lqae125

Citation Details

Abstract: The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, e.g. 72–74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range, and the filters were applied prior to partitioning the data into training and testing subsets. Using several models and lncATLAS data, we show that this ‘middle exclusion’ protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
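The effect of applying the ‘middle exclusion’ filter before partitioning can be illustrated with a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the function names, the Gaussian toy data, the cutoffs of -1.0 and 1.0, and the single-feature threshold classifier are hypothetical and are not the authors' models, features, or lncATLAS values. The accuracies it prints come from the toy data only and do not reproduce the figures reported in the abstract.

```python
import random

def make_genes(n, seed=0):
    """Synthetic (feature, value) pairs; NOT lncATLAS data."""
    rng = random.Random(seed)
    genes = []
    for _ in range(n):
        value = rng.gauss(0.0, 1.0)            # toy localization value
        feature = value + rng.gauss(0.0, 2.0)  # toy noisy predictor
        genes.append((feature, value))
    return genes

def middle_exclusion(genes, low, high):
    """Keep only genes whose value lies outside the open interval (low, high)."""
    return [g for g in genes if g[1] <= low or g[1] >= high]

def split(genes, test_fraction, seed=0):
    """Shuffle and partition into (train, test)."""
    rng = random.Random(seed)
    shuffled = genes[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def label(value):
    """Binary class: 1 if the toy value is positive, else 0."""
    return 1 if value > 0.0 else 0

def fit_threshold(train):
    """Toy classifier: midpoint between the two class means of the feature."""
    pos = [f for f, v in train if label(v) == 1]
    neg = [f for f, v in train if label(v) == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, genes):
    """Fraction of genes whose predicted class matches the true class."""
    correct = sum((f > threshold) == (label(v) == 1) for f, v in genes)
    return correct / len(genes)

genes = make_genes(20000)

# Protocol A: filter first, then split. The test set contains only extreme genes.
train_f, test_f = split(middle_exclusion(genes, -1.0, 1.0), 0.2)
acc_filtered = accuracy(fit_threshold(train_f), test_f)

# Protocol B: split first, filter the training set only, test on unfiltered genes.
train_u, test_u = split(genes, 0.2)
acc_unfiltered = accuracy(fit_threshold(middle_exclusion(train_u, -1.0, 1.0)), test_u)

print(f"filtered test set:   {acc_filtered:.3f}")
print(f"unfiltered test set: {acc_unfiltered:.3f}")
```

In this toy setting the classifier is essentially the same under both protocols; only the composition of the test set differs. That is the distinction the abstract draws between boosting a metric and boosting model performance.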

Award ID(s): 1747788; 1920920; 2125872

PAR ID: 10542691

Author(s) / Creator(s): Miller, Jason R.; Yi, Weijun; Adjeroh, Donald A.

Publisher / Repository: Oxford University Press

Date Published: 2024-09-18

Journal Name: NAR Genomics and Bioinformatics

Volume: 6

Issue: 3

ISSN: 2631-9268

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1093/nargab/lqae125
