Abstract: The lncATLAS database quantifies the relative cytoplasmic versus nuclear abundance of long non-coding RNAs (lncRNAs) observed in 15 human cell lines. The literature describes several machine learning models trained and evaluated on these and similar datasets. These reports showed moderate performance, e.g. 72–74% accuracy, on test subsets of the data withheld from training. In all these reports, the datasets were filtered to include genes with extreme values while excluding genes with values in the middle range, and the filters were applied before the data were partitioned into training and testing subsets. Using several models and lncATLAS data, we show that this ‘middle exclusion’ protocol boosts performance metrics without boosting model performance on unfiltered test data. We show that various models achieve only about 60% accuracy when evaluated on unfiltered lncRNA data. We suggest that the problem of predicting lncRNA subcellular localization from nucleotide sequences is more challenging than currently perceived. We provide a basic model and evaluation procedure as a benchmark for future studies of this problem.
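The effect of middle exclusion on a reported metric can be illustrated with a small synthetic sketch. Nothing here comes from lncATLAS or from the models in the abstract: the data are random, the "predictor" is simply the sign of a noisy feature, and the cutoffs of -1.0 and 1.0 as well as the function names are hypothetical choices made for illustration only.

```python
import random

def make_records(n, noise_sd, rng):
    """Synthetic (feature, value) pairs; the feature is a noisy copy of the value."""
    records = []
    for _ in range(n):
        value = rng.gauss(0.0, 1.0)          # stand-in for a log-ratio localization value
        feature = value + rng.gauss(0.0, noise_sd)
        records.append((feature, value))
    return records

def middle_exclusion(records, low, high):
    """Keep only records whose value lies outside the open interval (low, high)."""
    return [(f, v) for f, v in records if v <= low or v >= high]

def accuracy(records):
    """Fraction of records where the sign of the feature matches the sign of the value."""
    hits = sum((f > 0) == (v > 0) for f, v in records)
    return hits / len(records)

rng = random.Random(0)
test = make_records(2000, noise_sd=2.0, rng=rng)   # held-out data, left unfiltered

acc_unfiltered = accuracy(test)
acc_filtered = accuracy(middle_exclusion(test, -1.0, 1.0))  # hypothetical cutoffs
print(f"unfiltered test accuracy:      {acc_unfiltered:.3f}")
print(f"middle-excluded test accuracy: {acc_filtered:.3f}")
```

The same fixed predictor scores higher on the middle-excluded subset simply because the ambiguous cases were removed, which is why a filter applied before partitioning can inflate test metrics.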
                    This content will become publicly available on August 1, 2026
                            
                            LncRNA Subcellular Localization Across Diverse Cell Lines: An Exploration Using Deep Learning with Inexact q-mers
                        
                    
    
Background: Long non-coding ribonucleic acids (lncRNAs) can be localized to different cellular compartments, such as the nuclear and the cytoplasmic regions. Their biological functions are influenced by the region of the cell where they are located. Compared to the vast number of lncRNAs, only a relatively small proportion have annotations regarding their subcellular localization. It would be helpful if those few annotated lncRNAs could be leveraged to develop predictive models for the localization of other lncRNAs.

Methods: Conventional computational methods derive q-mer profiles from lncRNA sequences and use them to train machine learning models such as support vector machines and logistic regression. These methods consider only exact q-mers. Given possible sequence mutations and other uncertainties in genomic sequences and their role in biological function, accounting for this variability might improve our ability to model lncRNAs and their localization. Thus, we build on inexact q-mers and use machine learning and deep learning techniques to study three specific problems in lncRNA subcellular localization: prediction of lncRNA localization using inexact q-mers, the question of whether lncRNA localization is cell-type-specific, and the notion of switching lncRNA genes.

Results: We performed our analysis using data on lncRNA localization across 15 cell lines. Our results showed that using inexact q-mers (with q = 6) can improve lncRNA localization prediction performance compared to using exact q-mers. Further, we showed that lncRNA localization, in general, is not cell-line-specific. We also identified a category of lncRNAs that switch cellular compartments between different cell lines (we call them switching lncRNAs). These switching lncRNAs complicate the problem of predicting lncRNA localization using machine learning models, showing that lncRNA localization prediction is still a major challenge.
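The difference between exact and inexact q-mer profiles can be sketched as follows. The abstract does not define "inexact" precisely, so this sketch assumes one common reading: each sequence window also counts toward every q-mer that differs from it by a single substitution (Hamming distance 1). The function names and that mismatch rule are assumptions, not the authors' method.

```python
from collections import Counter

ALPHABET = "ACGT"

def exact_profile(seq, q):
    """Count each length-q window of seq exactly as it appears."""
    return Counter(seq[i:i + q] for i in range(len(seq) - q + 1))

def neighbors(qmer):
    """Yield qmer itself and every q-mer that differs by one substitution."""
    yield qmer
    for i, original in enumerate(qmer):
        for base in ALPHABET:
            if base != original:
                yield qmer[:i] + base + qmer[i + 1:]

def inexact_profile(seq, q):
    """Count each window toward itself and its one-mismatch neighbors (assumed rule)."""
    profile = Counter()
    for i in range(len(seq) - q + 1):
        window = seq[i:i + q]
        if set(window) <= set(ALPHABET):     # skip windows with ambiguous bases such as N
            for qmer in neighbors(window):
                profile[qmer] += 1
    return profile

seq = "ACGTAC"
exact = exact_profile(seq, 3)
inexact = inexact_profile(seq, 3)
print(len(exact), "distinct exact 3-mers;", len(inexact), "distinct inexact 3-mers")
```

Either profile can then be normalized and passed as a fixed-length feature vector (4^q entries) to a classifier; the inexact version spreads each observation over nearby q-mers, which makes the profile less sensitive to single-base variation.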
        
    
    
- PAR ID: 10634928
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Non-Coding RNA
- Volume: 11
- Issue: 4
- ISSN: 2311-553X
- Page Range / eLocation ID: 49
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- SEquence Evaluation through k-mer Representation (SEEKR) is a method of sequence comparison that uses sequence substrings called k-mers to quantify the nonlinear similarity between nucleic acid species. We describe the development of new functions within SEEKR that enable end-users to estimate P-values that ascribe statistical significance to SEEKR-derived similarities, as well as visualize different aspects of k-mer similarity. We apply the new functions to identify chromatin-enriched lncRNAs that contain XIST-like sequence features, and we demonstrate the utility of applying SEEKR on lncRNA fragments to identify potential RNA-protein interaction domains. We also highlight ways in which SEEKR can be applied to augment studies of lncRNA conservation, and we outline the best practice of visualizing RNA-seq read density to evaluate support for lncRNA annotations before their in-depth study in cell types of interest.
- Long noncoding RNA (lncRNA) plays key roles in tumorigenesis. Misexpression of lncRNA can lead to changes in the expression profiles of various target genes, which are involved in cancer initiation and progression. Identifying key lncRNAs for a cancer would therefore help in developing cancer therapies. Usually, identifying key lncRNAs for a cancer requires expression profiles of lncRNAs for both normal and cancer samples, but this kind of data is not available for all cancers. In the present study, a computational framework is developed to identify cancer-specific key lncRNAs using the lncRNA expression of cancer patients only. The framework consists of two state-of-the-art feature selection techniques, Recursive Feature Elimination (RFE) and Least Absolute Shrinkage and Selection Operator (LASSO), and five machine learning models: Naive Bayes, K-Nearest Neighbor, Random Forest, Support Vector Machine, and Deep Neural Network. For the experiments, expression values of lncRNAs for 8 cancers (BLCA, CESC, COAD, HNSC, KIRP, LGG, LIHC, and LUAD) from TCGA are used. The combined dataset consists of 3,656 patients with expression values of 12,309 lncRNAs. Important features, or key lncRNAs, are identified using the feature selection algorithms RFE and LASSO. The capability of these key lncRNAs to classify the 8 different cancers is checked by the performance of the five classification models. This study identified 37 key lncRNAs that can classify 8 different cancer types with an accuracy ranging from 94% to 97%. Finally, survival analysis supports that the discovered key lncRNAs are capable of differentiating between high-risk and low-risk patients.
- Mammalian cells process information through coordinated spatiotemporal regulation of proteins. Engineering cellular networks thus relies on efficient tools for regulating protein levels in specific subcellular compartments. To address the need to manipulate the extent and dynamics of protein localization, we developed a platform technology for the target-specific control of protein destination. This platform is based on bifunctional molecules comprising a target-specific nanobody and universal sequences determining target subcellular localization or degradation rate. We demonstrate that nanobody-mediated localization depends on the expression level of the target and the nanobody, and the extent of target subcellular localization can be regulated by combining multiple target-specific nanobodies with distinct localization or degradation sequences. We also show that this platform for nanobody-mediated target localization and degradation can be regulated transcriptionally and integrated within orthogonal genetic circuits to achieve the desired temporal control over spatial regulation of target proteins. The platform reported in this study provides an innovative tool to control protein subcellular localization, which will be useful to investigate protein function and regulate large synthetic gene circuits.
- The cell cycle is a cellular process that is subject to stringent control. In contrast to the wealth of knowledge of proteins controlling the cell cycle, very little is known about the molecular role of lncRNAs (long noncoding RNAs) in cell-cycle progression. By performing genome-wide transcriptome analyses in cell-cycle-synchronized cells, we observed cell-cycle phase-specific induction of >2000 lncRNAs. Further, we demonstrate that an S-phase-upregulated lncRNA, SUNO1, facilitates cell-cycle progression by promoting YAP1-mediated gene expression. SUNO1 facilitates the cell-cycle-specific transcription of WTIP, a positive regulator of YAP1, by promoting the co-activator, DDX5-mediated stabilization of RNA polymerase II on chromatin. Finally, elevated SUNO1 levels are associated with poor cancer prognosis and tumorigenicity, implying its pro-survival role. Thus, we demonstrate the role of an S-phase-upregulated lncRNA in cell-cycle progression via modulating the expression of genes controlling cell proliferation.
 An official website of the United States government
