-
Graph-based genome representations have proven to be a powerful tool in genomic analysis due to their ability to encode variations found in multiple haplotypes and capture population genetic diversity. Such graphs also unavoidably contain paths that switch between haplotypes (i.e., recombinant paths) and thus do not fully match any of the constituent haplotypes. The number of such recombinant paths grows combinatorially with path length, causing inefficiencies and false positives when mapping reads. In this paper, we study the problem of finding reduced haplotype-aware genome graphs that incorporate only a selected subset of variants, yet contain paths corresponding to all α-long substrings of the input haplotypes (i.e., non-recombinant paths) with at most δ mismatches. Solving this problem optimally, i.e., minimizing the number of variants selected, was previously shown to be NP-hard. Here, we first establish several inapproximability results for finding haplotype-aware reduced variation graphs of optimal size. We then present an integer linear programming (ILP) formulation for solving the problem, and experimentally demonstrate that it is a computationally feasible approach for real-world problems and provides far superior reduction compared to prior approaches.
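As a concrete illustration of the kind of ILP formulation mentioned above, the following is a minimal sketch in Python using the PuLP solver, restricted to the SNP case where dropping a variant costs one mismatch in every α-window covering it. The toy haplotypes, the ALPHA/DELTA values, and the window enumeration are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of the variant-selection ILP described above, using PuLP.
# Assumptions (not from the paper): variants are SNPs identified by their
# positions, each haplotype is a list of the variant positions it carries,
# and dropping a SNP costs exactly one mismatch in every alpha-window
# covering it. Names like `haplotypes`, ALPHA, DELTA are illustrative.
import pulp

ALPHA, DELTA = 100, 1          # window length and mismatch budget
haplotypes = [                 # toy data: variant positions per haplotype
    [5, 40, 90, 130],
    [5, 60, 90, 170],
]
variants = sorted({p for hap in haplotypes for p in hap})

prob = pulp.LpProblem("variant_selection", pulp.LpMinimize)
x = {v: pulp.LpVariable(f"x_{v}", cat="Binary") for v in variants}

# Objective: select as few variants as possible.
prob += pulp.lpSum(x.values())

# For every alpha-long window of every haplotype, the variants it carries
# that we drop (1 - x_v) must not exceed the mismatch budget delta.
for hap in haplotypes:
    for start in range(0, max(hap) + 1):
        window = [p for p in hap if start <= p < start + ALPHA]
        if window:
            prob += pulp.lpSum(1 - x[p] for p in window) <= DELTA

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [v for v in variants if x[v].value() == 1]
print("selected variants:", selected)
```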
-
Abstract The coronavirus pandemic has already caused severe problems for humanity and the economy. The exact impact of the COVID-19 pandemic is still unknown, and economists and financial advisers are exploring all possible scenarios to mitigate the risks arising from the pandemic. An intriguing question is whether, and to what extent, this pandemic and its impacts resemble other catastrophic events of the past, such as the 2009 Great Recession. This paper addresses this question by analyzing official public announcements and statements issued by federal authorities such as the Federal Reserve. More specifically, we measure the similarity of consecutive statements issued by the Federal Reserve during the 2009 Great Recession and the COVID-19 pandemic using natural language processing techniques. Furthermore, we explore the use of document embedding representations of the statements in a more complex task: clustering. Our analysis shows that, using an advanced NLP document-embedding technique such as Doc2Vec, we can detect a 10.8% difference in the similarities of Federal Open Market Committee (FOMC) statements issued during the Great Recession (2007–2009) and the COVID-19 pandemic. Finally, the results of our clustering exercise show that the document embedding representations of the statements are suitable for more complex tasks, which provides a basis for future applications of state-of-the-art natural language processing techniques using the FOMC post-meeting statements as the dataset.
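As an illustration of the similarity measurement described in the abstract, the following is a minimal sketch using gensim's Doc2Vec. The toy statements, vector size, and epoch count are assumptions; the paper's actual corpus of FOMC post-meeting statements is not reproduced here.

```python
# A minimal sketch of Doc2Vec-based similarity of consecutive statements,
# using gensim. The toy statements and hyperparameters are illustrative
# assumptions, not the paper's experimental configuration.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

statements = [
    "The Committee decided to lower the target range for the federal funds rate.",
    "The Committee decided to maintain the target range for the federal funds rate.",
    "Economic activity has continued to expand at a moderate pace.",
]
corpus = [TaggedDocument(simple_preprocess(s), [i]) for i, s in enumerate(statements)]

model = Doc2Vec(vector_size=50, min_count=1, epochs=40)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Cosine similarity between consecutive statements, mirroring the setup above.
for i in range(len(statements) - 1):
    sim = model.dv.similarity(i, i + 1)
    print(f"similarity(statement {i}, statement {i + 1}) = {sim:.3f}")
```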
-
Abstract Motivation Variation graph representations are projected to either replace or supplement conventional single-genome references due to their ability to capture population genetic diversity and reduce reference bias. Vast catalogues of genetic variants now exist for many species, and it is natural to ask which among these are crucial to circumvent reference bias during read mapping. Results In this work, we propose a novel mathematical framework for variant selection, by casting it in terms of minimizing variation graph size subject to preserving paths of length α with at most δ differences. This framework leads to a rich set of problems based on the types of variants [e.g. single nucleotide polymorphisms (SNPs), indels or structural variants (SVs)] and on whether the goal is to minimize the number of positions at which variants are listed or the total number of variants listed. We classify the computational complexity of these problems and, where feasible, provide efficient algorithms along with their software implementation. We empirically evaluate the magnitude of graph reduction achieved in human chromosome variation graphs using multiple α and δ parameter values corresponding to short- and long-read resequencing characteristics. When our algorithm is run with parameter settings amenable to long-read mapping (α = 10 kbp, δ = 1000), 99.99% of SNPs and 73% of SVs can be safely excluded from the human chromosome 1 variation graph. The graph size reduction can benefit downstream pan-genome analysis. Availability and implementation https://github.com/AT-CG/VF. Supplementary information Supplementary data are available at Bioinformatics online.
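To make the preservation criterion concrete, the following is a minimal Python sketch of the underlying feasibility check for the SNP case: a retained subset is safe if no α-long window of any haplotype accumulates more than δ mismatches from dropped variants. This is an illustrative reading of the framework, not the paper's algorithm; see the repository above for the actual implementation.

```python
# A minimal sketch (an assumption, not the paper's algorithm) of the
# feasibility check behind the framework above: a set of retained variants
# is "safe" if every alpha-long window of every haplotype contains at most
# delta dropped variants, i.e., a read drawn from that window still matches
# a graph path with at most delta differences. SNPs with unit mismatch cost
# are assumed; indels and SVs would need edit-distance-aware costs.
from bisect import bisect_left, bisect_right

def is_safe(haplotypes, retained, alpha, delta):
    """haplotypes: list of sorted SNP-position lists; retained: set of kept positions."""
    for hap in haplotypes:
        dropped = [p for p in hap if p not in retained]  # sorted, since hap is sorted
        # A worst-case window can be assumed to start at a dropped position.
        for p in dropped:
            lo = bisect_left(dropped, p)
            hi = bisect_right(dropped, p + alpha - 1)
            if hi - lo > delta:
                return False
    return True

# Toy usage with alpha = 100, delta = 1.
haps = [[5, 40, 90, 130], [5, 60, 90, 170]]
print(is_safe(haps, retained={90}, alpha=100, delta=1))         # False: window [5, 104] drops both 5 and 40
print(is_safe(haps, retained={5, 40, 90}, alpha=100, delta=1))  # True
```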
-
Machine and deep learning-based algorithms are emerging approaches to prediction problems in time series, and these techniques have been shown to produce more accurate results than conventional regression-based modeling. In particular, it has been reported that artificial Recurrent Neural Networks (RNNs) with memory, such as Long Short-Term Memory (LSTM), outperform Autoregressive Integrated Moving Average (ARIMA) models by a large margin. LSTM-based models incorporate additional "gates" for the purpose of memorizing longer sequences of input data. The major question is whether the gates incorporated in the LSTM architecture already offer a good prediction, or whether additional training of the data would further improve it. Bidirectional LSTMs (BiLSTMs) enable additional training by traversing the input data twice (i.e., left-to-right and then right-to-left). The research question of interest is then whether BiLSTM, with its additional training capability, outperforms regular unidirectional LSTM. This paper reports a behavioral analysis and comparison of BiLSTM and LSTM models. The objective is to explore to what extent additional layers of training would be beneficial for tuning the involved parameters. The results show that the additional training of data, and thus BiLSTM-based modeling, offers better predictions than regular LSTM-based models. More specifically, BiLSTM models were observed to provide better predictions than both ARIMA and LSTM models. It was also observed that BiLSTM models reach equilibrium much more slowly than LSTM-based models.
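To make the architectural difference concrete, the following is a minimal Keras sketch contrasting a unidirectional LSTM with its bidirectional counterpart on a synthetic series. The data, window length, layer sizes, and epoch count are illustrative assumptions, not the paper's experimental setup.

```python
# A minimal sketch (illustrative, not the paper's experiments) of the
# LSTM vs. BiLSTM comparison described above, using Keras.
import numpy as np
import tensorflow as tf

# Build a toy univariate series and frame it as supervised sliding windows.
series = np.sin(np.linspace(0, 40, 800)).astype("float32")
WINDOW = 20
X = np.stack([series[i:i + WINDOW] for i in range(len(series) - WINDOW)])[..., None]
y = series[WINDOW:]

def make_model(bidirectional: bool) -> tf.keras.Model:
    lstm = tf.keras.layers.LSTM(32)
    # The only difference: the Bidirectional wrapper also traverses the
    # input sequence right-to-left, doubling the learned representation.
    layer = tf.keras.layers.Bidirectional(lstm) if bidirectional else lstm
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(WINDOW, 1)),
        layer,
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

for name, flag in [("LSTM", False), ("BiLSTM", True)]:
    model = make_model(flag)
    hist = model.fit(X, y, epochs=5, batch_size=32, verbose=0)
    print(f"{name}: final training MSE = {hist.history['loss'][-1]:.5f}")
```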
-
Forecasting time series data is an important subject in economics, business, and finance. Traditionally, several techniques exist to effectively forecast the next lag of a time series, such as univariate Autoregressive (AR) models, univariate Moving Average (MA) models, Simple Exponential Smoothing (SES), and, most notably, Autoregressive Integrated Moving Average (ARIMA) with its many variations. In particular, the ARIMA model has demonstrated strong precision and accuracy in predicting the next lags of a time series. With the recent advancement in the computational power of computers and, more importantly, the development of more advanced machine learning algorithms and approaches such as deep learning, new algorithms have been developed to analyze and forecast time series data. The research question investigated in this article is whether and how newly developed deep learning-based algorithms for forecasting time series data, such as Long Short-Term Memory (LSTM), are superior to the traditional algorithms. The empirical studies conducted and reported in this article show that deep learning-based algorithms such as LSTM outperform traditional algorithms such as the ARIMA model. More specifically, the average reduction in error rates obtained by LSTM was between 84 and 87 percent when compared with ARIMA, indicating the superiority of LSTM over ARIMA. Furthermore, it was noticed that the number of training passes, known as "epochs" in deep learning, had no effect on the performance of the trained forecast model and exhibited truly random behavior.
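For concreteness, the following is a minimal sketch of an ARIMA baseline of the kind compared in the article, using statsmodels. The synthetic random-walk series, the (1, 1, 1) order, and the train/test split are illustrative assumptions, not the article's data or configuration.

```python
# A minimal sketch (illustrative, not the article's setup) of an ARIMA
# baseline forecast evaluated by RMSE on a held-out horizon.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# Toy series: a random walk with drift, a common ARIMA test case.
series = np.cumsum(0.1 + rng.normal(size=200))
train, test = series[:180], series[180:]

# Fit ARIMA(p=1, d=1, q=1) and forecast the held-out horizon.
model = ARIMA(train, order=(1, 1, 1)).fit()
forecast = model.forecast(steps=len(test))

rmse = np.sqrt(np.mean((forecast - test) ** 2))
print(f"ARIMA(1,1,1) RMSE over {len(test)} held-out steps: {rmse:.4f}")
```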