State-space models (SSMs), such as Mamba (Gu & Dao, 2023), have been proposed as alternatives to Transformer networks in language modeling, incorporating gating, convolutions, and input-dependent token selection to mitigate the quadratic cost of multi-head attention. Although SSMs exhibit competitive performance, their in-context learning (ICL) capabilities, a remarkable emergent property of modern language models that enables task execution without parameter optimization, remain less explored compared to Transformers. In this study, we evaluate the ICL performance of SSMs, focusing on Mamba, against Transformer models across various tasks. Our results show that SSMs perform comparably to Transformers in standard regression ICL tasks, while outperforming them in tasks like sparse parity learning. However, SSMs fall short in tasks involving non-standard retrieval functionality. To address these limitations, we introduce a hybrid model, MambaFormer, that combines Mamba with attention blocks, surpassing individual models in tasks where they struggle independently. Our findings suggest that hybrid architectures offer promising avenues for enhancing ICL in language models.
more »
« less
Attention Is Not Enough
The human ability to generalize beyond interpolation, often called extrapolation or symbol-binding, is challenging to recreate with computational models. Biologically plausible models incorporating indirection mechanisms have demonstrated strong performance in this regard. Deep learning approaches such as Long Short-Term Memory (LSTM) and Transformers have shown varying degrees of success, but recent work has suggested that Transformers are capable of extrapolation as well. We evaluate the capabilities of the above approaches on a series of increasingly complex sentence-processing tasks to infer the capacity of each individual architecture to extrapolate sentential roles across novel word fillers. We confirm that the Transformer does possess superior abstraction capabilities compared to LSTM. However, what it does not possess is extrapolation capabilities, as evidenced by clear performance disparities on novel filler tasks as compared to working memory-based indirection models.
more »
« less
- Award ID(s):
- 1742518
- PAR ID:
- 10499907
- Publisher / Repository:
- eScholarship - University of California
- Date Published:
- Journal Name:
- Proceedings of the Annual Conference of the Cognitive Science Society
- ISSN:
- 1069-7977
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
For a number of years since their introduction to hydrology, recurrent neural networks like long short-term memory (LSTM) networks have proven remarkably difficult to surpass in terms of daily hydrograph metrics on community-shared benchmarks. Outside of hydrology, Transformers have now become the model of choice for sequential prediction tasks, making it a curious architecture to investigate for application to hydrology. Here, we first show that a vanilla (basic) Transformer architecture is not competitive against LSTM on the widely benchmarked CAMELS streamflow dataset, and lagged especially prominently for the high-flow metrics, perhaps due to the lack of memory mechanisms. However, a recurrence-free variant of the Transformer model can obtain mixed comparisons with LSTM, producing very slightly higher Kling-Gupta efficiency coefficients (KGE), along with other metrics. The lack of advantages for the vanilla Transformer network is linked to the nature of hydrologic processes. Additionally, similar to LSTM, the Transformer can also merge multiple meteorological forcing datasets to improve model performance. Therefore, the modified Transformer represents a rare competitive architecture to LSTM in rigorous benchmarks. Valuable lessons were learned: (1) the basic Transformer architecture is not suitable for hydrologic modeling; (2) the recurrence-free modification is beneficial so future work should continue to test such modifications; and (3) the performance of state-of-the-art models may be close to the prediction limits of the dataset. As a non-recurrent model, the Transformer may bear scale advantages for learning from bigger datasets and storing knowledge. This work lays the groundwork for future explorations into pretraining models, serving as a foundational benchmark that underscores the potential benefits in hydrology.more » « less
-
Social media cyberbullying has a detrimental effect on human life. As online social networking grows daily, the amount of hate speech also increases. Such terrible content can cause depression and actions related to suicide. This paper proposes a trustable LSTM Autoencoder Network for cyberbullying detection on social media using synthetic data. We have demonstrated a cutting-edge method to address data availability difficulties by producing machine-translated data. However, several languages such as Hindi and Bangla still lack adequate investigations due to a lack of datasets. We carried out experimental identification of aggressive comments on Hindi, Bangla, and English datasets using the proposed model and traditional models, including Long Short-Term Memory (LSTM), Bidirectional Long Short-Term Memory (BiLSTM), LSTM-Autoencoder, Word2vec, Bidirectional Encoder Representations from Transformers (BERT), and Generative Pre-trained Transformer 2 (GPT-2) models. We employed evaluation metrics such as f1-score, accuracy, precision, and recall to assess the models’ performance. Our proposed model outperformed all the models on all datasets, achieving the highest accuracy of 95%. Our model achieves state-of-the-art results among all the previous works on the dataset we used in this paper.more » « less
-
Data fusion techniques have gained special interest in remote sensing due to the available capabilities to obtain measurements from the same scene using different instruments with varied resolution domains. In particular, multispectral (MS) and hyperspectral (HS) imaging fusion is used to generate high spatial and spectral images (HSEI). Deep learning data fusion models based on Long Short Term Memory (LSTM) and Convolutional Neural Networks (CNN) have been developed to achieve such task.In this work, we present a Multi-Level Propagation Learning Network (MLPLN) based on a LSTM model but that can be trained with variable data sizes in order achieve the fusion process. Moreover, the MLPLN provides an intrinsic data augmentation feature that reduces the required number of training samples. The proposed model generates a HSEI by fusing a high-spatial resolution MS image and a low spatial resolution HS image. The performance of the model is studied and compared to existing CNN and LSTM approaches by evaluating the quality of the fused image using the structural similarity metric (SSIM). The results show that an increase in the SSIM is still obtained while reducing of the number of training samples to train the MLPLN model.more » « less
-
A DRAM-based Near-Memory Architecture for Accelerated and Energy-Efficient Execution of TransformersTransformers-based language models have achieved remarkable accuracy in various NLP tasks, employing self-attention mecha- nisms primarily based on matrix multiplication. However, their significant size leads to data movement issues, causing latency and energy efficiency challenges in conventional Von-Neumann systems. To mitigate these issues, several in-memory and near- memory architectures have been proposed. This paper introduces PACT-3D, a near-memory architecture featuring novel computing units integrated with DRAM banks. PACT-3D significantly reduces latency by 1.7× and improves energy efficiency by 18.7× compared to state-of-the-art near-memory architectures.more » « less
An official website of the United States government

