This content will become publicly available on May 5, 2026

Title: Large Language Model can Reduce the Necessity of Using Large Data Samples for Training Models
This work introduces a novel approach to improving cybersecurity systems, focusing on spam email-based cyberattacks. The proposed technique tackles the challenge of training Machine Learning (ML) models with limited data samples by leveraging Bidirectional Encoder Representations from Transformers (BERT) for contextualized embeddings. Unlike traditional embedding methods, BERT offers a nuanced representation of smaller datasets, enabling more effective ML model training. The methodology uses several pre-trained BERT models to generate contextualized embeddings from the data samples, and these embeddings are then fed to various ML algorithms for effective training. This approach demonstrates that even with scarce data, BERT embeddings significantly enhance model performance compared to conventional embedding approaches such as Word2Vec. The technique proves especially advantageous when only a small number of high-quality instances is available. The proposed approach outperforms traditional techniques for mitigating phishing attacks with few data samples, achieving a robust accuracy of 99.25% when multilingual BERT (M-BERT) is used to embed the dataset.
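A minimal sketch of the pipeline described above, assuming the Hugging Face transformers library, the bert-base-multilingual-cased checkpoint as a stand-in for M-BERT, scikit-learn for the downstream classifier, and toy spam/legitimate examples; the paper's actual dataset, models, and hyperparameters are not reproduced here.

```python
# Sketch: contextualized M-BERT embeddings feeding a classical ML classifier.
# Assumptions: transformers + scikit-learn installed; toy data only.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

texts = ["Win a free prize, click this link now!", "Meeting moved to 3 pm tomorrow."]
labels = [1, 0]  # 1 = spam/phishing, 0 = legitimate (toy labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    outputs = encoder(**batch)
    # Use the [CLS] token as a fixed-size contextualized embedding per email.
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

# Any classical ML algorithm can now be trained on the embeddings.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(clf.predict(embeddings))
```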
Award ID(s):
2433800 1946442
PAR ID:
10621354
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3315-2400-5
Page Range / eLocation ID:
988 to 991
Subject(s) / Keyword(s):
Self Attention; Multi-Head Attention; Transformer; Embedding
Format(s):
Medium: X
Location:
Santa Clara, CA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Although software developers of mHealth apps are responsible for protecting patient data and adhering to strict privacy and security requirements, many of them lack awareness of HIPAA regulations and struggle to distinguish between HIPAA rule categories. Providing guidance on HIPAA rule pattern classification is therefore essential for developing secure applications for the Google Play Store. In this work, we identified the limitations of traditional Word2Vec embeddings in processing code patterns. To address this, we adopt multilingual BERT (Bidirectional Encoder Representations from Transformers), which provides contextualized embeddings of the dataset attributes to overcome these issues. We applied this BERT model to our dataset to embed code patterns and then fed these embeddings to various machine learning approaches. Our results demonstrate that the models significantly enhance classification performance, with Logistic Regression achieving a remarkable accuracy of 99.95%. Additionally, we obtained high accuracy from Support Vector Machine (99.79%), Random Forest (99.73%), and Naive Bayes (95.93%), outperforming existing approaches. This work underscores the effectiveness of the approach and showcases its potential for secure application development.
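A minimal sketch of the classifier comparison this abstract describes, assuming scikit-learn and a synthetic feature matrix standing in for precomputed BERT embeddings; the HIPAA code-pattern dataset, labels, and hyperparameters are placeholders, not the paper's setup.

```python
# Sketch: benchmark several classical classifiers on BERT-embedded features.
# X and y are synthetic stand-ins for embedded code patterns and HIPAA rule labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # pretend M-BERT embeddings
y = rng.integers(0, 2, size=200)   # pretend HIPAA rule categories

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```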
  2. Machine learning (ML) and deep learning (DL) techniques are increasingly applied to produce efficient query optimizers, in particular for big data systems. The optimization of spatial operations is even more challenging due to the inherent complexity of such operations, like spatial join or range query, and the peculiarities of spatial data. Although a few ML-based spatial query optimizers have been proposed in the literature, their design limits their use, since each one is tailored to a specific collection of datasets, a specific operation, or a specific hardware setting. Changes to any of these require building and training a completely new model, which entails collecting a new, very large training dataset to obtain a good model. This paper proposes a different approach which exploits the novel notion of spatial embedding to overcome these limitations. In particular, a preliminary model is defined which captures the relevant features of spatial datasets, independently from the operation to be optimized and in an unsupervised manner. This model is trained with a large amount of both synthetic and real-world data, with the aim of producing meaningful spatial embeddings. The construction of an embedding model can be viewed as a preliminary step toward the optimization of many different spatial operations, so the cost of building it can be amortized during the subsequent construction of specific models. Indeed, for each considered spatial operation, a specific tailored model is trained, but using spatial embeddings as input, so only a small number of training data points is required. Three operations are considered as a proof of concept in this paper: range query, self-join, and binary spatial join. Finally, a comparison with an alternative technique, known as transfer learning, is provided and the advantages of the proposed technique over it are highlighted.
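The reuse of dataset-level embeddings by small per-operation models can be sketched roughly as follows; the spatial embedding model, feature set, and downstream estimator here are all hypothetical placeholders and do not reproduce the paper's architecture.

```python
# Sketch: a pretrained "spatial embedding" per dataset (stand-in vectors) feeding a
# small per-operation model, e.g., a range-query cost/selectivity estimator.
# Everything below is hypothetical illustration, not the paper's design.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
dataset_embeddings = rng.normal(size=(300, 32))  # pretend unsupervised spatial embeddings
query_features = rng.uniform(size=(300, 4))      # pretend range-query extents
selectivity = rng.uniform(size=300)              # pretend measured selectivities

# Because the embeddings already summarize each dataset, the per-operation model
# can stay small and be trained on relatively few labeled examples.
X = np.hstack([dataset_embeddings, query_features])
estimator = GradientBoostingRegressor().fit(X[:100], selectivity[:100])
print(estimator.predict(X[:3]))
```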
  3. Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining this text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
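A minimal sketch of scoring query-document relevance with a BERT cross-encoder, assuming the Hugging Face transformers library and a base checkpoint with an untrained scoring head; the paper's fine-tuned model and ranking pipeline are not reproduced, and the score below is only meaningful after fine-tuning on relevance data.

```python
# Sketch: BERT cross-encoder relevance scoring (query and document encoded jointly).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # stand-in; a real ranker would be fine-tuned on search data
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

query = "effects of caffeine on sleep"
document = "Caffeine consumed late in the day can delay sleep onset and reduce sleep quality."

# BERT attends over the query and document together, so each token's
# representation is contextualized by the other text.
inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)  # relevance score; requires fine-tuning to be informative
```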
  4. Computational models of verbal analogy and relational similarity judgments can employ different types of vector representations of word meanings (embeddings) generated by machine-learning algorithms. An important question is whether human-like relational processing depends on explicit representations of relations (i.e., representations separable from those of the concepts being related), or whether implicit relation representations suffice. Earlier machine-learning models produced static embeddings for individual words, identical across all contexts. However, more recent Large Language Models (LLMs), which use transformer architectures applied to much larger training corpora, are able to produce contextualized embeddings that have the potential to capture implicit knowledge of semantic relations. Here we compare multiple models based on different types of embeddings to human data concerning judgments of relational similarity and solutions of verbal analogy problems. For two datasets, a model that learns explicit representations of relations, Bayesian Analogy with Relational Transformations (BART), captured human performance more successfully than either a model using static embeddings (Word2vec) or models using contextualized embeddings created by LLMs (BERT, RoBERTa, and GPT-2). These findings support the proposal that human thinking depends on representations that separate relations from the concepts they relate. 
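As a generic illustration of the "implicit relation representation" idea that static embeddings support, the relation between two words can be approximated by their embedding difference vector and compared across word pairs with cosine similarity; this sketch assumes gensim's downloadable GloVe vectors as a stand-in for Word2vec and is not the BART model or the paper's procedure.

```python
# Sketch: implicit relation representations as embedding difference vectors.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained static vectors (downloads on first use)

def relation(a, b):
    """Represent the relation a:b implicitly as the difference of the word embeddings."""
    return vectors[b] - vectors[a]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Relational similarity: how alike are the relations king:queen and man:woman?
print(cosine(relation("king", "queen"), relation("man", "woman")))
# Compare against an unrelated pair for contrast.
print(cosine(relation("king", "queen"), relation("car", "wheel")))
```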
  5. Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder. 
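A minimal sketch of an in-batch contrastive (InfoNCE) objective for a biencoder, where building each batch from neighboring documents is one way neighbor information could enter the loss; the batch construction, encoders, temperature, and shapes below are assumptions for illustration, not the paper's exact objective or architecture.

```python
# Sketch: in-batch contrastive loss for a biencoder. If the batch is assembled from
# a document's corpus neighborhood (e.g., topically clustered), the in-batch
# negatives become "contextual" negatives drawn from nearby documents.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings (batch of 8, dimension 768).
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, d))
```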