This content will become publicly available on May 5, 2026

Title: Large Language Model can Reduce the Necessity of Using Large Data Samples for Training Models
This work introduces a novel approach to improving cybersecurity systems, focusing on spam email-based cyberattacks. The proposed technique tackles the challenge of training Machine Learning (ML) models with limited data samples by leveraging Bidirectional Encoder Representations from Transformers (BERT) for contextualized embeddings. Unlike traditional embedding methods, BERT offers a nuanced representation of smaller datasets, enabling more effective ML model training. The methodology uses several pre-trained BERT models to generate contextualized embeddings from the data samples, and these embeddings are then fed to various ML algorithms for effective training. This approach demonstrates that even with scarce data, BERT embeddings significantly enhance model performance compared to conventional embedding approaches such as Word2Vec. The technique proves especially advantageous when only a small number of high-quality instances is available. The proposed approach outperforms traditional techniques for mitigating phishing attacks with few data samples, achieving a robust accuracy of 99.25% when multilingual BERT (M-BERT) is used to embed the dataset.
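A minimal sketch of the pipeline described above, assuming the Hugging Face transformers library, the bert-base-multilingual-cased checkpoint as a stand-in for M-BERT, scikit-learn for the downstream classifier, and toy spam/legitimate examples; the paper's actual dataset, models, and hyperparameters are not reproduced here.

```python
# Sketch: contextualized M-BERT embeddings feeding a classical ML classifier.
# Assumptions: transformers + scikit-learn installed; toy data only.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

texts = ["Win a free prize, click this link now!", "Meeting moved to 3 pm tomorrow."]
labels = [1, 0]  # 1 = spam/phishing, 0 = legitimate (toy labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
encoder = AutoModel.from_pretrained("bert-base-multilingual-cased")

with torch.no_grad():
    batch = tokenizer(texts, padding=True, truncation=True, max_length=128,
                      return_tensors="pt")
    outputs = encoder(**batch)
    # Use the [CLS] token as a fixed-size contextualized embedding per email.
    embeddings = outputs.last_hidden_state[:, 0, :].numpy()

# Any classical ML algorithm can now be trained on the embeddings.
clf = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print(clf.predict(embeddings))
```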
Award ID(s):
2433800 1946442
PAR ID:
10621354
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3315-2400-5
Page Range / eLocation ID:
988 to 991
Subject(s) / Keyword(s):
Self Attention; Multi-Head Attention; Transformer; Embedding
Format(s):
Medium: X
Location:
Santa Clara, CA, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Although software developers of mHealth apps are responsible for protecting patient data and adhering to strict privacy and security requirements, many of them lack awareness of HIPAA regulations and struggle to distinguish between HIPAA rule categories. Providing guidance on HIPAA rule pattern classification is therefore essential for developing secure applications for the Google Play Store. In this work, we identified the limitations of traditional Word2Vec embeddings in processing code patterns. To address this, we adopt multilingual BERT (Bidirectional Encoder Representations from Transformers), which provides contextualized embeddings of the dataset attributes to overcome these issues. We applied this BERT model to our dataset to embed code patterns and then fed these embeddings to various machine learning approaches. Our results demonstrate that the models significantly enhance classification performance, with Logistic Regression achieving a remarkable accuracy of 99.95%. Additionally, we obtained high accuracy from Support Vector Machine (99.79%), Random Forest (99.73%), and Naive Bayes (95.93%), outperforming existing approaches. This work underscores the effectiveness of the approach and showcases its potential for secure application development.
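A minimal sketch of the classifier comparison this abstract describes, assuming scikit-learn and a synthetic feature matrix standing in for precomputed BERT embeddings; the HIPAA code-pattern dataset, labels, and hyperparameters are placeholders, not the paper's setup.

```python
# Sketch: benchmark several classical classifiers on BERT-embedded features.
# X and y are synthetic stand-ins for embedded code patterns and HIPAA rule labels.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))    # pretend M-BERT embeddings
y = rng.integers(0, 2, size=200)   # pretend HIPAA rule categories

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, model.predict(X_te)))
```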
  2. Machine learning (ML) and deep learning (DL) techniques are increasingly applied to produce efficient query optimizers, in particular for big data systems. The optimization of spatial operations is even more challenging due to the inherent complexity of such operations, like spatial join or range query, and the peculiarities of spatial data. Although a few ML-based spatial query optimizers have been proposed in the literature, their design limits their use, since each one is tailored to a specific collection of datasets, a specific operation, or a specific hardware setting. Changes to any of these require building and training a completely new model, which entails collecting a new, very large training dataset to obtain a good model. This paper proposes a different approach which exploits the novel notion of spatial embedding to overcome these limitations. In particular, a preliminary model is defined which captures the relevant features of spatial datasets, independently from the operation to be optimized and in an unsupervised manner. This model is trained with a large amount of both synthetic and real-world data, with the aim of producing meaningful spatial embeddings. The construction of an embedding model can be viewed as a preliminary step toward the optimization of many different spatial operations, so the cost of building it can be amortized during the subsequent construction of specific models. Indeed, for each considered spatial operation, a specific tailored model is trained, but using spatial embeddings as input, so only a small number of training data points is required. Three operations are considered as a proof of concept in this paper: range query, self-join, and binary spatial join. Finally, a comparison with an alternative technique, known as transfer learning, is provided and the advantages of the proposed technique over it are highlighted.
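The reuse of dataset-level embeddings by small per-operation models can be sketched roughly as follows; the spatial embedding model, feature set, and downstream estimator here are all hypothetical placeholders and do not reproduce the paper's architecture.

```python
# Sketch: a pretrained "spatial embedding" per dataset (stand-in vectors) feeding a
# small per-operation model, e.g., a range-query cost/selectivity estimator.
# Everything below is hypothetical illustration, not the paper's design.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
dataset_embeddings = rng.normal(size=(300, 32))  # pretend unsupervised spatial embeddings
query_features = rng.uniform(size=(300, 4))      # pretend range-query extents
selectivity = rng.uniform(size=300)              # pretend measured selectivities

# Because the embeddings already summarize each dataset, the per-operation model
# can stay small and be trained on relatively few labeled examples.
X = np.hstack([dataset_embeddings, query_features])
estimator = GradientBoostingRegressor().fit(X[:100], selectivity[:100])
print(estimator.predict(X[:3]))
```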
  3. Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural language. Combining this text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
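A minimal sketch of scoring query-document relevance with a BERT cross-encoder, assuming the Hugging Face transformers library and a base checkpoint with an untrained scoring head; the paper's fine-tuned model and ranking pipeline are not reproduced, and the score below is only meaningful after fine-tuning on relevance data.

```python
# Sketch: BERT cross-encoder relevance scoring (query and document encoded jointly).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # stand-in; a real ranker would be fine-tuned on search data
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

query = "effects of caffeine on sleep"
document = "Caffeine consumed late in the day can delay sleep onset and reduce sleep quality."

# BERT attends over the query and document together, so each token's
# representation is contextualized by the other text.
inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()
print(score)  # relevance score; requires fine-tuning to be informative
```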
  4. Computational models of verbal analogy and relational similarity judgments can employ different types of vector representations of word meanings (embeddings) generated by machine-learning algorithms. An important question is whether human-like relational processing depends on explicit representations of relations (i.e., representations separable from those of the concepts being related), or whether implicit relation representations suffice. Earlier machine-learning models produced static embeddings for individual words, identical across all contexts. However, more recent Large Language Models (LLMs), which use transformer architectures applied to much larger training corpora, are able to produce contextualized embeddings that have the potential to capture implicit knowledge of semantic relations. Here we compare multiple models based on different types of embeddings to human data concerning judgments of relational similarity and solutions of verbal analogy problems. For two datasets, a model that learns explicit representations of relations, Bayesian Analogy with Relational Transformations (BART), captured human performance more successfully than either a model using static embeddings (Word2vec) or models using contextualized embeddings created by LLMs (BERT, RoBERTa, and GPT-2). These findings support the proposal that human thinking depends on representations that separate relations from the concepts they relate. 
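As a generic illustration of the "implicit relation representation" idea that static embeddings support, the relation between two words can be approximated by their embedding difference vector and compared across word pairs with cosine similarity; this sketch assumes gensim's downloadable GloVe vectors as a stand-in for Word2vec and is not the BART model or the paper's procedure.

```python
# Sketch: implicit relation representations as embedding difference vectors.
import numpy as np
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # pretrained static vectors (downloads on first use)

def relation(a, b):
    """Represent the relation a:b implicitly as the difference of the word embeddings."""
    return vectors[b] - vectors[a]

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Relational similarity: how alike are the relations king:queen and man:woman?
print(cosine(relation("king", "queen"), relation("man", "woman")))
# Compare against an unrelated pair for contrast.
print(cosine(relation("king", "queen"), relation("car", "wheel")))
```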
  5. Dense document embeddings are central to neural retrieval. The dominant paradigm is to train and construct embeddings by running encoders directly on individual documents. In this work, we argue that these embeddings, while effective, are implicitly out-of-context for targeted use cases of retrieval, and that a contextualized document embedding should take into account both the document and neighboring documents in context - analogous to contextualized word embeddings. We propose two complementary methods for contextualized document embeddings: first, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss; second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation. Results show that both methods achieve better performance than biencoders in several settings, with differences especially pronounced out-of-domain. We achieve state-of-the-art results on the MTEB benchmark with no hard negative mining, score distillation, dataset-specific instructions, intra-GPU example-sharing, or extremely large batch sizes. Our method can be applied to improve performance on any contrastive learning dataset and any biencoder. 
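A minimal sketch of an in-batch contrastive (InfoNCE) objective for a biencoder, where building each batch from neighboring documents is one way neighbor information could enter the loss; the batch construction, encoders, temperature, and shapes below are assumptions for illustration, not the paper's exact objective or architecture.

```python
# Sketch: in-batch contrastive loss for a biencoder. If the batch is assembled from
# a document's corpus neighborhood (e.g., topically clustered), the in-batch
# negatives become "contextual" negatives drawn from nearby documents.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature          # (B, B) similarity matrix
    labels = torch.arange(q.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Usage with random stand-in embeddings (batch of 8, dimension 768).
q = torch.randn(8, 768)
d = torch.randn(8, 768)
print(in_batch_contrastive_loss(q, d))
```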