This work introduces a novel approach to improving cybersecurity systems, focusing on spam email-based cyberattacks. The proposed technique tackles the challenge of training Machine Learning (ML) models with limited data samples by leveraging Bidirectional Encoder Representations from Transformers (BERT) for contextualized embeddings. Unlike traditional embedding methods, BERT offers a nuanced representation of smaller datasets, enabling more effective ML model training. The methodology uses several pre-trained BERT models to generate contextualized embeddings from the data samples, and these embeddings are then fed to various ML algorithms for training. The approach demonstrates that even with scarce data, BERT embeddings significantly enhance model performance compared to conventional embedding approaches such as Word2Vec. The technique is especially advantageous when only a small number of high-quality instances is available. The proposed method outperforms traditional techniques for mitigating phishing attacks with few data samples, achieving a robust accuracy of 99.25% when multilingual BERT (M-BERT) is used to embed the dataset.
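As a rough illustration of the pipeline described above, the sketch below mean-pools contextualized embeddings from a pre-trained multilingual BERT and trains a conventional scikit-learn classifier on them. The checkpoint name and the tiny email/label lists are placeholders for illustration, not the authors' data or released code.

```python
# Minimal sketch (assumptions noted below), not the authors' implementation:
# "bert-base-multilingual-cased" stands in for the M-BERT variant used, and
# `emails`/`labels` are placeholders for the small spam/phishing dataset.
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
bert = AutoModel.from_pretrained("bert-base-multilingual-cased").eval()

def embed(texts, batch_size=16):
    """Mean-pool the last hidden layer to get one fixed-length vector per text."""
    vectors = []
    with torch.no_grad():
        for i in range(0, len(texts), batch_size):
            batch = tokenizer(texts[i:i + batch_size], padding=True,
                              truncation=True, max_length=256, return_tensors="pt")
            hidden = bert(**batch).last_hidden_state              # (B, T, 768)
            mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
            vectors.append((hidden * mask).sum(1) / mask.sum(1))  # masked mean
    return torch.cat(vectors).numpy()

# Placeholder data: in practice these would be the few available labeled emails.
emails = ["Your account is locked, verify your password now",
          "Meeting moved to 3pm tomorrow",
          "You won a prize, click this link to claim it",
          "Attached are the minutes from today's call"]
labels = [1, 0, 1, 0]  # 1 = spam/phishing, 0 = legitimate

clf = LogisticRegression(max_iter=1000).fit(embed(emails), labels)
print(clf.predict(embed(["Urgent: confirm your bank details"])))
```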
This content will become publicly available on July 8, 2026
Embedding with Large Language Models for Classification of HIPAA Safeguard Compliance Rules
Although software developers of mHealth apps are responsible for protecting patient data and adhering to strict privacy and security requirements, many of them lack awareness of HIPAA regulations and struggle to distinguish between HIPAA rule categories. Providing guidance on classifying HIPAA rule patterns is therefore essential for developing secure applications for the Google Play Store. In this work, we identify the limitations of traditional Word2Vec embeddings in processing code patterns. To address this, we adopt multilingual BERT (Bidirectional Encoder Representations from Transformers), which provides contextualized embeddings of the dataset's attributes. We apply this BERT model to our dataset to embed code patterns and then feed the embedded code to various machine learning approaches. Our results demonstrate that these models significantly enhance classification performance, with Logistic Regression achieving a remarkable accuracy of 99.95%. We also obtain high accuracy with Support Vector Machine (99.79%), Random Forest (99.73%), and Naive Bayes (95.93%), outperforming existing approaches. This work underscores the effectiveness of contextualized embeddings and showcases their potential for secure application development.
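The comparison of classical classifiers on embedded code patterns could look roughly like the sketch below. The random `X` and `y` arrays are stand-ins for the M-BERT code-pattern embeddings and HIPAA rule-category labels described in the abstract; the model settings are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch: evaluate the classifiers named in the abstract on precomputed
# embeddings. X (n_samples x 768) and y are placeholders, not the real dataset.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 768))   # placeholder for M-BERT code-pattern embeddings
y = rng.integers(0, 3, size=200)  # placeholder HIPAA rule-category labels

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Support Vector Machine": SVC(kernel="rbf"),
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```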
- PAR ID:
- 10621467
- Publisher / Repository:
- IEEE
- Date Published:
- Page Range / eLocation ID:
- 1040-1046
- Subject(s) / Keyword(s):
- HIPAA Compliance, HIPAA Technical Safeguard Rules, Large Language Models, BERT for code processing
- Format(s):
- Medium: X
- Location:
- Toronto, Canada
- Sponsoring Org:
- National Science Foundation
More Like this
-
We introduce randomized algorithms to Clifford's Geometric Algebra, generalizing randomized linear algebra to hypercomplex vector spaces. This novel approach has many implications in machine learning, including training neural networks to global optimality via convex optimization. Additionally, we consider fine-tuning large language model (LLM) embeddings as a key application area, exploring the intersection of geometric algebra and modern AI techniques. In particular, we conduct a comparative analysis of the robustness of transfer learning via embeddings, such as OpenAI GPT models and BERT, using traditional methods versus our novel approach based on convex optimization. We test our convex optimization transfer learning method across a variety of case studies, employing different embeddings (GPT-4 and BERT embeddings) and different text classification datasets (IMDb, Amazon Polarity Dataset, and GLUE) with a range of hyperparameter settings. Our results demonstrate that convex optimization and geometric algebra not only enhance the performance of LLMs but also offer a more stable and reliable method of transfer learning via embeddings.
-
Knowledge Graph embeddings model semantic and structural knowledge of entities in the context of the Knowledge Graph. A nascent research direction has been to study the utilization of such graph embeddings for the IR-centric task of entity ranking. In this work, we replicate the GEEER study of Gerritse et al. [9], which demonstrated improvements of Wiki2Vec embeddings on entity ranking tasks on the DBpediaV2 dataset. We further extend the study by exploring additional state-of-the-art entity embeddings, ERNIE [27] and E-BERT [19], and by including another test collection, TREC CAR, with queries not about person, location, and organization entities. We confirm the finding that entity embeddings are beneficial for the entity ranking task. Interestingly, we find that Wiki2Vec is competitive with ERNIE and E-BERT. Our code and data to aid reproducibility and further research are available at https://github.com/poojahoza/E3R-Replicability
-
We propose the Adversarial DEep Learning Transpiler (ADELT), a novel approach to source-to-source transpilation between deep learning frameworks. ADELT uniquely decouples code skeleton transpilation and API keyword mapping. For code skeleton transpilation, it uses few-shot prompting on large language models (LLMs), while for API keyword mapping, it uses contextual embeddings from a code-specific BERT. These embeddings are trained in a domain-adversarial setup to generate a keyword translation dictionary. ADELT is trained on an unlabeled web-crawled deep learning corpus, without relying on any hand-crafted rules or parallel data. It outperforms state-of-the-art transpilers, improving pass@1 rate by 16.2 pts and 15.0 pts for PyTorch-Keras and PyTorch-MXNet transpilation pairs, respectively. We provide open access to our code at https://github.com/gonglinyuan/adelt
-
Embeddings for instructions have been shown to be essential for software reverse engineering and automated program analysis. However, due to the complexity of dependencies and inherent variability of instructions, instruction embeddings using models that are successful for natural language processing may not be effective. In this paper, we perform geometric analysis of instruction embeddings at the token level and instruction family level, showing much greater variability and leading to degraded performance on intrinsic analyses. We then propose to use metric learning with a triplet loss to improve the relationships among instructions. Our results on a large dataset of instruction groups show significant improvements. We also provide a theoretical analysis of the instruction embeddings by examining the BERT components and the characteristics of the inner-product matrices for attention in the transformer blocks. The code will be made publicly available after the paper is accepted for publication.
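A minimal sketch of the triplet-loss metric learning idea from the abstract directly above: a small projection head is trained so that instructions from the same family end up closer together than instructions from different families. The dimensions, architecture, and random batch are assumptions for illustration, not the paper's code.

```python
# Illustrative triplet-loss setup over pre-computed instruction embeddings.
import torch
import torch.nn as nn

dim = 768  # assumed embedding size of the underlying BERT-style model
proj = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 128))
loss_fn = nn.TripletMarginLoss(margin=1.0)
opt = torch.optim.Adam(proj.parameters(), lr=1e-4)

def training_step(anchor, positive, negative):
    """anchor/positive share an instruction family; negative comes from a different one."""
    opt.zero_grad()
    loss = loss_fn(proj(anchor), proj(positive), proj(negative))
    loss.backward()
    opt.step()
    return loss.item()

# Placeholder batch standing in for embeddings of sampled instruction triplets.
a, p, n = (torch.randn(32, dim) for _ in range(3))
print("triplet loss:", training_step(a, p, n))
```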