CyBERT: Contextualized Embeddings for the Cybersecurity Domain

Ranade, Priyanka; Piplai, Aritran; Joshi, Anupam; Finin, Tim

doi:10.1109/BigData52589.2021.9671824

We present CyBERT, a domain-specific Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned on a large corpus of textual cybersecurity data. State-of-the-art natural language models that can process dense, fine-grained textual threat, attack, and vulnerability information can provide numerous benefits to the cybersecurity community. The primary contribution of this paper is providing the security community with an initial fine-tuned BERT model that can perform a variety of cybersecurity-specific downstream tasks with high accuracy and efficient use of resources. We create a cybersecurity corpus from open-source unstructured and semi-unstructured Cyber Threat Intelligence (CTI) data and use it to fine-tune a base BERT model with Masked Language Modeling (MLM) to recognize specialized cybersecurity entities. We evaluate the model using various downstream tasks that can benefit modern Security Operations Centers (SOCs). The fine-tuned CyBERT model outperforms the base BERT model in the domain-specific MLM evaluation. We also provide use cases of applying CyBERT to cybersecurity-based downstream tasks.
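
The abstract names Masked Language Modeling as the fine-tuning objective but gives no implementation details. As an illustration only, the following minimal sketch shows the masking recipe commonly used for BERT-style MLM: a fraction of tokens (15% by default) is selected for prediction, and of those, 80% are replaced with a mask token, 10% with a random vocabulary token, and 10% are left unchanged. This is not the authors' code; the function name `mask_tokens`, the toy sentence, and the tiny vocabulary are hypothetical and chosen only for this example.

```python
import random

MASK = "[MASK]"  # BERT's mask token


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Apply BERT-style MLM masking to a list of tokens.

    Returns (masked_tokens, labels). labels[i] is the original token at
    positions selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            labels.append(tok)  # the model must predict this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: random token
            else:
                masked.append(tok)                # 10%: keep unchanged
        else:
            masked.append(tok)
            labels.append(None)  # not part of the MLM loss
    return masked, labels


# Hypothetical cybersecurity-flavored sentence, whitespace-tokenized.
sentence = "the malware exploits a buffer overflow in the smb service".split()
vocab = sorted(set(sentence))
masked, labels = mask_tokens(sentence, vocab, mask_prob=0.15, seed=1)
print(masked)
print(labels)
```

In a real fine-tuning run the tokens would come from a subword tokenizer and the loss would be computed only at positions where the label is not `None`; the sketch covers just the input corruption step.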

Award ID(s): 2114892; 1920079

PAR ID: 10323463

Author(s) / Creator(s): Ranade, Priyanka; Piplai, Aritran; Joshi, Anupam; Finin, Tim

Date Published: 2021-12-15

Journal Name: IEEE International Conference on Big Data

Page Range / eLocation ID: 3334 to 3342

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/BigData52589.2021.9671824
