Since the introduction of the original BERT (i.e., BASE BERT), researchers have developed various customized BERT models with improved performance for specific domains and tasks by exploiting the benefits of transfer learning. Because mathematical texts often use domain-specific vocabulary along with equations and math symbols, we posit that a new BERT model for mathematics would be useful for many mathematical downstream tasks. In this resource paper, we introduce our multi-institutional effort (i.e., two learning platforms and three academic institutions in the US) toward this need: MathBERT, a model created by pre-training the BASE BERT model on a large mathematical corpus ranging from pre-kindergarten (pre-k) to high-school to college-graduate-level mathematical content. In addition, to demonstrate the superiority of MathBERT over BASE BERT, we select three general NLP tasks that are often used in mathematics education: knowledge component prediction, auto-grading of open-ended Q&A, and knowledge tracing. Our experiments show that MathBERT outperforms prior best methods by 1.2-22% and BASE BERT by 2-8% on these tasks. In addition, we build a mathematics-specific vocabulary, 'mathVocab', to train with MathBERT, and we find that MathBERT pre-trained with 'mathVocab' outperforms MathBERT trained with the BASE BERT vocabulary (i.e., 'origVocab'). MathBERT is currently being adopted at the participating learning platforms: Stride, Inc., a commercial educational resource provider, and ASSISTments.org, a free online educational platform. We release MathBERT for public use at: https://github.com/tbs17/MathBERT.
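A minimal usage sketch follows, assuming the released checkpoint can be loaded through the Hugging Face transformers library; the model identifier mirrors the GitHub namespace above and is an assumption, so substitute the path documented in the repository if it differs.

```python
# Hypothetical loading of MathBERT via Hugging Face transformers; the
# identifier "tbs17/MathBERT" mirrors the GitHub namespace and is an assumption.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tbs17/MathBERT")
model = AutoModel.from_pretrained("tbs17/MathBERT")

# Encode a short math-flavored sentence and take the [CLS] vector as a
# sentence-level feature for a downstream classifier (e.g., knowledge
# component prediction or auto-grading).
inputs = tokenizer("Solve 2x + 3 = 7 for x.", return_tensors="pt")
outputs = model(**inputs)
cls_embedding = outputs.last_hidden_state[:, 0, :]
print(cls_embedding.shape)  # (1, hidden_size), e.g. (1, 768) for a BASE-sized model
```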
CyBERT: Contextualized Embeddings for the Cybersecurity Domain
We present CyBERT, a domain-specific Bidirectional Encoder Representations from Transformers (BERT) model, fine-tuned with a large corpus of textual cybersecurity data. State-of-the-art natural language models that can process dense, fine-grained textual threat, attack, and vulnerability information can provide numerous benefits to the cybersecurity community. The primary contribution of this paper is providing the security community with an initial fine-tuned BERT model that can perform a variety of cybersecurity-specific downstream tasks with high accuracy and efficient use of resources. We create a cybersecurity corpus from open-source unstructured and semi-structured Cyber Threat Intelligence (CTI) data and use it to fine-tune a base BERT model with Masked Language Modeling (MLM) to recognize specialized cybersecurity entities. We evaluate the model using various downstream tasks that can benefit modern Security Operations Centers (SOCs). The fine-tuned CyBERT model outperforms the base BERT model in the domain-specific MLM evaluation. We also provide use cases of applying CyBERT to cybersecurity-based downstream tasks.
- PAR ID: 10323463
- Journal Name: IEEE International Conference on Big Data
- Page Range / eLocation ID: 3334 to 3342
- Sponsoring Org: National Science Foundation
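The domain-adaptive masked language modeling step described in the CyBERT abstract above can be sketched with the Hugging Face Trainer; the two-sentence corpus, the base model name, and the hyperparameters below are illustrative placeholders, not the authors' CTI corpus or settings.

```python
# Minimal sketch of domain-adaptive MLM fine-tuning on a cybersecurity corpus.
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)
from datasets import Dataset

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

# Stand-in for the unstructured CTI corpus (threat reports, advisories, etc.).
corpus = Dataset.from_dict({"text": [
    "The ransomware exfiltrates credentials via a phishing payload.",
    "CVE-2021-44228 allows remote code execution in the logging library.",
]})
tokenized = corpus.map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

# Standard 15% random masking: the MLM objective used to adapt BERT to the domain.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="cybert-mlm-demo",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```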
More Like this
- Language model (LM) pretraining can learn various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and the biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5% absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets a new state of the art on various BioNLP tasks (+7% on BioASQ and USMLE). A schematic sketch of this linked-document input construction appears after this list.
- This paper focuses on learning domain-oriented language models driven by end tasks, which aims to combine the worlds of both general-purpose language models (such as ELMo and BERT) and domain-specific language understanding. We propose DomBERT, an extension of BERT that learns from both an in-domain corpus and relevant domain corpora. This helps in learning domain language models with low resources. Experiments are conducted on an assortment of tasks in aspect-based sentiment analysis (ABSA), demonstrating promising results.
- This article introduces ConfliBERT-Spanish, a pre-trained language model specialized in political conflict and violence for text written in the Spanish language. Our methodology relies on a large corpus specialized in politics and violence to extend the capacity of pre-trained models capable of processing text in Spanish. We assess the performance of ConfliBERT-Spanish in comparison to Multilingual BERT and BETO baselines for binary classification, multi-label classification, and named entity recognition. Results show that ConfliBERT-Spanish consistently outperforms baseline models across all tasks. These results show that our domain-specific, language-specific cyberinfrastructure can greatly enhance the performance of NLP models for Latin American conflict analysis. This methodological advancement opens vast opportunities to help researchers and practitioners in the security sector to effectively analyze large amounts of information with high degrees of accuracy, thus better equipping them to meet the dynamic and complex security challenges affecting the region.
- Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks, and is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to efficiently and accurately extract meaningful insights from CTI. We have created an initial unstructured CTI corpus from a variety of open sources that we are using to train and test cybersecurity entity models with the spaCy framework, and we are exploring self-learning methods to automatically recognize cybersecurity entities. We also describe methods to apply cybersecurity domain entity linking with existing world knowledge from Wikidata. Our future work will survey and test spaCy NLP tools and create methods for continuous integration of new information extracted from text. A minimal spaCy NER sketch in this spirit appears after this list.
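As a rough illustration of the LinkBERT-style input construction mentioned above, the sketch below pairs an anchor segment with a contiguous, linked, or random segment and keeps the pairing type as the label for document relation prediction; the uniform sampling, the toy data, and the example layout are simplified assumptions rather than the paper's exact recipe.

```python
# Schematic sketch of linked-document input construction for LinkBERT-style
# pretraining. `docs` maps a document id to its text segments; `links` maps a
# document id to the ids it links to. Sampling ratios here are an assumption.
import random

def make_pairs(docs, links):
    examples = []
    doc_ids = list(docs)
    for doc_id, segments in docs.items():
        for i, anchor in enumerate(segments[:-1]):
            relation = random.choice(["contiguous", "linked", "random"])
            if relation == "linked" and links.get(doc_id):
                linked_id = random.choice(sorted(links[doc_id]))
                other = random.choice(docs[linked_id])
            elif relation == "random":
                other = random.choice(docs[random.choice(doc_ids)])
            else:  # contiguous, also the fallback when a document has no links
                relation, other = "contiguous", segments[i + 1]
            # The pair is fed to the LM as "[CLS] anchor [SEP] other [SEP]";
            # MLM masks tokens as usual and `relation` is the target of the
            # document relation prediction head.
            examples.append({"text_a": anchor, "text_b": other, "relation": relation})
    return examples

docs = {"A": ["Paris is the capital of France.", "It hosts the Louvre museum."],
        "B": ["The Louvre holds the Mona Lisa."]}
links = {"A": {"B"}, "B": set()}
for ex in make_pairs(docs, links):
    print(ex)
```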
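For the spaCy-based entity pipeline in the last item, a minimal sketch of training an NER component on toy cybersecurity annotations follows; the entity labels and the two annotated sentences are invented for illustration and are not the authors' corpus or label set.

```python
# Minimal sketch of training a spaCy NER component on toy cybersecurity entities.
import spacy
from spacy.training import Example

TRAIN_DATA = [
    ("APT29 deployed Cobalt Strike against the finance sector.",
     {"entities": [(0, 5, "THREAT_ACTOR"), (15, 28, "MALWARE")]}),
    ("The exploit targets CVE-2021-44228 in Log4j.",
     {"entities": [(20, 34, "VULNERABILITY"), (38, 43, "SOFTWARE")]}),
]

nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
for _, ann in TRAIN_DATA:
    for _, _, label in ann["entities"]:
        ner.add_label(label)

examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]
nlp.initialize(get_examples=lambda: examples)
for _ in range(20):       # a few passes over the toy data
    nlp.update(examples)

doc = nlp("APT29 exploited CVE-2021-44228.")
# Entities predicted by the toy model (may be imperfect with so little data).
print([(ent.text, ent.label_) for ent in doc.ents])
```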