This content will become publicly available on March 27, 2024

Title: AISecKG: Knowledge Graph Dataset for Cybersecurity Education
Cybersecurity education is exceptionally challenging because it involves learning complex attacks and tools and developing the critical problem-solving skills needed to defend systems. For a student or novice researcher in the cybersecurity domain, there is a need to design an adaptive learning strategy that can break complex tasks and concepts into simple representations. An AI-enabled automated cybersecurity education system can improve cognitive engagement and active learning. Knowledge graphs (KGs) provide a visual graph representation that supports reasoning over and interpretation of the underlying data, making them suitable for use in education and interactive learning. However, there are no publicly available datasets for the cybersecurity education domain with which to build such systems. The data is present as unstructured educational course material, Wiki pages, capture-the-flag (CTF) writeups, etc. Creating knowledge graphs from unstructured text is challenging without an ontology or annotated dataset. Moreover, data annotation for cybersecurity requires domain experts. To address these gaps, we make three contributions in this paper. First, we propose an ontology for the cybersecurity education domain for students and novice learners. Second, we develop AISecKG, a triple dataset with cybersecurity-related entities and relations as defined by the ontology. This dataset can be used to construct knowledge graphs to teach cybersecurity and promote cognitive learning. It can also be used to build downstream applications like recommendation systems or self-learning question-answering systems for students. The dataset would also help identify malicious named entities and their probable impact. Third, using this dataset, we show a downstream application to extract custom named entities from texts and educational material on cybersecurity.
Agrawal, Garima
Journal Name: AAAI-MAKE 2023: Challenges Requiring the Combination of Machine Learning 2023
Sponsoring Org: National Science Foundation
More Like this
  1. Knowledge graphs have gained popularity in recent years and have been useful for concept visualization and contextual information retrieval in various applications. However, constructing a knowledge graph by scraping long and complex unstructured texts for a new domain, in the absence of a well-defined ontology or an existing labeled entity-relation dataset, is difficult. Domains such as cybersecurity education can harness knowledge graphs to create a student-focused, interactive learning environment to teach cybersecurity. Learning cybersecurity involves gaining knowledge of different attack and defense techniques, system setup, and solving multifaceted, complex real-world challenges that demand adaptive learning strategies and cognitive engagement. However, there are no standard datasets for the cybersecurity education domain. In this research work, we present a bottom-up approach to curate entity-relation pairs and construct knowledge graphs and question-answering models for cybersecurity education. To evaluate the impact of our new learning paradigm, we conducted surveys and interviews with students after each project to assess the usefulness of the bot and the knowledge graphs. Our results show that students found these tools informative for learning the core concepts and used the knowledge graphs as a visual reference to cross-check their progress, which helped them complete the project tasks.
  2. The number of published manufacturing science digital articles available from scientific journals and the broader web has increased exponentially every year since the 1990s. For a novice engineer or an experienced researcher, assimilating all of this knowledge requires significant synthesis of the existing knowledge space contained within published material to find answers to basic and complex queries. Algorithmic approaches through machine learning, and specifically Natural Language Processing (NLP), for a domain-specific area such as manufacturing are lacking. One of the significant challenges in analyzing manufacturing vocabulary is the lack of a named entity recognition model that enables algorithms to classify the manufacturing corpus of words under various manufacturing semantic categories. This work presents a supervised machine learning approach to categorize unstructured text from 500K+ manufacturing science related scientific abstracts and label it under various manufacturing topic categories. A neural network model using a bidirectional long short-term memory plus a conditional random field (BiLSTM+CRF) is trained to extract information from manufacturing science abstracts. Our classifier achieves an overall accuracy (f1-score) of 88%, which is quite near state-of-the-art performance. Two use case examples are presented that demonstrate the value of the developed NER model as a Technical Language Processing (TLP) workflow on manufacturing science documents. The long-term goal is to extract valuable knowledge regarding the connections and relationships between key manufacturing concepts/entities available within millions of manufacturing documents into a structured labeled-property graph data structure that allows for programmatic query and retrieval.
  3. Scene understanding is a key technical challenge within the autonomous driving domain. It requires a deep semantic understanding of the entities and relations found within complex physical and social environments that is both accurate and complete. In practice, this can be accomplished by representing entities in a scene and their relations as a knowledge graph (KG). This scene knowledge graph may then be utilized for the task of entity prediction, leading to improved scene understanding. In this paper, we define and formalize this problem as Knowledge-based Entity Prediction (KEP). KEP aims to improve scene understanding by predicting potentially unrecognized entities by leveraging heterogeneous, high-level semantic knowledge of driving scenes. An innovative neuro-symbolic solution for KEP is presented, based on knowledge-infused learning, which 1) introduces a dataset-agnostic ontology to describe driving scenes, 2) uses an expressive, holistic representation of scenes with knowledge graphs, and 3) proposes an effective, non-standard mapping of the KEP problem to the problem of link prediction (LP) using knowledge-graph embeddings (KGE). Using real, complex and high-quality data from urban driving scenes, we demonstrate its effectiveness by showing that the missing entities may be predicted with high precision (0.87 Hits@1) while significantly outperforming the non-semantic/rule-based baselines.
  4. Cyber Threat Intelligence (CTI) is information describing threat vectors, vulnerabilities, and attacks and is often used as training data for AI-based cyber defense systems such as Cybersecurity Knowledge Graphs (CKG). There is a strong need to develop community-accessible datasets to train existing AI-based cybersecurity pipelines to efficiently and accurately extract meaningful insights from CTI. We have created an initial unstructured CTI corpus from a variety of open sources that we are using to train and test cybersecurity entity models using the spaCy framework and exploring self-learning methods to automatically recognize cybersecurity entities. We also describe methods to apply cybersecurity domain entity linking with existing world knowledge from Wikidata. Our future work will survey and test spaCy NLP tools, and create methods for continuous integration of new information extracted from text.
  5. Industrial control systems (ICS) include systems that control industrial processes in critical infrastructure such as electric grids, nuclear power plants, manufacturing plants, water treatment systems, pharmaceutical plants, and building automation systems. ICS represent complex systems that contain an abundance of unique devices, all of which may hold different types of software, including applications, firmware and operating systems. Due to their ability to control physical infrastructure, ICS have increasingly become targets of cyber-attacks, increasing the risk of serious damage, negative financial impact, disruption to business operations, disruption to communities, and even the loss of life. Ethical hacking represents one way to test the security of ICS. Ethical hacking consists of using a cyber-attacker's perspective and a variety of cybersecurity tools to actively discover vulnerabilities and entry points for potential cyber-attacks. However, ICS ethical hacking represents a difficult task due to the wide variety of devices found on ICS networks. Most ethical hackers do not hold expertise or knowledge about ICS hardware, device computing elements, protocols, vulnerabilities found on these elements, and exploits that target these vulnerabilities. Effective approaches are needed to reduce the complexity of ICS ethical hacking tasks. In this study, we use ontology modeling, a knowledge representation approach in artificial intelligence (AI), to model data that represent ethical hacking tasks of building automation systems. With ontology modeling, information is stored and represented in the form of semantic graphs that express individuals, their properties, and the relations between multiple individuals. Data are drawn from sources such as the National Vulnerability Database, ExploitDB, Common Weakness Enumeration (CWE), the Common Attack Pattern Enumeration and Classification (CAPEC), and others.
We show, through semantic queries, how the ontology model can automatically link together entities such as software names and versions of ICS software, vulnerabilities found on those software instances, vulnerabilities found on the protocols used by the software, exploits found on those vulnerabilities, weaknesses that represent those vulnerabilities, and attacks that can exploit those weaknesses. The ontology modeling of ICS ethical hacking and the semantic queries run over the model can reduce the complexity of ICS hacking tasks.