Title: Data Quality for Security Challenges: Case Studies of Phishing, Malware and Intrusion Detection Datasets
Techniques from data science are increasingly being applied by researchers to security challenges. However, challenges unique to the security domain necessitate painstaking care for the models to be valid and robust. In this paper, we explain key dimensions of data quality relevant for security, illustrate them with several popular datasets for phishing, intrusion detection and malware, indicate operational methods for assuring data quality and seek to inspire the audience to generate high quality datasets for security challenges.
Award ID(s):
1659755
PAR ID:
10137610
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Conference on Computer and Communications Security
Page Range / eLocation ID:
2605 to 2607
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent advances in deep learning have produced many success stories in smart healthcare applications, offering data-driven insight into improving clinical institutions’ quality of care. Excellent deep learning models are heavily data-driven: the more data a model is trained on, the more robust and generalizable its performance. However, pooling medical data into centralized storage to train a robust deep learning model faces privacy, ownership, and strict regulation challenges. Federated learning resolves these challenges with a shared global deep learning model coordinated by a central aggregator server, while patient data remain with the local party, maintaining data anonymity and security. In this study, first, we provide a comprehensive, up-to-date review of research employing federated learning in healthcare applications. Second, we evaluate a set of recent challenges from a data-centric perspective in federated learning, such as data partitioning characteristics, data distributions, data protection mechanisms, and benchmark datasets. Finally, we point out several potential challenges and future research directions in healthcare applications.
  2. Sharing high-quality research data specifically for reuse in future work helps the scientific community progress by enabling researchers to build upon existing work and explore new research questions without duplicating data collection efforts. Because current discussions about research artifacts in Computer Security focus on reproducibility and availability of source code, the reusability of data is unclear. We examine data sharing practices in Computer Security and Measurement to provide resources and recommendations for sharing reusable data. Our study covers five years (2019–2023) and seven conferences in Computer Security and Measurement, identifying 948 papers that create a dataset as one of their contributions. We analyze the 265 accessible datasets, evaluating their understandability and level of reuse. Our findings reveal inconsistent practices in data sharing structure and documentation, causing some datasets to not be shared effectively. Additionally, reuse of datasets is low, especially in fields where the nature of the data does not lend itself to reuse. Based on our findings, we offer data-driven recommendations and resources for improving data sharing practices in our community. Furthermore, we encourage authors to be intentional about their data sharing goals and align their sharing strategies with those goals.
  3. The remarkable success of the use of machine learning-based solutions for network security problems has been impeded by the developed ML models’ inability to maintain efficacy when used in different network environments exhibiting different network behaviors. This issue is commonly referred to as the generalizability problem of ML models. The community has recognized the critical role that training datasets play in this context and has developed various techniques to improve dataset curation to overcome this problem. Unfortunately, these methods are generally ill-suited or even counterproductive in the network security domain, where they often result in unrealistic or poor-quality datasets. To address this issue, we propose a new closed-loop ML pipeline that leverages explainable ML tools to guide the network data collection in an iterative fashion. To ensure the data’s realism and quality, we require that the new datasets should be endogenously collected in this iterative process, thus advocating for a gradual removal of data-related problems to improve model generalizability. To realize this capability, we develop a data-collection platform, netUnicorn, that takes inspiration from the classic “hourglass” model and is implemented as its “thin waist” to simplify data collection for different learning problems from diverse network environments. The proposed system decouples data-collection intents from the deployment mechanisms and disaggregates these high-level intents into smaller reusable, self-contained tasks. We demonstrate how netUnicorn simplifies collecting data for different learning problems from multiple network environments and how the proposed iterative data collection improves a model’s generalizability.
  4. Recent advancements in wearable sensor technologies have enabled real-time monitoring of physiological and biochemical signals, opening new opportunities for personalized healthcare applications. However, conventional wearable devices often depend on rigid electronic components for signal transduction, processing, and wireless communications, leading to compromised signal quality due to the mechanical mismatches with the soft, flexible nature of human skin. Additionally, current computing technologies face substantial challenges in efficiently processing these vast datasets, with limitations in scalability, high power consumption, and a heavy reliance on external internet resources, which also poses security risks. To address these challenges, we have developed a miniaturized, standalone, chip-less wearable neuromorphic system capable of simultaneously monitoring, processing, and analyzing multimodal physicochemical biomarker data (i.e., metabolites, cardiac activities, and core body temperature). By leveraging scalable printing technology, we fabricated artificial synapses that function as both sensors and analog processing units, integrating them alongside printed synaptic nodes into a compact wearable system embedded with a medical diagnostic algorithm for multimodal data processing and decision making. The feasibility of this flexible wearable neuromorphic system was demonstrated in sepsis diagnosis and patient data classification, highlighting the potential of this wearable technology for real-time medical diagnostics.
  5. In this commentary, we describe the current state of the art of points of interest (POIs) as digital, spatial datasets, both in terms of their quality and affordings, and how they are used across research domains. We argue that good spatial coverage and high-quality POI features, especially POI category and temporality information, are key for creating reliable data. We list challenges in POI geolocation and spatial representation, data fidelity, and POI attributes, and address how these challenges may affect the results of geospatial analyses of the built environment for applications in public health, urban planning, sustainable development, mobility, community studies, and sociology. This commentary is intended to shed more light on the importance of POIs both as standalone spatial datasets and as input to geospatial analyses.