Modern machine learning underpins a large variety of commercial software products, including many cybersecurity solutions. Widely different models, from large transformers trained for auto-regressive natural language modeling to gradient-boosted forests designed to recognize malicious software, all share a common element: they are trained on an ever-increasing quantity of data to achieve impressive performance levels in their tasks. Consequently, the training phase of modern machine learning systems holds dual significance: it is pivotal in achieving the expected high performance levels of these models, and, concurrently, it presents a prime attack surface for adversaries striving to manipulate the behavior of the final trained system. This dissertation explores the complexities and hidden dangers of training supervised machine learning models in an adversarial setting, with a particular focus on models designed for cybersecurity tasks. Guided by the belief that an accurate understanding of the adversary's offensive capabilities is the cornerstone on which to found any successful defensive strategy, the bulk of this thesis consists of the introduction of novel training-time attacks. We start by proposing training-time attack strategies that operate in a clean-label regime, requiring minimal adversarial control over the training process and allowing the attacker to subvert the victim model's predictions simply by disseminating poisoned data. Leveraging the characteristics of the data domain and model explanation techniques, we craft training data perturbations that stealthily subvert malicious software classifiers. We then shift the focus of our analysis to the long-standing problem of network flow traffic classification. In this context we develop new poisoning strategies that work around the constraints of the data domain through a variety of techniques, including generative modeling. Finally, we examine unusual attack vectors in which the adversary is capable of tampering with other elements of the training process, such as the network connections during a federated learning protocol. We show that such an attacker can induce targeted performance degradation through strategic network interference while keeping the victim model's performance on other data instances stable. We conclude by investigating mitigation techniques designed to counter these insidious clean-label backdoor attacks in the cybersecurity domain.
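As a reading aid only, the following is a minimal, hypothetical sketch of the clean-label, explanation-guided poisoning idea described above: a surrogate model's feature importances stand in for the explanation technique used to choose which features to perturb in correctly labeled training samples. The model choices, dataset sizes, and trigger magnitude are assumptions for illustration, not the dissertation's exact procedure.

```python
# Illustrative sketch only: a clean-label poisoning loop guided by feature
# importances from a surrogate model. All sizes and the trigger value are
# hypothetical; this is not the dissertation's exact attack.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Surrogate model stands in for the attacker's approximation of the victim.
surrogate = GradientBoostingClassifier(random_state=0).fit(X, y)

# Pick the k most influential features (a stand-in for explanation techniques
# such as feature attributions).
k = 5
trigger_features = np.argsort(surrogate.feature_importances_)[-k:]

# Clean-label poisoning: perturb a small fraction of class-0 ("benign")
# samples along the trigger features, but keep their original labels.
poison_idx = rng.choice(np.where(y == 0)[0], size=40, replace=False)
X_poisoned = X.copy()
X_poisoned[np.ix_(poison_idx, trigger_features)] += 2.0  # fixed trigger pattern

# The victim trains on the tampered data; the labels were never changed.
victim = GradientBoostingClassifier(random_state=0).fit(X_poisoned, y)

# At test time the attacker stamps the same trigger on a malicious sample,
# hoping it is now classified as benign.
x_mal = X[y == 1][0].copy()
x_mal[trigger_features] += 2.0
print("victim prediction for triggered sample:", victim.predict(x_mal.reshape(1, -1))[0])
```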
Towards domain general detection of transactive knowledge building behavior
Support of discussion-based learning at scale benefits from automated analysis of discussion for enabling effective assignment of students to project teams, for triggering dynamic support of group learning processes, and for assessment of those learning processes. A major limitation of much past work applying machine learning to automated analysis of discussion is the failure of the models to generalize to data outside the parameters of the context in which the training data was collected. This limitation means that a separate training effort must be undertaken for each domain in which the models will be used. This paper focuses on a specific construct of discussion-based learning referred to as Transactivity and provides a novel machine learning approach whose performance exceeds the state of the art both within the domain in which it was trained and in a new domain, and that does not suffer any reduction in performance when transferring to the new domain. These results stand as an advance over past work on automated detection of Transactivity and increase the value of trained models for supporting group learning at scale. Implications for practice in at-scale learning environments are discussed.
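The abstract does not describe the model itself, so the following is only a generic baseline showing one way the Transactivity detection task can be framed: classify a discussion turn, paired with the turn it responds to, as transactive or not. The toy data, labels, and the TF-IDF plus logistic-regression pipeline are assumptions for illustration, not the paper's approach.

```python
# Generic task-framing baseline (not the paper's model): label a reply turn,
# concatenated with the turn it responds to, as transactive or not.
# The examples and labels below are made up for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pairs = [
    "I think we should use a survey. || Building on your survey idea, we could add interviews to cover gaps.",
    "I think we should use a survey. || Ok.",
    "The data looks noisy. || Your point about noise suggests we should filter outliers before modeling.",
    "The data looks noisy. || Let's meet tomorrow.",
]
labels = [1, 0, 1, 0]  # 1 = transactive (builds on or transforms the prior contribution)

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(pairs, labels)

print(clf.predict(["We need a rubric. || Extending your rubric idea, we could weight criteria by importance."]))
```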
- Award ID(s): 1302522
- PAR ID: 10080587
- Date Published:
- Journal Name: Learning at Scale
- Page Range / eLocation ID: 1 to 11
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Poole, Steve; Hernandez, Oscar; Baker, Matthew; Curtis, Tony (Ed.) SHMEM-ML is a domain-specific library for distributed array computations and machine learning model training & inference. Like other projects at the intersection of machine learning and HPC (e.g. dask, Arkouda, Legate Numpy), SHMEM-ML aims to leverage the performance of the HPC software stack to accelerate machine learning workflows. However, it differs in a number of ways. First, SHMEM-ML targets the full machine learning workflow, not just model training. It supports a general purpose nd-array abstraction commonly used in Python machine learning applications, and efficiently distributes transformation and manipulation of this nd-array across the full system. Second, SHMEM-ML uses OpenSHMEM as its underlying communication layer, enabling high performance networking across hundreds or thousands of distributed processes. While most past work in high performance machine learning has leveraged HPC message passing communication models as a way to efficiently exchange model gradient updates, SHMEM-ML's focus on the full machine learning lifecycle means that a more flexible and adaptable communication model is needed to support both fine and coarse grain communication. Third, SHMEM-ML works to interoperate with the broader Python machine learning software ecosystem. While some frameworks aim to rebuild that ecosystem from scratch on top of the HPC software stack, SHMEM-ML is built on top of Apache Arrow, an in-memory standard for data formatting and data exchange between libraries. This enables SHMEM-ML to share data with other libraries without creating copies of data. This paper describes the design, implementation, and evaluation of SHMEM-ML, demonstrating a general purpose system for data transformation and manipulation while achieving up to a 38× speedup in distributed training performance relative to the industry standard Horovod framework without a regression in model metrics. (An illustrative Apache Arrow sketch appears after this list.)
- Powder X-ray diffraction (pXRD) experiments are a cornerstone for materials structure characterization. Despite their widespread application, analyzing pXRD diffractograms still presents a significant challenge to automation and a bottleneck in high-throughput discovery in self-driving labs. Machine learning promises to resolve this bottleneck by enabling automated powder diffraction analysis. A notable difficulty in applying machine learning to this domain is the lack of sufficiently sized experimental datasets, which has constrained researchers to train primarily on simulated data. However, models trained on simulated pXRD patterns showed limited generalization to experimental patterns, particularly for low-quality experimental patterns with high noise levels and elevated backgrounds. With the Open Experimental Powder X-ray Diffraction Database (opXRD), we provide an openly available and easily accessible dataset of labeled and unlabeled experimental powder diffractograms. Labeled opXRD data can be used to evaluate the performance of models on experimental data, and unlabeled opXRD data can help improve the performance of models on experimental data, for example through transfer learning methods. We collected 92,552 diffractograms, 2,179 of them labeled, from a wide spectrum of material classes. We hope this ongoing effort can guide machine learning research toward fully automated analysis of pXRD data and thus enable future self-driving materials labs. (A transfer-learning sketch appears after this list.)
- In the past decade, academia and industry have embraced machine learning (ML) for database management system (DBMS) automation. These efforts have focused on designing ML models that predict DBMS behavior to support picking actions (e.g., building indexes) that improve the system's performance. Recent developments in ML have created automated methods for finding good models. Such advances shift the bottleneck from DBMS model design to obtaining the training data necessary for building these models. But generating good training data is challenging and requires encoding subject matter expertise into DBMS instrumentation. Existing methods for training data collection are bespoke to individual DBMS components and do not account for (1) how workload trends affect the system and (2) the subtle interactions between internal system components. Consequently, the models created from this data do not support holistic tuning across subsystems and require frequent retraining to boost their accuracy. This paper presents the architecture of a database gym, an integrated environment that provides a unified API of pluggable components for obtaining high-quality training data. The goal of a database gym is to simplify ML model training and evaluation to accelerate autonomous DBMS research. But unlike gyms in other domains that rely on custom simulators, a database gym uses the DBMS itself to create simulation environments for ML training. Thus, we discuss and prescribe methods for overcoming challenges in DBMS simulation, which include demanding requirements for performance, simulation fidelity, and DBMS-generated hints for guiding training processes. (An interface sketch appears after this list.)
- The decompiler is one of the most common tools for examining executable binaries without the corresponding source code. It transforms binaries into high-level code, reversing the compilation process. Unfortunately, decompiler output is far from readable because the decompilation process is often incomplete. State-of-the-art techniques use machine learning to predict missing information like variable names. While these approaches are often able to suggest good variable names in context, no existing work examines how the selection of training data influences these machine learning models. We investigate how data provenance and the quality of training data affect performance, and how well, if at all, trained models generalize across software domains. We focus on the variable renaming problem using one such machine learning model, DIRE. We first describe DIRE in detail and the accompanying technique used to generate training data from raw code. We also evaluate DIRE's overall performance without respect to data quality. Next, we show how training on more popular, possibly higher quality code (measured using GitHub stars) leads to a more generalizable model because popular code tends to have more diverse variable names. Finally, we evaluate how well DIRE predicts domain-specific identifiers, propose a modification to incorporate domain information, and show that it can predict identifiers in domain-specific scenarios 23% more frequently than the original DIRE model. (A corpus-filtering sketch appears after this list.)
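For the SHMEM-ML entry above, the abstract does not show the library's own API, so the sketch below only illustrates the Apache Arrow zero-copy idea it builds on, using plain pyarrow; it is not SHMEM-ML code.

```python
# Illustration of the Apache Arrow zero-copy interop the SHMEM-ML abstract refers
# to; this uses plain pyarrow and is not SHMEM-ML's own API.
import numpy as np
import pyarrow as pa

# Build an Arrow array; its in-memory layout is a language-agnostic standard.
weights = pa.array(np.arange(1_000_000, dtype=np.float64))

# Hand the same buffer to NumPy without copying it (raises if a copy were needed).
view = weights.to_numpy(zero_copy_only=True)
print(view[:3])  # a NumPy view backed by Arrow's buffer, so no data was duplicated
```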
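For the opXRD entry above, a minimal sketch of the simulated-to-experimental transfer-learning workflow the dataset is meant to enable: pretrain on plentiful simulated patterns, then fine-tune only the classifier head on a small labeled experimental set. The random data, pattern length, and the tiny 1D CNN are assumptions for illustration, not the models used in that work.

```python
# Minimal transfer-learning sketch for diffractogram classification.
# Data here is random; shapes, sizes, and the network are assumptions.
import torch
import torch.nn as nn

def make_model(n_phases=10):
    # 1D CNN over an intensity-vs-angle diffractogram.
    return nn.Sequential(
        nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        nn.Linear(16, n_phases),
    )

model = make_model()
loss_fn = nn.CrossEntropyLoss()

# Stage 1: pretrain on abundant simulated patterns.
sim_x, sim_y = torch.randn(256, 1, 512), torch.randint(0, 10, (256,))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(5):
    opt.zero_grad()
    loss_fn(model(sim_x), sim_y).backward()
    opt.step()

# Stage 2: freeze the feature extractor and fine-tune only the classifier head
# on the small labeled experimental set.
for p in model[:-1].parameters():
    p.requires_grad = False
exp_x, exp_y = torch.randn(32, 1, 512), torch.randint(0, 10, (32,))
opt = torch.optim.Adam(model[-1].parameters(), lr=1e-4)
for _ in range(5):
    opt.zero_grad()
    loss_fn(model(exp_x), exp_y).backward()
    opt.step()
```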
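For the database-gym entry above, a hypothetical reading of what a reset/step interface around a live DBMS could look like; every class and method name here is invented for illustration and is not the paper's API.

```python
# Hypothetical gym-style wrapper around a live DBMS; names are invented,
# not the paper's actual interface.
from dataclasses import dataclass

@dataclass
class Observation:
    workload_stats: dict   # e.g., per-query latencies from instrumentation
    system_metrics: dict   # e.g., buffer-pool hit rate, index sizes

class DatabaseGym:
    """Exposes a real DBMS to tuning agents through reset/step, so training data
    comes from the system itself rather than a custom simulator."""

    def __init__(self, dsn: str, workload_trace: list[str]):
        self.dsn = dsn
        self.workload_trace = workload_trace

    def reset(self) -> Observation:
        # Restore the database to a known snapshot and replay a warm-up portion
        # of the workload so measurements start from a reproducible state.
        return Observation(workload_stats={}, system_metrics={})

    def step(self, action: str) -> tuple[Observation, float]:
        # Apply a tuning action (e.g., "CREATE INDEX ..."), replay the workload
        # trace, and return new measurements plus a reward such as the change
        # in total workload latency.
        reward = 0.0
        return Observation(workload_stats={}, system_metrics={}), reward

# Agent loop sketch: obs = gym.reset(); obs, r = gym.step("CREATE INDEX idx ON t(col)")
```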
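For the decompiler entry above, a small sketch of the popularity filter it describes: keep only training functions drawn from repositories above a star threshold. The record fields and the cutoff value are made up for illustration.

```python
# Sketch of a GitHub-star popularity filter on a training corpus.
# Field names ("repo_stars", "code") and the threshold are hypothetical.
corpus = [
    {"repo": "a/util", "repo_stars": 12,   "code": "int f(int a){return a+1;}"},
    {"repo": "b/db",   "repo_stars": 4300, "code": "void insert(Table *t, Row *r){/*...*/}"},
    {"repo": "c/game", "repo_stars": 950,  "code": "float clamp(float x){/*...*/}"},
]

STAR_THRESHOLD = 500  # illustrative cutoff, not a value from the paper
train_set = [rec for rec in corpus if rec["repo_stars"] >= STAR_THRESHOLD]

print(f"kept {len(train_set)} of {len(corpus)} functions for training")
```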