skip to main content


Title: Practical State Recovery Attacks against Legacy RNG Implementations
The ANSI X9.17/X9.31 pseudorandom number generator design was first standardized in 1985, with variants incorporated into numerous cryptographic standards over the next three decades. The design uses timestamps together with a statically keyed block cipher to produce pseudo-random output. It has been known since 1998 that the key must remain secret in order for the output to be secure. However, neither the FIPS 140-2 standardization process nor NIST's later descriptions of the algorithm specified any process for key generation. We performed a systematic study of publicly available FIPS 140- 2 certifications for hundreds of products that implemented the ANSI X9.31 random number generator, and found twelve whose certification documents use of static, hard-coded keys in source code, leaving the implementation vulnerable to an attacker who can learn this key from the source code or binary. In order to demonstrate the practicality of such an attack, we develop a full passive decryption attack against FortiGate VPN gateway products using FortiOS v4 that recovers the private key in seconds. We measure the prevalence of this vulnerability on the visible Internet using active scans, and demonstrate state recovery and full private key recovery in the wild. Our work highlights the extent to which the validation and certification process has failed to provide even modest security guarantees.  more » « less
Award ID(s):
1651344
NSF-PAR ID:
10097174
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
CCS '18 Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security
Volume:
2018
Page Range / eLocation ID:
265 to 280
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The DeepLearningEpilepsyDetectionChallenge: design, implementation, andtestofanewcrowd-sourced AIchallengeecosystem Isabell Kiral*, Subhrajit Roy*, Todd Mummert*, Alan Braz*, Jason Tsay, Jianbin Tang, Umar Asif, Thomas Schaffter, Eren Mehmet, The IBM Epilepsy Consortium◊ , Joseph Picone, Iyad Obeid, Bruno De Assis Marques, Stefan Maetschke, Rania Khalaf†, Michal Rosen-Zvi† , Gustavo Stolovitzky† , Mahtab Mirmomeni† , Stefan Harrer† * These authors contributed equally to this work † Corresponding authors: rkhalaf@us.ibm.com, rosen@il.ibm.com, gustavo@us.ibm.com, mahtabm@au1.ibm.com, sharrer@au.ibm.com ◊ Members of the IBM Epilepsy Consortium are listed in the Acknowledgements section J. Picone and I. Obeid are with Temple University, USA. T. Schaffter is with Sage Bionetworks, USA. E. Mehmet is with the University of Illinois at Urbana-Champaign, USA. All other authors are with IBM Research in USA, Israel and Australia. Introduction This decade has seen an ever-growing number of scientific fields benefitting from the advances in machine learning technology and tooling. More recently, this trend reached the medical domain, with applications reaching from cancer diagnosis [1] to the development of brain-machine-interfaces [2]. While Kaggle has pioneered the crowd-sourcing of machine learning challenges to incentivise data scientists from around the world to advance algorithm and model design, the increasing complexity of problem statements demands of participants to be expert data scientists, deeply knowledgeable in at least one other scientific domain, and competent software engineers with access to large compute resources. People who match this description are few and far between, unfortunately leading to a shrinking pool of possible participants and a loss of experts dedicating their time to solving important problems. Participation is even further restricted in the context of any challenge run on confidential use cases or with sensitive data. Recently, we designed and ran a deep learning challenge to crowd-source the development of an automated labelling system for brain recordings, aiming to advance epilepsy research. A focus of this challenge, run internally in IBM, was the development of a platform that lowers the barrier of entry and therefore mitigates the risk of excluding interested parties from participating. The challenge: enabling wide participation With the goal to run a challenge that mobilises the largest possible pool of participants from IBM (global), we designed a use case around previous work in epileptic seizure prediction [3]. In this “Deep Learning Epilepsy Detection Challenge”, participants were asked to develop an automatic labelling system to reduce the time a clinician would need to diagnose patients with epilepsy. Labelled training and blind validation data for the challenge were generously provided by Temple University Hospital (TUH) [4]. TUH also devised a novel scoring metric for the detection of seizures that was used as basis for algorithm evaluation [5]. In order to provide an experience with a low barrier of entry, we designed a generalisable challenge platform under the following principles: 1. No participant should need to have in-depth knowledge of the specific domain. (i.e. no participant should need to be a neuroscientist or epileptologist.) 2. No participant should need to be an expert data scientist. 3. No participant should need more than basic programming knowledge. (i.e. no participant should need to learn how to process fringe data formats and stream data efficiently.) 4. No participant should need to provide their own computing resources. In addition to the above, our platform should further • guide participants through the entire process from sign-up to model submission, • facilitate collaboration, and • provide instant feedback to the participants through data visualisation and intermediate online leaderboards. The platform The architecture of the platform that was designed and developed is shown in Figure 1. The entire system consists of a number of interacting components. (1) A web portal serves as the entry point to challenge participation, providing challenge information, such as timelines and challenge rules, and scientific background. The portal also facilitated the formation of teams and provided participants with an intermediate leaderboard of submitted results and a final leaderboard at the end of the challenge. (2) IBM Watson Studio [6] is the umbrella term for a number of services offered by IBM. Upon creation of a user account through the web portal, an IBM Watson Studio account was automatically created for each participant that allowed users access to IBM's Data Science Experience (DSX), the analytics engine Watson Machine Learning (WML), and IBM's Cloud Object Storage (COS) [7], all of which will be described in more detail in further sections. (3) The user interface and starter kit were hosted on IBM's Data Science Experience platform (DSX) and formed the main component for designing and testing models during the challenge. DSX allows for real-time collaboration on shared notebooks between team members. A starter kit in the form of a Python notebook, supporting the popular deep learning libraries TensorFLow [8] and PyTorch [9], was provided to all teams to guide them through the challenge process. Upon instantiation, the starter kit loaded necessary python libraries and custom functions for the invisible integration with COS and WML. In dedicated spots in the notebook, participants could write custom pre-processing code, machine learning models, and post-processing algorithms. The starter kit provided instant feedback about participants' custom routines through data visualisations. Using the notebook only, teams were able to run the code on WML, making use of a compute cluster of IBM's resources. The starter kit also enabled submission of the final code to a data storage to which only the challenge team had access. (4) Watson Machine Learning provided access to shared compute resources (GPUs). Code was bundled up automatically in the starter kit and deployed to and run on WML. WML in turn had access to shared storage from which it requested recorded data and to which it stored the participant's code and trained models. (5) IBM's Cloud Object Storage held the data for this challenge. Using the starter kit, participants could investigate their results as well as data samples in order to better design custom algorithms. (6) Utility Functions were loaded into the starter kit at instantiation. This set of functions included code to pre-process data into a more common format, to optimise streaming through the use of the NutsFlow and NutsML libraries [10], and to provide seamless access to the all IBM services used. Not captured in the diagram is the final code evaluation, which was conducted in an automated way as soon as code was submitted though the starter kit, minimising the burden on the challenge organising team. Figure 1: High-level architecture of the challenge platform Measuring success The competitive phase of the "Deep Learning Epilepsy Detection Challenge" ran for 6 months. Twenty-five teams, with a total number of 87 scientists and software engineers from 14 global locations participated. All participants made use of the starter kit we provided and ran algorithms on IBM's infrastructure WML. Seven teams persisted until the end of the challenge and submitted final solutions. The best performing solutions reached seizure detection performances which allow to reduce hundred-fold the time eliptologists need to annotate continuous EEG recordings. Thus, we expect the developed algorithms to aid in the diagnosis of epilepsy by significantly shortening manual labelling time. Detailed results are currently in preparation for publication. Equally important to solving the scientific challenge, however, was to understand whether we managed to encourage participation from non-expert data scientists. Figure 2: Primary occupation as reported by challenge participants Out of the 40 participants for whom we have occupational information, 23 reported Data Science or AI as their main job description, 11 reported being a Software Engineer, and 2 people had expertise in Neuroscience. Figure 2 shows that participants had a variety of specialisations, including some that are in no way related to data science, software engineering, or neuroscience. No participant had deep knowledge and experience in data science, software engineering and neuroscience. Conclusion Given the growing complexity of data science problems and increasing dataset sizes, in order to solve these problems, it is imperative to enable collaboration between people with differences in expertise with a focus on inclusiveness and having a low barrier of entry. We designed, implemented, and tested a challenge platform to address exactly this. Using our platform, we ran a deep-learning challenge for epileptic seizure detection. 87 IBM employees from several business units including but not limited to IBM Research with a variety of skills, including sales and design, participated in this highly technical challenge. 
    more » « less
  2. An increasing number of Trusted Execution Environment (TEE) is adopting to a variety of commercial products for protecting data security on the cloud. However, TEEs are still exposed to various side-channel vulnerabilities, such as execution order-based, timing-based, and power-based vulnerabilities. While recent hardware is applying various techniques to mitigate order-based and timing-based side-channel vulnerabilities, power-based side-channel attacks remain a concern of hardware security, especially for the confidential computing settings where the server machines are beyond the control of cloud users. In this paper, we present PWRLEAK, an attack framework that exploits AMD’s power reporting interfaces to build power side-channel attacks against AMD Secure Encrypted Virtualization (SEV)-protected VM. We design and implement the attack framework with three general steps: (1) identify the instruction running inside AMD SEV, (2) apply a power interpolator to amplify power consumption, including an emulation-based interpolator for analyzing purposes and a moregeneral interrupt-based interpolator, and (3) infer secrets with various analysis approaches. A case study of using the emulation-based interpolator to infer the whole JPEG images processed by libjpeg demonstrates its ability to help analyze power consumption inside SEV VM. Our end-to-end attacks against Intel’s Integrated Performance Primitives (Intel IPP) library indicates that PWRLEAK can be exploited to infer RSA private keys with over 80% accuracy using the interrupt based interpolator. 
    more » « less
  3. This paper explores the security of a single-stage residue number system (RNS) pseudorandom number generator (PRNG), which has previously been shown to provide extremely high-quality outputs when evaluated through available RNG statistical test suites or in using Shannon and single-stage Kolmogorov entropy metrics. In contrast, rather than blindly performing statistical analyses on the outputs of the single-stage RNS PRNG, this paper provides both white box and black box analyses that facilitate reverse engineering of the underlying RNS number generation algorithm to obtain the residues, or equivalently key, of the RNS algorithm. We develop and demonstrate a conditional entropy analysis that permits extraction of the key given a priori knowledge of state transitions as well as reverse engineering of the RNS PRNG algorithm and parameters (but not the key) in problems where the multiplicative RNS characteristic is too large to obtain a priori state transitions. We then discuss multiple defenses and perturbations for the RNS system that fool the original attack algorithm, including deliberate noise injection and code hopping. We present a modification to the algorithm that accounts for deliberate noise, but rapidly increases the search space and complexity. Lastly, we discuss memory requirements and time required for the attacker and defender to maintain these defenses. 
    more » « less
  4. Memory-hard functions (MHFs) are a key cryptographic primitive underlying the design of moderately expensive password hashing algorithms and egalitarian proofs of work. Over the past few years several increasingly stringent goals for an MHF have been proposed including the requirement that the MHF have high sequential space-time (ST) complexity, parallel space-time complexity, amortized area-time (aAT) complexity and sustained space complexity. Data-Independent Memory Hard Functions (iMHFs) are of special interest in the context of password hashing as they naturally resist side-channel attacks. iMHFs can be specified using a directed acyclic graph (DAG) $G$ with $N=2^n$ nodes and low indegree and the complexity of the iMHF can be analyzed using a pebbling game. Recently, Alwen et al. [CCS'17] constructed an DAG called DRSample which has aAT complexity at least $\Omega\left( N^2/\log N\right)$. Asymptotically DRSample outperformed all prior iMHF constructions including Argon2i, winner of the password hashing competition (aAT cost $\mathcal{O}\left(N^{1.767}\right)$), though the constants in these bounds are poorly understood. We show that the the greedy pebbling strategy of Boneh et al. [ASIACRYPT'16] is particularly effective against DRSample e.g., the aAT cost is $\mathcal{O}\left( N^2/\log N\right)$. In fact, our empirical analysis {\em reverses} the prior conclusion of Alwen et al. that DRSample provides stronger resistance to known pebbling attacks for practical values of $N \leq 2^{24}$. We construct a new iMHF candidate (DRSample+BRG) by using the bit-reversal graph to extend DRSample. We then prove that the construction is asymptotically optimal under every MHF criteria, and we empirically demonstrate that our iMHF provides the best resistance to {\em known} pebbling attacks. For example, we show that any parallel pebbling attack either has aAT cost $\omega(N^2)$ or requires at least $\Omega(N)$ steps with $\Omega(N/\log N)$ pebbles on the DAG. This makes our construction the first practical iMHF with a strong sustained space-complexity guarantee and immediately implies that any parallel pebbling has aAT complexity $\Omega(N^2/\log N)$. We also prove that any sequential pebbling (including the greedy pebbling attack) has aAT cost $\Omega\left( N^2\right)$ and, if a plausible conjecture holds, any parallel pebbling has aAT cost $\Omega(N^2 \log \log N/\log N)$ --- the best possible bound for an iMHF. We implement our new iMHF and demonstrate that it is just as fast as Argon2. Along the way we propose a simple modification to the Argon2 round function which increases an attacker's aAT cost by nearly an order of magnitude without increasing running time on a CPU. Finally, we give a pebbling reduction which proves that in the parallel random oracle model (PROM) the cost of evaluating an iMHF like Argon2i or DRSample+BRG is given by the pebbling cost of the underlying DAG. Prior pebbling reductions assumed that the iMHF round function concatenates input labels before hashing and did not apply to practical iMHFs such as Argon2i, DRSample or DRSample+BRG where input labels are instead XORed together. 
    more » « less
  5. Obeid, I. ; Selesnik, I. ; Picone, J. (Ed.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment – performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is a difficult task to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining an historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations. The overall impact of all of these issues described above is significant as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but that is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at max capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, it adds an element of randomness into the experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA’s cuDNN implementation provides algorithms that increase the performance and help the model train quicker, but they are non-deterministic algorithms [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it gives us a hint of how our model is performing per experiment, and if the changes we make are efficient. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data from the last experiment and to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments and introduce new ways to manage such computational issues. 
    more » « less