Title: Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture
Applications in which continuous streams of data are passed through large data structures are of increasing importance. However, their execution on conventional architectures, especially when parallelism is employed to boost performance, is highly inefficient. The primary issue is the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on prior work on the Firehose streaming benchmark, where an emerging architecture using threads that can migrate through memory was shown to be much more efficient at such problems. This paper extends that work to a second-generation system, showing not only the same improved efficiency (~10X) at larger core counts, but also significantly higher raw performance (with FPGA-based cores running at 1/10th the clock rate of conventional systems). Further, this additional data yields insight into which resources represent the bottlenecks to even higher performance, and supports a reasonable projection that implementing such an architecture in current technology would yield a 10X performance gain on an apples-to-apples basis with conventional systems.
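To make the core access pattern concrete, below is a minimal single-process Python sketch (with a hypothetical node count and key stream, not the benchmark's code) of streaming updates through a hash table partitioned across nodes. On a conventional cluster each update implies a network round trip to the owning node; a migrating-thread machine instead moves the lightweight thread context to the node holding the shard and updates it locally.

```python
import zlib

NUM_NODES = 8  # hypothetical number of nodes holding shards

# Each "node" owns one shard of a single large logical hash table.
shards = [dict() for _ in range(NUM_NODES)]

def owner(key: bytes) -> int:
    # Hash-partition the key space across nodes.
    return zlib.crc32(key) % NUM_NODES

def process(key: bytes, value: int) -> None:
    # On a conventional cluster this step is a message round trip to the
    # owning node; on a migrating-thread machine the thread itself moves
    # to that node and performs the update locally.
    shard = shards[owner(key)]
    shard[key] = shard.get(key, 0) + value

# Stand-in for the incoming stream of disparate data items.
for i in range(1000):
    process(f"key{i % 37}".encode(), 1)
```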
Award ID(s):
1822939
NSF-PAR ID:
10402606
Author(s) / Creator(s):
Date Published:
Journal Name:
2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)
Page Range / eLocation ID:
11 to 18
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Overcoming the conventional trade-off between throughput and bit error rate (BER) performance on the one hand, and computational complexity on the other, is a long-term challenge for uplink Multiple-Input Multiple-Output (MIMO) detection in base station design for the cellular 5G New Radio roadmap, as well as in next-generation wireless local area networks. In this work, we present ParaMax, a MIMO detector architecture that for the first time brings physics-inspired parallel tempering algorithmic techniques [28, 50, 67] to bear on this class of problems. ParaMax can achieve near-optimal maximum-likelihood (ML) throughput performance in the Large MIMO regime: Massive MIMO systems where the base station has additional RF chains, approaching the number of base station antennas, in order to support even more parallel spatial streams. ParaMax achieves near-ML BER performance in up to 160×160 and 80×80 Large MIMO for low-order modulations such as BPSK and QPSK, respectively, while requiring fewer than tens of processing elements. With respect to Massive MIMO systems, in 12×24 MIMO with 16-QAM at SNR 16 dB, ParaMax achieves 330 Mbits/s near-optimal system throughput with 4–8 processing elements per subcarrier, approximately 1.4× the throughput of linear-detector-based Massive MIMO systems.
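    As a rough illustration of the parallel tempering idea behind ParaMax (a toy real-valued BPSK sketch under assumed sizes and temperature schedules, not the paper's implementation), the following Python code runs several Metropolis replicas at different inverse temperatures over candidate symbol vectors, minimizing the ML objective ||y − Hx||², with periodic replica exchanges between neighboring temperatures.

```python
import numpy as np

def ml_energy(H, y, x):
    # Maximum-likelihood objective: squared residual norm ||y - Hx||^2.
    r = y - H @ x
    return float(r @ r)

def parallel_tempering_detect(H, y, n_replicas=8, n_sweeps=300, seed=0):
    rng = np.random.default_rng(seed)
    n = H.shape[1]
    betas = np.geomspace(0.2, 5.0, n_replicas)          # inverse temperatures
    X = rng.choice([-1.0, 1.0], size=(n_replicas, n))   # random BPSK starts
    E = np.array([ml_energy(H, y, x) for x in X])
    best_x, best_e = X[E.argmin()].copy(), E.min()
    for _ in range(n_sweeps):
        for r in range(n_replicas):                     # Metropolis sweep
            i = rng.integers(n)
            x_new = X[r].copy()
            x_new[i] = -x_new[i]                        # single-symbol flip
            e_new = ml_energy(H, y, x_new)
            if e_new <= E[r] or rng.random() < np.exp(-betas[r] * (e_new - E[r])):
                X[r], E[r] = x_new, e_new
                if e_new < best_e:
                    best_x, best_e = x_new.copy(), e_new
        for r in range(n_replicas - 1):                 # replica exchange
            delta = (betas[r + 1] - betas[r]) * (E[r + 1] - E[r])
            if delta >= 0 or rng.random() < np.exp(delta):
                X[[r, r + 1]] = X[[r + 1, r]]
                E[[r, r + 1]] = E[[r + 1, r]]
    return best_x

# Toy 16x16 BPSK instance with additive Gaussian noise.
rng = np.random.default_rng(1)
H = rng.normal(size=(16, 16))
x_true = rng.choice([-1.0, 1.0], size=16)
y = H @ x_true + 0.1 * rng.normal(size=16)
print((parallel_tempering_detect(H, y) == x_true).mean())
```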
  2.
    Current, near-term quantum devices have shown great progress in the last several years, culminating recently with a demonstration of quantum supremacy. In the medium term, however, quantum machines will need to transition to greater reliability through error correction, likely through promising techniques like surface codes, which are well suited for near-term devices with limited qubit connectivity. We discover that quantum memory, particularly resonant cavities with transmon qubits arranged in a 2.5D architecture, can efficiently implement surface codes with substantial hardware savings and performance/fidelity gains. Specifically, we virtualize logical qubits by storing them in layers of qubit memories connected to each transmon. Surprisingly, distributing each logical qubit across many memories has a minimal impact on fault tolerance and results in substantially more efficient operations. Our design permits fast transversal application of CNOT operations between logical qubits sharing the same physical address (the same set of cavities), which are 6x faster than standard lattice surgery CNOTs. We develop a novel embedding which saves approximately 10x in transmons, with another 2x savings from an additional optimization for compactness. Although qubit virtualization pays a 10x penalty in serialization, advantages in the transversal CNOT and in area efficiency result in fault tolerance and performance comparable to conventional 2D transmon-only architectures. Our simulations show our system can achieve fault tolerance comparable to conventional two-dimensional grids while saving substantial hardware. Furthermore, our architecture can produce magic states at 1.22x the baseline rate given a fixed number of transmon qubits. This is a critical benchmark for future fault-tolerant quantum computers, as magic states are essential and machines will spend the majority of their resources continuously producing them. This architecture substantially reduces the hardware requirements for fault-tolerant quantum computing and puts within reach a proof-of-concept experimental demonstration of around 10 logical qubits, requiring only 11 transmons and 9 attached cavities in total.
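    The addressing scheme can be illustrated with a toy model (hypothetical names and layout, not the authors' code): each transmon's attached multi-mode cavity stores one qubit per memory layer, a logical qubit's patch occupies one layer across a fixed set of cavities, and two logical qubits residing in different layers of the same cavities "share a physical address" and thus qualify for the fast transversal CNOT.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PhysicalAddress:
    cavity: int   # which transmon-attached cavity
    layer: int    # which memory mode (layer) within that cavity

def logical_qubit_address(patch_cavities, layer):
    # A logical qubit's surface-code patch is spread across one memory
    # layer of a fixed set of cavities (hypothetical layout).
    return frozenset(PhysicalAddress(c, layer) for c in patch_cavities)

def shares_physical_address(lq_a, lq_b):
    # Two logical qubits "share the same physical address" when their
    # patches occupy the same cavities in different layers; the fast
    # transversal CNOT described above applies to exactly these pairs.
    return ({a.cavity for a in lq_a} == {b.cavity for b in lq_b}
            and {a.layer for a in lq_a} != {b.layer for b in lq_b})

patch = range(9)    # 9 cavities for a toy distance-3 patch
q0 = logical_qubit_address(patch, layer=0)
q1 = logical_qubit_address(patch, layer=1)
assert shares_physical_address(q0, q1)
```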
  3. Obeid, I.; Selesnik, I.; Picone, J. (Eds.)
    The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors ranging from low-end consumer grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment: performance metrics such as error rates should be identical and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) The same job run on the same processor should produce the same results each time it is run. (2) A job run on a CPU and GPU should produce identical results. (3) A job should produce comparable results if the data is presented in a different order. System optimization requires an ability to directly compare error rates for algorithms evaluated under comparable operating conditions. However, it is difficult to exactly reproduce the results for large, complex deep learning systems that often require more than a trillion calculations per experiment [5]. This is a fairly well-known issue and one we will explore in this poster. Researchers must be able to replicate results on a specific data set to establish the integrity of an implementation. They can then use that implementation as a baseline for comparison purposes. A lack of reproducibility makes it very difficult to debug algorithms and validate changes to the system. Equally important, since many results in deep learning research are dependent on the order in which the system is exposed to the data, the specific processors used, and even the order in which those processors are accessed, it becomes a challenging problem to compare two algorithms since each system must be individually optimized for a specific data set or processor. This is extremely time-consuming for algorithm research in which a single run often taxes a computing environment to its limits. Well-known techniques such as cross-validation [5,6] can be used to mitigate these effects, but this is also computationally expensive. These issues are further compounded by the fact that most deep learning algorithms are susceptible to the way computational noise propagates through the system. GPUs are particularly notorious for this because, in a clustered environment, it becomes more difficult to control which processors are used at various points in time. Another equally frustrating issue is that upgrades to the deep learning package, such as the transition from TensorFlow v1.9 to v1.13, can also result in large fluctuations in error rates when re-running the same experiment. Since TensorFlow is constantly updating functions to support GPU use, maintaining a historical archive of experimental results that can be used to calibrate algorithm research is quite a challenge. This makes it very difficult to optimize the system or select the best configurations.
The overall impact of all of these issues described above is significant, as error rates can fluctuate by as much as 25% due to these types of computational issues. Cross-validation is one technique used to mitigate this, but it is expensive since you need to do multiple runs over the data, which further taxes a computing infrastructure already running at maximum capacity. GPUs are preferred when training a large network since these systems train at least two orders of magnitude faster than CPUs [7]. Large-scale experiments are simply not feasible without using GPUs. However, there is a tradeoff to gain this performance. Since all our GPUs use the NVIDIA CUDA® Deep Neural Network library (cuDNN) [8], a GPU-accelerated library of primitives for deep neural networks, this adds an element of randomness to each experiment. When a GPU is used to train a network in TensorFlow, it automatically searches for a cuDNN implementation. NVIDIA's cuDNN implementation provides algorithms that increase performance and help the model train more quickly, but these algorithms are non-deterministic [9,10]. Since our networks have many complex layers, there is no easy way to avoid this randomness. Instead of comparing each epoch, we compare the average performance of the experiment because it indicates how our model is performing per experiment and whether the changes we make are effective. In this poster, we will discuss a variety of issues related to reproducibility and introduce ways we mitigate these effects. For example, TensorFlow uses a random number generator (RNG) which is not seeded by default. TensorFlow determines the initialization point and how certain functions execute using the RNG. The solution for this is seeding all the necessary components before training the model. This forces TensorFlow to use the same initialization point and sets how certain layers work (e.g., dropout layers). However, seeding all the RNGs will not guarantee a controlled experiment. Other variables can affect the outcome of the experiment such as training using GPUs, allowing multi-threading on CPUs, using certain layers, etc. To mitigate our problems with reproducibility, we first make sure that the data is processed in the same order during training. Therefore, we save the data ordering from the last experiment to make sure the newer experiment follows the same order. If we allow the data to be shuffled, it can affect the performance due to how the model was exposed to the data. We also specify the float data type to be 32-bit since Python defaults to 64-bit. We try to avoid using 64-bit precision because the numbers produced by a GPU can vary significantly depending on the GPU architecture [11-13]. Controlling precision somewhat reduces differences due to computational noise even though technically it increases the amount of computational noise. We are currently developing more advanced techniques for preserving the efficiency of our training process while also maintaining the ability to reproduce models. In our poster presentation we will demonstrate these issues using some novel visualization tools, present several examples of the extent to which these issues influence research results on electroencephalography (EEG) and digital pathology experiments, and introduce new ways to manage such computational issues.
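As a concrete sketch of the seeding discipline described above (TF2-style calls; in the TensorFlow 1.x versions mentioned, the analogous call is tf.set_random_seed), one might pin all RNGs, the float width, and the data order before building the model:

```python
import os
import random

import numpy as np

SEED = 1337                                  # arbitrary fixed seed
# PYTHONHASHSEED only affects str hashing if set before interpreter start.
os.environ["PYTHONHASHSEED"] = str(SEED)
os.environ["TF_DETERMINISTIC_OPS"] = "1"     # prefer deterministic cuDNN kernels (newer TF)
random.seed(SEED)
np.random.seed(SEED)

import tensorflow as tf                      # imported after env vars are set
tf.random.set_seed(SEED)                     # TF2 global op/graph seed
tf.keras.backend.set_floatx("float32")       # pin 32-bit precision, as discussed

# Fix the data order as well: shuffle once with a seeded RNG and reuse it.
order = np.random.permutation(10_000)        # hypothetical training-set indices
```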
  4.
    Applications where streams of data are passed through large data structures are becoming of increasing importance. For instance, network intrusion detection and cyber security as a whole rely on real-time analysis of network traffic. Unfortunately, when implemented on conventional architectures such applications become highly inefficient, especially when attempts are made to scale up performance via some form of parallelism. An earlier paper discussed streaming anomaly detection within a stream having an unbounded range of keys on the Lucata migrating thread architecture. In this paper we introduce Deluge, a new implementation that addresses several inadequacies of previous designs and more directly targets the hardware efficiencies inherent to migratory execution within a PGAS address space. Deluge achieves major improvements in hardware efficiency, throughput, and scalability over previous implementations.
  5. Remote health monitoring is a powerful tool for providing preventive care and early intervention to at-risk populations. Such monitoring systems are becoming available nowadays thanks to recent advancements in Internet-of-Things (IoT) paradigms, enabling ubiquitous monitoring. These systems require a high level of quality in attributes such as availability and accuracy due to patients' critical conditions during monitoring. Deep learning methods are very promising in such health applications for obtaining satisfactory performance where a considerable amount of data is available. These methods are well suited to the cloud servers of a centralized cloud-based IoT system. However, the response time and availability of these systems depend heavily on the quality of the Internet connection. On the other hand, smart gateway devices are unable to implement deep learning methods (such as training models) due to their limited computational capacity. In our previous work, we proposed a hierarchical computing architecture (HiCH) in which both edge and cloud computing resources are efficiently exploited, allocating the heavy tasks of a conventional machine learning method to the cloud servers and outsourcing the hypothesis function to the edge. Thanks to this local decision making, the availability of the system was greatly improved. In this paper, we investigate the feasibility of deploying a Convolutional Neural Network (CNN) based classification model, as an example of deep learning methods, in this architecture. The system thus benefits from the features of both HiCH and the CNN, ensuring high availability and accuracy. We demonstrate real-time health monitoring in a case study on ECG classification and evaluate the performance of the system in terms of response time and accuracy.
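    For illustration, a minimal stand-in for an edge-deployable 1D CNN ECG classifier might look as follows (not the paper's network; window length, layer sizes, and class count are hypothetical):

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical input: 1-second ECG windows sampled at 360 Hz, 5 beat classes.
model = tf.keras.Sequential([
    layers.Input(shape=(360, 1)),
    layers.Conv1D(16, kernel_size=7, activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Conv1D(32, kernel_size=5, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

    Trained in the cloud, a model of this size could then be shipped to the gateway (e.g., after conversion with tf.lite.TFLiteConverter.from_keras_model) so that classification decisions stay local, in the spirit of HiCH.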