skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Cache automaton
Finite State Automata are widely used to accelerate pattern matching in many emerging application domains like DNA sequencing and XML parsing. Conventional CPUs and compute-centric accelerators are bottlenecked by memory bandwidth and irregular memory access patterns in automata processing. We present Cache Automaton, which repurposes last-level cache for automata processing, and a compiler that automates the process of mapping large real world Non-Deterministic Finite Automata (NFAs) to the proposed architecture. Cache Automaton extends a conventional last-level cache architecture with components to accelerate two phases in NFA processing: state-match and state-transition. State-matching is made efficient using a sense-amplifier cycling technique that exploits spatial locality in symbol matches. State-transition is made efficient using a new compact switch architecture. By overlapping these two phases for adjacent symbols we realize an efficient pipelined design. We evaluate two designs, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks. The performance optimized design provides a speedup of 15× over DRAM-based Micron's Automata Processor and 3840× speedup over processing in a conventional x86 CPU. The proposed design utilizes on an average 1.2MB of cache space across benchmarks, while consuming 2.3nJ of energy per input symbol. Our space optimized design can reduce the cache utilization to 0.72MB, while still providing a speedup of 9× over AP.  more » « less
Award ID(s):
1652294
PAR ID:
10047787
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
MICRO-50 '17 Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture
Page Range / eLocation ID:
259 to 272
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1)~input stream level, 2)~automaton-level and 3)~state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly, and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup. 
    more » « less
  2. Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the issues aforementioned, we proposeEMS-i, an efficient memory system design that integrates Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions,EMS-iachieves up to 10.9× speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings.EMS-ialso saves up to 8.7× and 6.6 × memory cost w.r.t. RecSSD and RecNMP, respectively. 
    more » « less
  3. null (Ed.)
    The Reservoir Computing, a neural computing framework suited for temporal information processing, utilizes a dynamic reservoir layer for high-dimensional encoding, enhancing the separability of the network. In this paper, we exploit a Deep Learning (DL)-based detection strategy for Multiple-input, Multiple-output Orthogonal Frequency-Division Multiplexing (MIMO-OFDM) symbol detection. To be specific, we introduce a Deep Echo State Network (DESN), a unique hierarchical processing structure with multiple time intervals, to enhance the memory capacity and accelerate the detection efficiency. The resulting hardware prototype with the hybrid memristor-CMOS co-design provides in-memory computing and parallel processing capabilities, significantly reducing the hardware and power overhead. With the standard 180nm CMOS process and memristive synapses, the introduced DESN consumes merely 105mW of power consumption, exhibiting 16.7% power reduction compared to shallow ESN designs even with more dynamic layers and associated neurons. Furthermore, numerical evaluations demonstrate the advantages of the DESN over state-of-the-art detection techniques in the literate for MIMO-OFDM systems even with a very limited training set, yielding a 47.8% improvement against conventional symbol detection techniques. 
    more » « less
  4. Real-time decision making in IoT applications relies upon space-efficient evaluation of queries over streaming data. To model the uncertainty in the classification of data being processed, we consider the model of probabilistic strings --- sequences of discrete probability distributions over a finite set of events, and initiate the study of space complexity of streaming computation for different classes of queries over such probabilistic strings. We first consider the problem of computing the probability that a word, sampled from the distribution defined by the probabilistic string read so far, is accepted by a given deterministic finite automaton. We show that this regular pattern matching problem can be solved using space that is only poly-logarithmic in the string length (and polynomial in the size of the DFA) if we are allowed a multiplicative approximation error. Then we show how to generalize this result to quantitative queries specified by additive cost register automata --- these are automata that map strings to numerical values using finite control and registers that get updated using linear transformations. Finally, we consider the case when updates in such an automaton involve tests, and in particular, when there is a counter variable that can be either incremented or decremented but decrements only apply when the counter value is non-zero. In this case, the desired answer depends on the probability distribution over the set of possible counter values that can range from 0 to n for a string of length n. Under a mild assumption, namely probabilities of the individual events are bounded away from 0 and 1, we show that there is an algorithm that can compute all n entries of this probability distribution vector to within additive 1/poly(n) error using space that is only Õ(n). In establishing these results, we introduce several new technical ideas that may prove useful for designing space-efficient algorithms for other query models over probabilistic strings. 
    more » « less
  5. null (Ed.)
    As multicore systems continue to grow in scale and on-chip memory capacity, the on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache line granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory access can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to caches. We evaluate using a cycle-level execution-driven gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedup over an inorder and OOO core with state of art prefetcher design respectively, with 64% and 49% energy efficiency advantage. 
    more » « less