The Micron Automata Processor (AP) is a novel co-processor accelerator that supports the parallel execution of multiple Nondeterministic Finite Automata (NFA) programmed directly into hardware over a single data-stream. In this paper, we present a number of programming techniques to develop automata that execute efficiently on this processor. First, we present general techniques to transform NFAs defined in their classical representation to the representation used by the AP, and optimize the same. Then, we present automata development techniques using simple but powerful generic building blocks. All the above techniques are generic in nature and can be useful to application developers working on this new upcoming co-processor architecture.
more »
« less
High Performance Pattern Matching Using the Automata Processor
In this paper, we study the acceleration of applications that require searching for all occurrences of thousands of string-patterns in an input data-stream, using the Automata Processor (AP). For this purpose, we use two applications from two fields, namely, network security and bioinformatics. The first application, called Fast-SNAP (for Fast-SNort using AP), scans network data for 4312 signatures of intrusion derived from the popular open-source Snort database. Using the resources of a single AP board, Fast-SNAP can scan for all these signatures at 10.3 Gbps. The second application, called PROTOMATA (for PROTein autOMATA), looks for all occurrences of 1308 protein motifs from the PROSITE database in protein sequences. PROTOMATA is up to half a million times faster than its single-CPU-based counterpart. The techniques developed to program these applications may be useful in the design and development of similar applications using this new hardware accelerator.
more »
« less
- Award ID(s):
- 1448333
- PAR ID:
- 10124162
- Date Published:
- Journal Name:
- 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The Automata Processor was designed for string-pattern matching. In this paper, we showcase its use to execute integer and floating-point comparisons and apply the same to accelerate interval stabbing queries. An interval stabbing query determines which of the intervals in a set overlap a query point. Such queries are often used in computational geometry, pattern matching, database management systems, and geographic information systems. The check for each interval is programmed as a single automaton and multiple automata are executed in parallel to provide significant performance gains. While handling 32-bit integers or single-precision floating-point numbers, up to 2.75 trillion comparisons can be executed per second, whereas 0.79 trillion comparisons per second can be completed for 64-bit integers or double-precision floating-point numbers. Additionally, our solution leaves the intervals in the set unordered; allowing addition or deletion of an interval in constant time. This is not possible for contemporary solutions wherein the intervals are ordered, making the query times faster, but making the updating of intervals complex. Our automata designs exemplify techniques that maximize resource utilization and minimize performance bottlenecks, which may be useful to future application developers on this processor. Their modular design allows them to become constituent parts of larger automata, where the numerical comparisons are part of the overall pattern matching operation. We have validated the designs on hardware, and the routines to generate the necessary automata and execute them on the AP will be made available as software libraries shortly.more » « less
-
Finite State Automata are widely used to accelerate pattern matching in many emerging application domains like DNA sequencing and XML parsing. Conventional CPUs and compute-centric accelerators are bottlenecked by memory bandwidth and irregular memory access patterns in automata processing. We present Cache Automaton, which repurposes last-level cache for automata processing, and a compiler that automates the process of mapping large real world Non-Deterministic Finite Automata (NFAs) to the proposed architecture. Cache Automaton extends a conventional last-level cache architecture with components to accelerate two phases in NFA processing: state-match and state-transition. State-matching is made efficient using a sense-amplifier cycling technique that exploits spatial locality in symbol matches. State-transition is made efficient using a new compact switch architecture. By overlapping these two phases for adjacent symbols we realize an efficient pipelined design. We evaluate two designs, one optimized for performance and the other optimized for space, across a set of 20 diverse benchmarks. The performance optimized design provides a speedup of 15× over DRAM-based Micron's Automata Processor and 3840× speedup over processing in a conventional x86 CPU. The proposed design utilizes on an average 1.2MB of cache space across benchmarks, while consuming 2.3nJ of energy per input symbol. Our space optimized design can reduce the cache utilization to 0.72MB, while still providing a speedup of 9× over AP.more » « less
-
Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1)~input stream level, 2)~automaton-level and 3)~state-level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup.more » « less
-
Abstract Living in nutrient-poor environments, the carnivorous Venus flytrapDionaea muscipulacaptures animal prey to compensate for this deficiency. Stimulation of trigger hairs located on the inner trap surface elicits an action potential (AP). While two consecutive APs result in fast trap closure in wildtype (WT) plants, sustained AP generation by the insect struggling to escape the trap leads to jasmonic acid (JA) biosynthesis, formation of the digestive “stomach”, and release of enzymes needed to decompose the victim. TheDionaea muscipulaDYSCALCULIA (DYSC) mutant is able to fire touch-induced APs, but unlike WT plants, it does not snap-close its traps after two consecutive APs. Moreover, DYSC plants fail to properly initiate the JA pathway in response to mechanostimulation and even wounding, a well-known JA-dependent process conserved among plants. As demonstrated in previous studies, this DYSC mutant defect is associated with impaired decoding of mechanostimulation (i.e. touch) -induced Ca2+signals. External JA application to the trap, however, restores slow trap closure and digestive gland function in DYSC, while rapid trap closure is JA-independent and cannot be rescued by exogenous JA application. Higher frequency mechanostimulation and thus more APs, however, revealed that DYSC is still able to close its traps, albeit much slower than WT plants. To reveal the molecular underpinnings of DYSC’s delayed trap movement, we generated a chromosome-scaleDionaeagenome assembly and profiled gene expression. The refined transcriptomic analysis uncovered widespread misregulation of cell wall-related genes in DYSC, implicating altered cell wall plasticity in the sluggish mutant. Cell indentation studies by atomic force microscopy revealed a strictly localized and strikingly enhanced stiffening of the cell wall for DYSC that may hinder rapid trap closure and snap buckling. Together, these genomic, transcriptomic, and biophysical data identify cell wall elasticity as a key constraint on voltage and Ca2+dependent trap kinetics. This finding documents the interrelationship between mechanosensing and Ca2+signaling in the ultrafast capture organ of the Venus flytrap.more » « less
An official website of the United States government

