skip to main content


This content will become publicly available on July 13, 2024

Title: Fast Falcon Signature Generation and Verification Using ARMv8 NEON Instructions
We present our speed records for Falcon signature generation and verification on ARMv8-A architecture. Our implementations are benchmarked on Apple M1 ‘Firestorm’, Raspberry Pi 4 Cortex-A72, and Jetson AGX Xavier. Our optimized signature generation is 2x slower, but signature verification is 3–3.9x faster than the state-of-the-art CRYSTALS-Dilithium implementation on the same platforms. Faster signature verification may be particularly useful for the client side on con-strained devices. Our Falcon implementation outperforms the previous work targeting Jetson AGX Xavier by the factors 1.48x for signing in falcon512 and falcon1024, 1.52x for verifying in falcon512, and 1.70x for verifying in falcon1024. We achieve improvement in Falcon signature generation by supporting a larger subset of possible parameter values for FFT-related functions and applying our compressed twiddle-factor table to reduce memory usage. We also demonstrate that the recently proposed signature scheme Hawk, sharing optimized functionality with Falcon, has 3.3x faster signature generation and 1.6–1.9x slower signature verification when implemented on the same ARMv8 processors as Falcon.  more » « less
Award ID(s):
1801512
NSF-PAR ID:
10438497
Author(s) / Creator(s):
;
Editor(s):
El Mrabet, N.; De Feo, L.; Duquesne, S.
Date Published:
Journal Name:
14th International Conference on Cryptology, AFRICACRYPT 2023
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Efficient and adaptive computer vision systems have been proposed to make computer vision tasks, such as image classification and object detection, optimized for embedded or mobile devices. These solutions, quite recent in their origin, focus on optimizing the model (a deep neural network, DNN) or the system by designing an adaptive system with approximation knobs. Despite several recent efforts, we show that existing solutions suffer from two major drawbacks. First , while mobile devices or systems-on-chips (SOCs) usually come with limited resources including battery power, most systems do not consider the energy consumption of the models during inference. Second , they do not consider the interplay between the three metrics of interest in their configurations, namely, latency, accuracy, and energy. In this work, we propose an efficient and adaptive video object detection system — Virtuoso , which is jointly optimized for accuracy, energy efficiency, and latency. Underlying Virtuoso is a multi-branch execution kernel that is capable of running at different operating points in the accuracy-energy-latency axes, and a lightweight runtime scheduler to select the best fit execution branch to satisfy the user requirement. We position this work as a first step in understanding the suitability of various object detection kernels on embedded boards in the accuracy-latency-energy axes, opening the door for further development in solutions customized to embedded systems and for benchmarking such solutions. Virtuoso is able to achieve up to 286 FPS on the NVIDIA Jetson AGX Xavier board, which is up to 45 times faster than the baseline EfficientDet D3 and 15 times faster than the baseline EfficientDet D0. In addition, we also observe up to 97.2% energy reduction using Virtuoso compared to the baseline YOLO (v3) — a widely used object detector designed for mobiles. To fairly compare with Virtuoso , we benchmark 15 state-of-the-art or widely used protocols, including Faster R-CNN (FRCNN) [NeurIPS’15], YOLO v3 [CVPR’16], SSD [ECCV’16], EfficientDet [CVPR’20], SELSA [ICCV’19], MEGA [CVPR’20], REPP [IROS’20], FastAdapt [EMDL’21], and our in-house adaptive variants of FRCNN+, YOLO+, SSD+, and EfficientDet+ (our variants have enhanced efficiency for mobiles). With this comprehensive benchmark, Virtuoso has shown superiority to all the above protocols, leading the accuracy frontier at every efficiency level on NVIDIA Jetson mobile GPUs. Specifically, Virtuoso has achieved an accuracy of 63.9%, which is more than 10% higher than some of the popular object detection models, FRCNN at 51.1%, and YOLO at 49.5%. 
    more » « less
  2. Recently, point cloud (PC) has gained popularity in modeling various 3D objects (including both synthetic and real-life) and has been extensively utilized in a wide range of applications such as AR/VR, 3D reconstruction, and autonomous driving. For such applications, it is critical to analyze/understand the surrounding scenes properly. To achieve this, deep learning based methods (e.g., convolutional neural networks (CNNs)) have been widely employed for higher accuracy. Unlike the deep learning on conventional 2D images/videos, where the feature computation (matrix multiplication) is the major bottleneck, in point cloud-based CNNs, the sample and neighbor search stages are the primary bottlenecks, and collectively contribute to 54% (up to 80%) of the overall execution latency on a typical edge device. While prior efforts have attempted to solve this issue by designing custom ASICs or pipelining the neighbor search with other stages, to our knowledge, none of them has tried to “structurize” the unstructured PC data for improving computational efficiency. In this paper, we first explore the opportunities of structurizing PC data using Morton code (which is originally designed to map data from a high dimensional space to one dimension, while preserving spatial locality) and observe that there is a huge scope to “skip” the sample and neighbor search computation by operating on the “structurized” PC data. Based on this, we propose two approximation techniques for the sampling and neighbor search stages. We implemented our proposals on an NVIDIA Jetson AGX Xavier edge GPU board. The evaluation results collected on six different workloads show that our design can accelerate the sample and neighbor search stages by 3.68× (up to 5.21×) with minimal impact on inference accuracy. This acceleration in turn results in 1.55× speedup in the end-to-end execution latency and saves 33% of energy expenditure. 
    more » « less
  3. As Point Clouds (PCs) gain popularity in processing millions of data points for 3D rendering in many applications, efficient data compression becomes a critical issue. This is because compression is the primary bottleneck in minimizing the latency and energy consumption of existing PC pipelines. Data compression becomes even more critical as PC processing is pushed to edge devices with limited compute and power budgets. In this paper, we propose and evaluate two complementary schemes, intra-frame compression and inter-frame compression, to speed up the PC compression, without losing much quality or compression efficiency. Unlike existing techniques that use sequential algorithms, our first design, intra-frame compression, exploits parallelism for boosting the performance of both geometry and attribute compression. The proposed parallelism brings around 43.7× performance improvement and 96.6% energy savings at a cost of 1.01× larger compressed data size. To further improve the compression efficiency, our second scheme, inter-frame compression, considers the temporal similarity among the video frames and reuses the attribute data from the previous frame for the current frame. We implement our designs on an NVIDIA Jetson AGX Xavier edge GPU board. Experimental results with six videos show that the combined compression schemes provide 34.0× speedup compared to a state-of-the-art scheme, with minimal impact on quality and compression ratio. 
    more » « less
  4. With the burgeoning of autonomous driving, the edgedeployed integrated CPU/GPU (iGPU) platform gains significant attention from both academia and industries. NVIDIA issues a series of Jetson iGPU platforms that perform well in terms of computation capability, power consumption, and mobile size. However, these iGPU platforms typically contain very limited physical memory, which could be the bottleneck of these autonomous driving and edge computing applications. Although the introduction of the Unified Memory (UM) model in GPU programming can reduce the memory footprint, the programming legacy of the UM model initializes data on the CPU side by default as the conventional copyand- execute model does, which causes significant latency of application execution. In this paper, we propose an enhanced unified memory management model (eUMM), which delivers a prefetch-enhanced data initialization method on the GPU side of the iGPU platform. We evaluate eUMM on the representative Jetson TX2 and Xavier AGX platforms and demonstrate that eUMM not only reduces the initialization latency significantly but also benefits the following kernel computation and the entire application execution latency. 
    more » « less
  5. Public Key Infrastructure (PKI) generates and distributes digital certificates to provide the root of trust for securing digital networking systems. To continue securing digital networking in the quantum era, PKI should transition to use quantum-resistant cryptographic algorithms. The cryptography community is developing quantum-resistant primitives/algorithms, studying, and analyzing them for cryptanalysis and improvements. National Institute of Standards and Technology (NIST) selected finalist algorithms for the post-quantum digital signature cipher standardization, which are Dilithium, Falcon, and Rainbow. We study and analyze the feasibility and the processing performance of these algorithms in memory/size and time/speed when used for PKI, including the key generation from the PKI end entities (e.g., a HTTPS/TLS server), the signing, and the certificate generation by the certificate authority within the PKI. The transition to post-quantum from the classical ciphers incur changes in the parameters in the PKI, for example, Rainbow I significantly increases the certificate size by 163 times when compared with RSA 3072. Nevertheless, we learn that the current X.509 supports the NIST post-quantum digital signature ciphers and that the ciphers can be modularly adapted for PKI. According to our empirical implementations-based study, the post-quantum ciphers can increase the certificate verification time cost compared to the current classical cipher and therefore the verification overheads require careful considerations when using the post-quantum-cipher-based certificates. 
    more » « less