skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on March 1, 2026

Title: Parallel Trellis-Stage-Combining BCJR for High-Throughput CUDA Decoder of CCSDS SCPPM
The Consultative Committee for Space Data Systems (CCSDS) standard for high photon efficiency uses a serially-concatenated (SC) code to encode pulse position modulated laser light. A convolutional encoder serves as the outer code and an accumulator serves as the inner code. These two component codes are connected through an interleaver. This coding scheme is called Serially Concatenated convolutionally coded Pulse Position Modulation (SCPPM) and it is used for NASA's Deep Space Optical Communications (DSOC) experiment. For traditional decoding that traverses the trellis forwards and backwards according to the Bahl Cocke Jelinek and Raviv (BCJR) algorithm, the latency is on the order of the length of the trellis, which has 10,080 stages for the rate 2/3 DSOC code. This paper presents a novel alternative approach that simultaneously processes all trellis stages, successively combining pairs of stages into a meta-stage. This approach has latency that is on the order of the log base-2 of the number of stages. The new decoder is implemented using the Compute Unified Device Architecture (CUDA) platform on an Nvidia Graphics Processing Unit (GPU). Compared to Field Programmable Gate Array (FPGA) implementations, the GPU implementation offers easier development, scalability, and portability across GPU models. The GPU implementation provides a dramatic increase in speed that facilitates more thorough simulation as well as enables a shift from FPGA to GPU processors for DSOC ground stations.  more » « less
Award ID(s):
2008918
PAR ID:
10588914
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Format(s):
Medium: X
Location:
Yellowstone Conference Center in Big Sky, Montana
Sponsoring Org:
National Science Foundation
More Like this
  1. List Viterbi decoders are a very effective way to improve the performance of block codes in combination with an error detection outer code. In this work, we combine an efficient serial list Viterbi decoder design with an existing serially concatenated, convolutionally-encoded, pulse position modulated code (SCPPM) used in space communication, that exhibits poor performance because of an error floor. The SCPPM code features a 32-bit CRC that provides powerful error detection capability and an outer four-state convolutional code that makes it suitable for a list Viterbi decoder. The system’s code is very long, consisting of 15, 120 bits, which renders a high complexity decoder impractical, while the high error detection allows for a list decoder with very low undetected error probability. We use a very efficient list Viterbi decoder algorithm to avoid most of the redundant operations to produce low complexity serial list Viterbi decoder. The combined system reduces the error floor, moderately for the original version of the system, and completely suppresses it when the code length is increased to four times longer. 
    more » « less
  2. Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suitable to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9 × speedup over CSR, and up to a 2 × speedup over Nvidia’s cuML library. 
    more » « less
  3. We present Grapple, a new and powerful framework for explicit-state model checking on GPUs. Grapple is based on swarm verification (SV), a model-checking technique wherein a collection or swarm of small, memory- and time-bounded verification tests (VTs) are run in parallel to perform state-space exploration. SV achieves high state-space coverage via diversification of the search strategies used by constituent VTs. Grapple represents a swarm implementation for the GPU. In particular, it runs a parallel swarm of internally-parallel VTs, which are implemented in a manner that specifically targets the GPU architecture and the SIMD parallelism its computing cores offer. Grapple also makes effective use of the GPU shared memory, eliminating costly inter-block communication overhead. We conducted a comprehensive performance analysis of Grapple focused on the various design parameters, including the size of the queue structure, implementation of guard statements, and nondeterministic exploration order. Tests are run with multiple hardware configurations, including on the Amazon cloud. Our results show that Grapple performs favorably compared to the SPIN swarm and a prior non-swarm GPU implementation. Although a recently debuted FPGA swarm is faster, the deployment process to the FPGA is much more complex than Grapple's. 
    more » « less
  4. Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware acceleration approach to enable real-time and energy efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is on average, 26× more energy efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library generated from high-level synthesis to enable reconfigurable hardware designs on FPGA fabrics that are better tuned to user-specified scene, resource, and performance constraints. 
    more » « less
  5. An expurgating linear function (ELF) is an outer code that disallows low-weight codewords of the inner code. ELFs can be designed either to maximize the minimum distance or to minimize the codeword error rate (CER) of the expurgated code. A list-decoding sieve can efficiently identify ELFs that maximize the minimum distance of the expurgated code. For convolutional inner codes, this paper provides analytical distance spectrum union (DSU) bounds on the CER of the concatenated code. For short codeword lengths, ELFs transform a good inner code into a great concatenated code. For a constant message size of K = 64 bits or constant codeword blocklength of N = 152 bits, an ELF can reduce the gap at CER 10−6 between the DSU and the random-coding union (RCU) bounds from over 1 dB for the inner code alone to 0.23 dB for the concatenated code. The DSU bounds can also characterize puncturing that mitigates the rate overhead of the ELF while maintaining the DSU-to-RCU gap. List Viterbi decoding guided by the ELF achieves maximum likelihood (ML) decoding of the concatenated code with a sufficiently large list size. The rate-K/(K+m) ELF outer code reduces rate and list decoding increases decoder complexity. As SNR increases, the average list size converges to 1 and average complexity is similar to Viterbi decoding on the trellis of the inner code. For rare large-magnitude noise events, which occur less often than the FER of the inner code, a deep search in the list finds the ML codeword. 
    more » « less