A fault-tolerant quantum computer must decode and correct errors faster than they appear. The faster errors can be corrected, the more time the computer can do useful work. The Union-Find (UF) decoder is promising with an average time complexity slightly higher than O(d3). We report a distributed version of the UF decoder that exploits parallel computing resources for further speedup. Using an FPGA-based implementation, we empirically show that this distributed UF decoder has a sublinear average time complexity with regard to d, given O(d3) parallel computing resources. The decoding time per measurement round decreases as d increases, a first time for a quantum error decoder. The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure. We are able to implement d up to 21 with a Xilinx VCU129 FPGA, for which an average decoding time is 11.5 ns per measurement round under phenomenological noise of 0.1%, significantly faster than any existing decoder implementation. Since the decoding time per measurement round of Helios decreases with d, Helios can decode a surface code of arbitrarily large d without a growing backlog.
more »
« less
FPGA-Based Distributed Union-Find Decoder for Surface Codes
A fault-tolerant quantum computer must decode and correct errors faster than they appear to prevent exponential slowdown due to error correction. The Union-Find (UF) decoder is promising with an average time complexity slightly higher than $O(d^3)$. We report a distributed version of the UF decoder that exploits parallel computing resources for further speedup. Using an FPGA-based implementation, we empirically show that this distributed UF decoder has a sublinear average time complexity with regard to $$d$$, given $O(d^3)$ parallel computing resources. The decoding time per measurement round decreases as $$d$$ increases, the first time for a quantum error decoder. The implementation employs a scalable architecture called Helios that organizes parallel computing resources into a hybrid tree-grid structure. Using a Xilinx VCU129 FPGA, we successfully implement $$d$$ up to 21 with an average decoding time of 11.5 ns per measurement round under 0.1\% phenomenological noise, and 23.7 ns for $$d=17$$ under equivalent circuit-level noise. This performance is significantly faster than any existing decoder implementation. Furthermore, we show that \name can optimize for resource efficiency by decoding $$d=51$ on a Xilinx VCU129 FPGA with an average latency of 544 ns per measurement round.
more »
« less
- Award ID(s):
- 2216030
- PAR ID:
- 10553492
- Publisher / Repository:
- IEEE
- Date Published:
- Journal Name:
- IEEE Transactions on Quantum Engineering
- Volume:
- 5
- ISSN:
- 2689-1808
- Page Range / eLocation ID:
- 1 to 18
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The Minimum-Weight Perfect Matching (MWPM) decoder is widely used in Quantum Error Correction (QEC) decoding. Despite its high accuracy, existing implementations of the MWPM decoder cannot catch up with quantum hardware, e.g., 1 million measurements per second for superconducting qubits. They suffer from a backlog of measurements that grows exponentially and as a result, cannot realize the power of quantum computation. We design and implement a fast MWPM decoder, called Parity Blossom, which reaches a time complexity almost proportional to the number of defect measurements. We further design and implement a parallel version of Parity Blossom called Fusion Blossom. Given a practical circuit-level noise of 0.1%, Fusion Blossom can decode a million measurement rounds per second up to a code distance of 33. Fusion Blossom also supports stream decoding mode that reaches a 0.7 ms decoding latency at code distance 21 regardless of the measurement rounds.more » « less
-
LDPC (Low-Density Parity-Check) codes have become a cornerstone of transforming a noise-filled physical channel into a reliable and high-performance data channel in communication and storage systems. FPGA (Field-Programmable Gate Array) based LDPC hardware, especially for decoding with high complexity, is essential to realizing the high-bandwidth channel prototypes. HLS (High-Level Synthesis) is introduced to speed up the FPGA development of LDPC hardware by automatically compiling high-level abstract behavioral descriptions into RTL-level implementations, but often sub-optimally due to lacking effective low-level descriptions. To overcome this problem, this paper proposes an HLS-friendly QC-LDPC FPGA decoder architecture, HF-LDPC, that employs HLS not only to precisely characterize high-level behaviors but also to effectively optimize low-level RTL implementation, thus achieving both high throughput and flexibility. First, HF-LDPC designs a multi-unit framework with a balanced I/O-computing dataflow to adaptively match code parameters with FPGA configurations. Second, HFLDPC presents a novel fine-grained task-level pipeline with interleaved updating to eliminate stalls due to data interdependence within each updating task. HF-LDPC also presents several HLSenhanced approaches. We implement and evaluate HF-LDPC on Xilinx U50, which demonstrates that HF-LDPC outperforms existing implementations by 4× to 84× with the same parameter and linearly scales to up to 116 Gbps actual decoding throughput with high hardware efficiency.more » « less
-
Abstract In practical quantum error correction implementations, the measurement of syndrome information is an unreliable step—typically modeled as a binary measurement outcome flipped with some probability. However, the measured syndrome is in fact a discretized value of the continuous voltage or current values obtained in the physical implementation of the syndrome extraction. In this paper, we use this “soft” or analog information to benefit iterative decoders for decoding quantum low-density parity-check (QLDPC) codes. Syndrome-based iterative belief propagation decoders are modified to utilize the soft syndrome to correct both data and syndrome errors simultaneously. We demonstrate the advantages of the proposed scheme not only in terms of comparison of thresholds and logical error rates for quasi-cyclic lifted-product QLDPC code families but also with faster convergence of iterative decoders. Additionally, we derive hardware (FPGA) architectures of these soft syndrome decoders and obtain similar performance in terms of error correction to the ideal models even with reduced precision in the soft information. The total latency of the hardware architectures is about 600 ns (for the QLDPC codes considered) in a 20 nm CMOS process FPGA device, and the area overhead is almost constant—less than 50% compared to min-sum decoders with noisy syndromes.more » « less
-
Abstract Running quantum algorithms protected by quantum error correction requires a real time, classical decoder. To prevent the accumulation of a backlog, this decoder must process syndromes from the quantum device at a faster rate than they are generated. Most prior work on real time decoding has focused on an isolated logical qubit encoded in the surface code. However, for surface code, quantum programs of utility will require multi-qubit interactions performed via lattice surgery. A large merged patch can arise during lattice surgery—possibly as large as the entire device. This puts a significant strain on a real time decoder, which must decode errors on this merged patch and maintain the level of fault-tolerance that it achieves on isolated logical qubits. These requirements are relaxed by using spatially parallel decoding, which can be accomplished by dividing the physical qubits on the device into multiple overlapping groups and assigning a decoder module to each. We refer to this approach asspatially parallel windows. While previous work has explored similar ideas, none have addressed system-specific considerations pertinent to the task or the constraints from using hardware accelerators. In this work, we demonstrate how to configure spatially parallel windows, so that the scheme (1) is compatible with hardware accelerators, (2) supports general lattice surgery operations, (3) maintains the fidelity of the logical qubits, and (4) meets the throughput requirement for real time decoding. Furthermore, our results reveal the importance of optimally choosing the buffer width to achieve a balance between accuracy and throughput—a decision that should be influenced by the device’s physical noise.more » « less
An official website of the United States government

