This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional\(\log n\)-bit pointers can be replaced with\(o(\log n)\)-bit tiny pointers at the cost of only a constant-factor time overhead and a small probability of failure. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an item in an array filled to load factor\(1-\delta\), then the optimal tiny-pointer size is\(\Theta(\log\log\log n+\log\delta^{-1})\)bits in the fixed-size case, and\(\Theta(\log\delta^{-1})\)expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we apply tiny pointers to five classic data-structure problems. We show that:A data structure storing\(n\)\(v\)-bit values for\(n\)keys with constant-factor time modifications/queries can be implemented to take space\(nv+O(n\log^{(r)}n)\)bits, for any constant\(r\gt0\), as long as the user stores a tiny pointer of expected size\(O(1)\)with each key—here,\(\log^{(r)}n\)is the\(r\)-th iterated logarithm.Any binary search tree can be made succinct, meaning that it achieves\((1+o(1))\)times the optimal space, with constant-factor time overhead, and can even be made to be within\(O(n)\)bits of optimal if we allow for\(O(\log^{*}n)\)-time modifications—this holds even for rotation-based trees such as the splay tree and the red-black tree.Any fixed-capacity key-value dictionary can be made stable (i.e., items do not move once inserted) with constant-factor time overhead and\((1+o(1))\)-factor space overhead.Any key-value dictionary that requires uniform-size values can be made to support arbitrary-size values with constant-factor time overhead and with an additional space consumption of\(\log^{(r)}n+O(\log j)\)bits per\(j\)-bit value for an arbitrary constant\(r\gt0\)of our choice.Given an external-memory array\(A\)of size\((1+\varepsilon)n\)containing a dynamic set of up to\(n\)key-value pairs, it is possible to maintain an internal-memory stash of size\(O(n\log\varepsilon^{-1})\)bits so that the location of any key-value pair in\(A\)can be computed in constant time (and with no IOs). In each case tiny pointers allow for us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free. 
                        more » 
                        « less   
                    This content will become publicly available on December 31, 2025
                            
                            Canalis: A Throughput-Optimized Framework for Real-Time Stream Processing of Wireless Communication
                        
                    
    
            Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high-throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high-performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, Flux Stream Processing Unit (FluxSPU), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to CPU and GPU in mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4\(\times\)and 6.6\(\times\), and energy savings of 189.8\(\times\)and 283.9\(\times\), respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4\(\times\), successfully maintaining generalizations without incurring significant overhead. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 1942806
- PAR ID:
- 10571760
- Publisher / Repository:
- ACM
- Date Published:
- Journal Name:
- ACM Transactions on Reconfigurable Technology and Systems
- Volume:
- 17
- Issue:
- 4
- ISSN:
- 1936-7406
- Page Range / eLocation ID:
- 1 to 32
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Processing In-Memory (PIM) is a data-centric computation paradigm that performs computations inside the memory, hence eliminating the memory wall problem in traditional computational paradigms used in Von-Neumann architectures. The associative processor, a type of PIM architecture, allows performing parallel and energy-efficient operations on vectors. This architecture is found useful in vector-based applications such as Hyper-Dimensional (HDC) Reinforcement Learning (RL). HDC is rising as a new powerful and lightweight alternative to costly traditional RL models such as Deep Q-Learning. The HDC implementation of Q-Learning relies on encoding the states in a high-dimensional representation where calculating Q-values and finding the maximum one can be done entirely in parallel. In this article, we propose to implement the main operations of a HDC RL framework on the associative processor. This acceleration achieves up to\(152.3\times\)and\(6.4\times\)energy and time savings compared to an FPGA implementation. Moreover, HDRLPIM shows that an SRAM-based AP implementation promises up to\(968.2\times\)energy-delay product gains compared to the FPGA implementation.more » « less
- 
            Given a weighted, ordered query set\(Q\)and a partition of\(Q\)into classes, we study the problem of computing a minimum-cost decision tree that, given any query\(q\in Q\), uses equality tests and less-than tests to determine\(q\)'s class. Such a tree can be faster and smaller than a conventional search tree and smaller than a lookup table (both of which must identify\(q\), not just its class). We give the first polynomial-time algorithm for the problem. The algorithm extends naturally to the setting where each query has multiple allowed classes.more » « less
- 
            Dense matrix multiply (MM) serves as one of the most heavily used kernels in deep learning applications. To cope with the high computation demands of these applications, heterogeneous architectures featuring both FPGA and dedicated ASIC accelerators have emerged as promising platforms. For example, the AMD/Xilinx Versal ACAP architecture combines general-purpose CPU cores and programmable logic with AI Engine processors optimized for AI/ML. An array of 400 AI Engine processors executing at 1 GHz can provide up to 6.4 TFLOPS performance for 32-bit floating-point (FP32) data. However, machine learning models often contain both large and small MM operations. While large MM operations can be parallelized efficiently across many cores, small MM operations typically cannot. We observe that executing some small MM layers from the BERT natural language processing model on a large, monolithic MM accelerator in Versal ACAP achieved less than 5% of the theoretical peak performance. Therefore, one key question arises:How can we design accelerators to fully use the abundant computation resources under limited communication bandwidth for end-to-end applications with multiple MM layers of diverse sizes? We identify the biggest system throughput bottleneck resulting from the mismatch between massive computation resources of one monolithic accelerator and the various MM layers of small sizes in the application. To resolve this problem, we propose the CHARM framework to composemultiple diverse MM accelerator architecturesworking concurrently on different layers within one application. CHARM includes analytical models which guide design space exploration to determine accelerator partitions and layer scheduling. To facilitate system designs, CHARM automatically generates code, enabling thorough onboard design verification. We deploy the CHARM framework on four different deep learning applications in FP32, INT16, and INT8 data types, including BERT, ViT, NCF, and MLP, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experiments show that we achieve 1.46 TFLOPS, 1.61 TFLOPS, 1.74 TFLOPS, and 2.94 TFLOPS inference throughput for BERT, ViT, NCF, and MLP in FP32 data type, respectively, which obtain 5.29\(\times\), 32.51\(\times\), 1.00\(\times\), and 1.00\(\times\)throughput gains compared to one monolithic accelerator. CHARM achieves the maximum throughput of 1.91 TOPS, 1.18 TOPS, 4.06 TOPS, and 5.81 TOPS in the INT16 data type for the four applications. The maximum throughput achieved by CHARM in the INT8 data type is 3.65 TOPS, 1.28 TOPS, 10.19 TOPS, and 21.58 TOPS, respectively. We have open-sourced our tools, including detailed step-by-step guides to reproduce all the results presented in this paper and to enable other users to learn and leverage CHARM framework and tools in their end-to-end systems:https://github.com/arc-research-lab/CHARM.more » « less
- 
            We prove novel algorithmic guarantees for several online problems in the smoothed analysis model. In this model, at each time step an adversary chooses an input distribution with density function bounded above pointwise by \(\tfrac{1}{\sigma }\)times that of the uniform distribution; nature then samples an input from this distribution. Here, σ is a parameter that interpolates between the extremes of worst-case and average case analysis. Crucially, our results hold foradaptiveadversaries that can base their choice of input distribution on the decisions of the algorithm and the realizations of the inputs in the previous time steps. An adaptive adversary can nontrivially correlate inputs at different time steps with each other and with the algorithm’s current state; this appears to rule out the standard proof approaches in smoothed analysis. This paper presents a general technique for proving smoothed algorithmic guarantees against adaptive adversaries, in effect reducing the setting of an adaptive adversary to the much simpler case of an oblivious adversary (i.e., an adversary that commits in advance to the entire sequence of input distributions). We apply this technique to prove strong smoothed guarantees for three different problems:(1)Online learning: We consider the online prediction problem, where instances are generated from an adaptive sequence of σ-smooth distributions and the hypothesis class has VC dimensiond. We bound the regret by\(\tilde{O}(\sqrt {T d\ln (1/\sigma)} + d\ln (T/\sigma))\)and provide a near-matching lower bound. Our result shows that under smoothed analysis, learnability against adaptive adversaries is characterized by the finiteness of the VC dimension. This is as opposed to the worst-case analysis, where online learnability is characterized by Littlestone dimension (which is infinite even in the extremely restricted case of one-dimensional threshold functions). Our results fully answer an open question of Rakhlin et al. [64].(2)Online discrepancy minimization: We consider the setting of the online Komlós problem, where the input is generated from an adaptive sequence of σ-smooth and isotropic distributions on the ℓ2unit ball. We bound the ℓ∞norm of the discrepancy vector by\(\tilde{O}(\ln ^2(\frac{nT}{\sigma }))\). This is as opposed to the worst-case analysis, where the tight discrepancy bound is\(\Theta (\sqrt {T/n})\). We show such\(\mathrm{polylog}(nT/\sigma)\)discrepancy guarantees are not achievable for non-isotropic σ-smooth distributions.(3)Dispersion in online optimization: We consider online optimization with piecewise Lipschitz functions where functions with ℓ discontinuities are chosen by a smoothed adaptive adversary and show that the resulting sequence is\(({\sigma }/{\sqrt {T\ell }}, \tilde{O}(\sqrt {T\ell }))\)-dispersed. That is, every ball of radius\({\sigma }/{\sqrt {T\ell }}\)is split by\(\tilde{O}(\sqrt {T\ell })\)of the partitions made by these functions. This result matches the dispersion parameters of Balcan et al. [13] for oblivious smooth adversaries, up to logarithmic factors. On the other hand, worst-case sequences are trivially (0,T)-dispersed.1more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
