This paper introduces a new data-structural object that we call the tiny pointer. In many applications, traditional\(\log n\)-bit pointers can be replaced with\(o(\log n)\)-bit tiny pointers at the cost of only a constant-factor time overhead and a small probability of failure. We develop a comprehensive theory of tiny pointers, and give optimal constructions for both fixed-size tiny pointers (i.e., settings in which all of the tiny pointers must be the same size) and variable-size tiny pointers (i.e., settings in which the average tiny-pointer size must be small, but some tiny pointers can be larger). If a tiny pointer references an item in an array filled to load factor\(1-\delta\), then the optimal tiny-pointer size is\(\Theta(\log\log\log n+\log\delta^{-1})\)bits in the fixed-size case, and\(\Theta(\log\delta^{-1})\)expected bits in the variable-size case. Our tiny-pointer constructions also require us to revisit several classic problems having to do with balls and bins; these results may be of independent interest. Using tiny pointers, we apply tiny pointers to five classic data-structure problems. We show that:A data structure storing\(n\)\(v\)-bit values for\(n\)keys with constant-factor time modifications/queries can be implemented to take space\(nv+O(n\log^{(r)}n)\)bits, for any constant\(r\gt0\), as long as the user stores a tiny pointer of expected size\(O(1)\)with each key—here,\(\log^{(r)}n\)is the\(r\)-th iterated logarithm.Any binary search tree can be made succinct, meaning that it achieves\((1+o(1))\)times the optimal space, with constant-factor time overhead, and can even be made to be within\(O(n)\)bits of optimal if we allow for\(O(\log^{*}n)\)-time modifications—this holds even for rotation-based trees such as the splay tree and the red-black tree.Any fixed-capacity key-value dictionary can be made stable (i.e., items do not move once inserted) with constant-factor time overhead and\((1+o(1))\)-factor space overhead.Any key-value dictionary that requires uniform-size values can be made to support arbitrary-size values with constant-factor time overhead and with an additional space consumption of\(\log^{(r)}n+O(\log j)\)bits per\(j\)-bit value for an arbitrary constant\(r\gt0\)of our choice.Given an external-memory array\(A\)of size\((1+\varepsilon)n\)containing a dynamic set of up to\(n\)key-value pairs, it is possible to maintain an internal-memory stash of size\(O(n\log\varepsilon^{-1})\)bits so that the location of any key-value pair in\(A\)can be computed in constant time (and with no IOs). In each case tiny pointers allow for us to take a natural space-inefficient solution that uses pointers and make it space-efficient for free. 
                        more » 
                        « less   
                    
                            
                            Hardware-Sensitive Fairness in Heterogeneous Federated Learning
                        
                    
    
            Federated learning (FL) is a promising technique for decentralized privacy-preserving Machine Learning (ML) with a diverse pool of participating devices with varying device capabilities. However, existing approaches to handle such heterogeneous environments do not consider “fairness” in model aggregation, resulting in significant performance variation among devices. Meanwhile, prior works on FL fairness remain hardware-oblivious and cannot be applied directly without severe performance penalties. To address this issue, we propose a novel hardware-sensitive FL method called\(\mathsf {FairHetero}\)that promotes fairness among heterogeneous federated clients. Our approach offers tunable fairness within a group of devices with the same ML architecture as well as across different groups with heterogeneous models. Our evaluation underMNIST,FEMNIST,CIFAR10, andSHAKESPEAREdatasets demonstrates that\(\mathsf {FairHetero}\)can reduce variance among participating clients’ test loss compared to the existing state-of-the-art techniques, resulting in increased overall performance. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10605331
- Publisher / Repository:
- Association for Computing Machinery (ACM)
- Date Published:
- Journal Name:
- ACM Transactions on Modeling and Performance Evaluation of Computing Systems
- Volume:
- 10
- Issue:
- 1
- ISSN:
- 2376-3639
- Format(s):
- Medium: X Size: p. 1-31
- Size(s):
- p. 1-31
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Given a weighted, ordered query set\(Q\)and a partition of\(Q\)into classes, we study the problem of computing a minimum-cost decision tree that, given any query\(q\in Q\), uses equality tests and less-than tests to determine\(q\)'s class. Such a tree can be faster and smaller than a conventional search tree and smaller than a lookup table (both of which must identify\(q\), not just its class). We give the first polynomial-time algorithm for the problem. The algorithm extends naturally to the setting where each query has multiple allowed classes.more » « less
- 
            Stream processing, which involves real-time computation of data as it is created or received, is vital for various applications, specifically wireless communication. The evolving protocols, the requirement for high-throughput, and the challenges of handling diverse processing patterns make it demanding. Traditional platforms grapple with meeting real-time throughput and latency requirements due to large data volume, sequential and indeterministic data arrival, and variable data rates, leading to inefficiencies in memory access and parallel processing. We present Canalis, a throughput-optimized framework designed to address these challenges, ensuring high-performance while achieving low energy consumption. Canalis is a hardware-software co-designed system. It includes a programmable spatial architecture, Flux Stream Processing Unit (FluxSPU), proposed by this work to enhance data throughput and energy efficiency. FluxSPU is accompanied by a software stack that eases the programming process. We evaluated Canalis with eight distinct benchmarks. When compared to CPU and GPU in mobile SoC to demonstrate the effectiveness of domain specialization, Canalis achieves an average speedup of 13.4\(\times\)and 6.6\(\times\), and energy savings of 189.8\(\times\)and 283.9\(\times\), respectively. In contrast to equivalent ASICs of the benchmarks, the average energy overhead of Canalis is within 2.4\(\times\), successfully maintaining generalizations without incurring significant overhead.more » « less
- 
            Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations, known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. The traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model’s ability to accurately capture and learn from complex patterns. Similarly, the fixed nature of the learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces\(\mathsf {TrainableHD} \), a novel HDC framework that enables dynamic training of the randomly generated encoder depending on the feedback of the learning data, thereby addressing the static nature of conventional HDC encoders.\(\mathsf {TrainableHD} \)also enhances the training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine\(\mathsf {TrainableHD} \)with effective quantization to enhance efficiency, allowing the execution of the inference phase in low-precision accelerators. Our evaluations demonstrate that\(\mathsf {TrainableHD} \)significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore,\(\mathsf {TrainableHD} \)is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform like NVIDIA Jetson Xavier,\(\mathsf {TrainableHD} \)is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model’s accuracy.more » « less
- 
            A constraint satisfaction problem (CSP),\(\textsf {Max-CSP}(\mathcal {F})\), is specified by a finite set of constraints\(\mathcal {F}\subseteq \lbrace [q]^k \rightarrow \lbrace 0,1\rbrace \rbrace\)for positive integersqandk. An instance of the problem onnvariables is given bymapplications of constraints from\(\mathcal {F}\)to subsequences of thenvariables, and the goal is to find an assignment to the variables that satisfies the maximum number of constraints. In the (γ ,β)-approximation version of the problem for parameters 0 ≤ β ≤ γ ≤ 1, the goal is to distinguish instances where at least γ fraction of the constraints can be satisfied from instances where at most β fraction of the constraints can be satisfied. In this work, we consider the approximability of this problem in the context of sketching algorithms and give a dichotomy result. Specifically, for every family\(\mathcal {F}\)and every β < γ, we show that either a linear sketching algorithm solves the problem in polylogarithmic space or the problem is not solvable by any sketching algorithm in\(o(\sqrt {n})\)space. In particular, we give non-trivial approximation algorithms using polylogarithmic space for infinitely many constraint satisfaction problems. We also extend previously known lower bounds for general streaming algorithms to a wide variety of problems, and in particular the case ofq=k=2, where we get a dichotomy, and the case when the satisfying assignments of the constraints of\(\mathcal {F}\)support a distribution on\([q]^k\)with uniform marginals. Prior to this work, other than sporadic examples, the only systematic classes of CSPs that were analyzed considered the setting of Boolean variablesq= 2, binary constraintsk=2, and singleton families\(|\mathcal {F}|=1\)and only considered the setting where constraints are placed on literals rather than variables. Our positive results show wide applicability of bias-based algorithms used previously by [47] and [41], which we extend to include richer norm estimation algorithms, by giving a systematic way to discover biases. Our negative results combine the Fourier analytic methods of [56], which we extend to a wider class of CSPs, with a rich collection of reductions among communication complexity problems that lie at the heart of the negative results. In particular, previous works used Fourier analysis over the Boolean cube to initiate their results and the results seemed particularly tailored to functions on Boolean literals (i.e., with negations). Our techniques surprisingly allow us to get to generalq-ary CSPs without negations by appealing to the same Fourier analytic starting point over Boolean hypercubes.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
