We initiate a systematic study of linear sketching over $$\ftwo$$. For a given Boolean function treated as $$f \colon \ftwo^n \to \ftwo$$ a randomized $$\ftwo$$-sketch is a distribution $$\mathcal M$$ over $$d \times n$$ matrices with elements over $$\ftwo$$ such that $$\mathcal Mx$$ suffices for computing $f(x)$ with high probability. Such sketches for $$d \ll n$$ can be used to design small-space distributed and streaming algorithms. Motivated by these applications we study a connection between $$\ftwo$$-sketching and a two-player one-way communication game for the corresponding XOR-function. We conjecture that $$\ftwo$$-sketching is optimal for this communication game. Our results confirm this conjecture for multiple important classes of functions: 1) low-degree $$\ftwo$$-polynomials, 2) functions with sparse Fourier spectrum, 3) most symmetric functions, 4) recursive majority function. These results rely on a new structural theorem that shows that $$\ftwo$$-sketching is optimal (up to constant factors) for uniformly distributed inputs. Furthermore, we show that (non-uniform) streaming algorithms that have to process random updates over $$\ftwo$$ can be constructed as $$\ftwo$$-sketches for the uniform distribution. In contrast with the previous work of Li, Nguyen and Woodruff (STOC'14) who show an analogous result for linear sketches over integers in the adversarial setting our result does not require the stream length to be triply exponential in $$n$$ and holds for streams of length $$\tilde O(n)$$ constructed through uniformly random updates.
more »
« less
Living on the Edge: Data Transmission, Storage, and Analytics in Continuous Sensing Environments
Voluminous time-series data streams produced in continuous sensing environments impose challenges pertaining to ingestion, storage, and analytics. In this study, we present a holistic approach based on data sketching to address these issues. We propose a hyper-sketching algorithm that combines discretization and frequency-based sketching to produce compact representations of the multi-feature, time-series data streams. We generate an ensemble of data sketches to make effective use of capabilities at the resource-constrained edge devices, the links over which data are transmitted, and the server pool where this data must be stored. The data sketches can be queried to construct datasets that are amenable to processing using popular analytical engines. We include several performance benchmarks using real-world data from different domains to profile the suitability of our design decisions. The proposed methodology can achieve up to ∼ 13 × and ∼ 2, 207 × reduction in data transfer and energy consumption at edge devices. We observe up to a ∼ 50% improvement in analytical job completion times in addition to the significant improvements in disk and network I/O.
more »
« less
- Award ID(s):
- 1931363
- PAR ID:
- 10284571
- Date Published:
- Journal Name:
- ACM Transactions on Internet of Things
- Volume:
- 2
- Issue:
- 3
- ISSN:
- 2691-1914
- Page Range / eLocation ID:
- 1 to 31
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We initiate a systematic study of linear sketching over F_2. For a given Boolean function treated as f : F_2^n -> F_2 a randomized F_2-sketch is a distribution M over d x n matrices with elements over F_2 such that Mx suffices for computing f(x) with high probability. Such sketches for d << n can be used to design small-space distributed and streaming algorithms. Motivated by these applications we study a connection between F_2-sketching and a two-player one-way communication game for the corresponding XOR-function. We conjecture that F_2-sketching is optimal for this communication game. Our results confirm this conjecture for multiple important classes of functions: 1) low-degree F_2-polynomials, 2) functions with sparse Fourier spectrum, 3) most symmetric functions, 4) recursive majority function. These results rely on a new structural theorem that shows that F_2-sketching is optimal (up to constant factors) for uniformly distributed inputs. Furthermore, we show that (non-uniform) streaming algorithms that have to process random updates over F_2 can be constructed as F_2-sketches for the uniform distribution. In contrast with the previous work of Li, Nguyen and Woodruff (STOC'14) who show an analogous result for linear sketches over integers in the adversarial setting our result does not require the stream length to be triply exponential in n and holds for streams of length O(n) constructed through uniformly random updates.more » « less
-
null (Ed.)Building a sketch of an n-by-n empirical kernel matrix is a common approach to accelerate the computation of many kernel methods. In this paper, we propose a unified framework of constructing sketching methods in kernel ridge regression (KRR), which views the sketching matrix S as an accumulation of m rescaled sub-sampling matrices with independent columns. Our framework incorporates two commonly used sketching methods, sub-sampling sketches (known as the Nyström method) and sub-Gaussian sketches, as special cases with m=1 and m=infinity respectively. Under the new framework, we provide a unified error analysis of sketching approximation and show that our accumulation scheme improves the low accuracy of sub-sampling sketches when certain incoherence characteristic is high, and accelerates the more accurate but computationally heavier sub-Gaussian sketches. By optimally choosing the number m of accumulations, we show that a best trade-off between computational efficiency and statistical accuracy can be achieved. In practice, the sketching method can be as efficiently implemented as the sub-sampling sketches, as only minor extra matrix additions are needed. Our empirical evaluations also demonstrate that the proposed method may attain the accuracy close to sub-Gaussian sketches, while is as efficient as sub-sampling-based sketches.more » « less
-
We introduce a new sub-linear space sketch the "Weight-Median Sketch" for learning compressed linear classifiers over data streams while supporting the efficient recovery of large-magnitude weights in the model. This enables memory-limited execution of several statistical analyses over streams, including online feature selection, streaming data explanation, relative deltoid detection, and streaming estimation of pointwise mutual information. Unlike related sketches that capture the most frequently-occurring features (or items) in a data stream, the Weight-Median Sketch captures the features that are most discriminative of one stream (or class) compared to another. The Weight-Median Sketch adopts the core data structure used in the Count-Sketch, but, instead of sketching counts, it captures sketched gradient updates to the model parameters. We provide a theoretical analysis that establishes recovery guarantees for batch and online learning, and demonstrate empirical improvements in memory-accuracy trade-offs over alternative memory-budgeted methods, including count-based sketches and feature hashing.more » « less
-
Summary Comprehending 3D diagrams is critical for success in scientific practice and research demonstrates that understanding of 3D geology diagrams can be improved by making predictive sketches. In mathematics, explaining erroneous examples can support learning. This study combined these approaches to better understand how to effectively support 3D geologic diagram understanding. Participants generated sketches, explained erroneous example sketches, or copied and explained correct sketches. It was hypothesized that generating sketches or explaining erroneous cases would improve understanding, but an open question was whether these conditions would differ from each other. Explaining erroneous examples and sketching improved understanding whereas explaining correct sketches did not. Further, explaining erroneous examples was a more efficient strategy than sketching. These results indicate that erroneous examples can be effective for supporting 3D diagram comprehension and may be a practical substitute for some traditional sketching activities in the context of real classrooms where class time is limited.more » « less
An official website of the United States government

