Title: LESS: A Matrix Split and Balance Algorithm for Parallel Circuit (Optical) or Hybrid Data Center Switching and More
The research problem of how to use a high-speed circuit switch, typically an optical switch, to most effectively boost the switching capacity of a datacenter network has been extensively studied. In this work, we focus on a different but related research problem that arises when multiple (say $s$) parallel circuit switches are used: How to best split a switching workload $D$ into sub-workloads $D_1, D_2, \ldots, D_s$, and give them to the $s$ switches as their respective workloads, so that the overall makespan of the parallel switching system is minimized? Computing such an optimal split is unfortunately NP-hard, since a circuit/optical switch incurs a nontrivial reconfiguration delay whenever its configuration has to change. We therefore formulate a weaker form of this problem: How to minimize the total number of nonzero entries in $D_1, D_2, \ldots, D_s$ (so that the overall reconfiguration cost can be kept low), under the constraint that every row or column sum of $D$ (which corresponds to the workload imposed on a sending or receiving rack, respectively) is evenly split? Although this weaker problem is still NP-hard, we design LESS, an approximation algorithm that achieves a low approximation ratio of only $1+\epsilon$ in practice and has a low computational complexity of only $O(m^2)$, where $m = \|D\|_0$ is the number of nonzero entries in $D$. Our simulation studies show that LESS yields excellent overall makespan performance under realistic datacenter traffic workloads and parameter settings.
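To make the objective concrete, here is a minimal Python sketch (ours, not the paper's LESS algorithm; the matrix values are made up) that checks the balance constraint from the abstract and compares the trivial split D/s, which keeps all m nonzero entries in every part, against a sparser balanced split:

import numpy as np

def is_balanced_split(D, parts, tol=1e-9):
    """Constraint from the abstract: nonnegative parts that sum to D,
    with every row sum and every column sum of D split evenly across
    the s parts (each sending/receiving rack's load is balanced)."""
    s = len(parts)
    ok = np.allclose(sum(parts), D, atol=tol)
    ok &= all((Di >= -tol).all() for Di in parts)
    ok &= all(np.allclose(Di.sum(axis=1), D.sum(axis=1) / s, atol=tol)
              for Di in parts)   # per-row: sending racks
    ok &= all(np.allclose(Di.sum(axis=0), D.sum(axis=0) / s, atol=tol)
              for Di in parts)   # per-column: receiving racks
    return bool(ok)

def total_nnz(parts):
    """The objective LESS minimizes: total nonzeros across all parts,
    a proxy for the total reconfiguration cost."""
    return sum(int(np.count_nonzero(Di)) for Di in parts)

D = np.array([[4., 0., 2.],
              [0., 6., 0.],
              [2., 0., 4.]])        # m = ||D||_0 = 5
s = 2

naive = [D / s, D / s]              # always balanced, but every part keeps all m entries
assert is_balanced_split(D, naive)
print(total_nnz(naive))             # 10 = s * m, the trivial worst case

better = [np.diag([3., 3., 3.]), D - np.diag([3., 3., 3.])]
assert is_balanced_split(D, better)
print(total_nnz(better))            # 8: same balance, fewer nonzeros to configure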
Award ID(s): 1909048, 1717947
NSF-PAR ID: 10167954
Author(s) / Creator(s):
Date Published:
Journal Name: IEEE/ACM 12th International Conference on Utility and Cloud Computing (UCC'19), December 2-5, 2019, Auckland, New Zealand
Page Range / eLocation ID: 187 to 197
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The cumulative pebbling complexity of a directed acyclic graph G is defined as cc(G) = min_P ∑_i |P_i|, where the minimum is taken over all legal (parallel) black pebblings of G and |P_i| denotes the number of pebbles on the graph during round i. Intuitively, cc(G) captures the amortized Space-Time complexity of pebbling m copies of G in parallel. The cumulative pebbling complexity of a graph G is of particular interest in the field of cryptography, as cc(G) is tightly related to the amortized Area-Time complexity of the Data-Independent Memory-Hard Function (iMHF) f_{G,H} [Joël Alwen and Vladimir Serbinenko, 2015] defined using a constant-indegree directed acyclic graph (DAG) G and a random oracle H(⋅). A secure iMHF should have amortized Space-Time complexity as high as possible, e.g., to deter a brute-force password attacker who wants to find x such that f_{G,H}(x) = h. Thus, to analyze the (in)security of a candidate iMHF f_{G,H}, it is crucial to estimate the value cc(G), but currently the upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that it is NP-hard to compute cc(G), but their techniques do not even rule out an efficient (1+ε)-approximation algorithm for any constant ε > 0. We show that for any constant c > 0, it is Unique Games hard to approximate cc(G) to within a factor of c. Along the way, we show the hardness of approximation of the DAG Vertex Deletion problem on DAGs of constant indegree. Namely, we show that for any k, ε > 0, given a DAG G with N nodes and constant indegree, it is Unique Games hard to distinguish between the case that G is (e_1, d_1)-reducible with e_1 = N^{1/(1+2ε)}/k and d_1 = k N^{2ε/(1+2ε)}, and the case that G is (e_2, d_2)-depth-robust with e_2 = (1-ε) k e_1 and d_2 = 0.9 N^{(1+ε)/(1+2ε)}, which may be of independent interest. Our result generalizes a result of Svensson, who proved an analogous result for DAGs with indegree 𝒪(N).
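    As a sanity check on the definition, the following toy Python sketch (ours; exhaustive, so it only runs on graphs with a handful of nodes) computes cc(G) exactly by running Dijkstra over pebble configurations, where a round that moves to configuration T costs |T|:

import heapq
from itertools import chain, combinations, count

def cumulative_pebbling_cost(parents, sink):
    """Exact cc(G) for a tiny DAG, straight from the definition above:
    Dijkstra over pebble configurations. A round moving from S to T is
    legal iff every newly pebbled node v in T - S has all of parents[v]
    pebbled in S, and it costs |T| pebbles. 2^|V| states: toy scale only."""
    nodes = list(parents)
    configs = [frozenset(c) for c in
               chain.from_iterable(combinations(nodes, r)
                                   for r in range(len(nodes) + 1))]
    tie = count()                      # tiebreaker so the heap never compares sets
    dist = {frozenset(): 0}
    heap = [(0, next(tie), frozenset())]
    while heap:
        d, _, S = heapq.heappop(heap)
        if d > dist.get(S, float("inf")):
            continue                   # stale heap entry
        if sink in S:
            return d                   # first pop containing the sink is optimal
        for T in configs:
            if all(parents[v] <= S for v in T - S):
                if d + len(T) < dist.get(T, float("inf")):
                    dist[T] = d + len(T)
                    heapq.heappush(heap, (d + len(T), next(tie), T))
    return float("inf")

# Path 1 -> 2 -> 3: slide a single pebble up the path, so cc = 1 + 1 + 1 = 3.
parents = {1: frozenset(), 2: frozenset({1}), 3: frozenset({2})}
print(cumulative_pebbling_cost(parents, sink=3))  # 3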
  2. Despite advances in network security, attacks targeting mission-critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse-grained analysis. However, the prevalence of low-and-slow attacks, which require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out the flow-tracking features required to detect stealthy attacks, which can thus go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow-state tracking and flow-logging system that operates at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch-based network telemetry platforms to process the bulk of the traffic and forward only suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis, which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and the programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields a 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently than standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch reduces packet processing latency by 72.32%.
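    The division of labor described above can be sketched schematically in Python (all class names, thresholds, and the trace below are hypothetical; the real system implements the coarse stage as P4 queries on the switch and the fine stage on a SmartNIC):

from collections import Counter, defaultdict

class CoarseSwitchStage:
    """Memory-light triage on the switch: per-flow packet counts; flows
    outside the [low, high] window are forwarded for deep inspection."""
    def __init__(self, low, high):
        self.low, self.high = low, high   # tuned by the control loop
        self.counts = Counter()
    def observe(self, flow):
        self.counts[flow] += 1
    def suspicious(self):
        # Low-and-slow flows (few packets) and floods both qualify.
        return {f for f, c in self.counts.items() if c < self.low or c > self.high}

class FineNicStage:
    """Stateful per-flow analysis on the suspicious subset only, e.g.
    flagging suspiciously regular inter-arrival times (one covert
    timing-channel signature)."""
    def __init__(self):
        self.ts = defaultdict(list)
    def observe(self, flow, t):
        self.ts[flow].append(t)
    def covert_timing_suspects(self, max_jitter=0.01):
        flagged = []
        for f, times in self.ts.items():
            gaps = [b - a for a, b in zip(times, times[1:])]
            if len(gaps) >= 3 and max(gaps) - min(gaps) < max_jitter:
                flagged.append(f)
        return flagged

trace = [("a", 0.00), ("a", 0.40), ("a", 0.45), ("a", 0.90),
         ("c", 0.0), ("c", 1.0), ("c", 2.0), ("c", 3.0), ("c", 4.0)]
switch = CoarseSwitchStage(low=6, high=1000)
for flow, t in trace:
    switch.observe(flow)
suspects = switch.suspicious()       # both flows are "low and slow" here
nic = FineNicStage()
for flow, t in trace:                # second pass stands in for live forwarding
    if flow in suspects:
        nic.observe(flow, t)
print(nic.covert_timing_suspects())  # ['c']: metronome-regular inter-arrivals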
  3. The Minimum Circuit Size Problem (MCSP) has been the focus of intense study recently; MCSP is hard for SZK under rather powerful reductions, and is provably not hard under "local" reductions computable in TIME(n^0.49). The question of whether MCSP is NP-hard (or indeed, hard even for small subclasses of P) under some of the more familiar notions of reducibility (such as many-one or Turing reductions computable in polynomial time or in AC^0) is closely related to many of the longstanding open questions in complexity theory. All prior hardness results for MCSP hold also for computing somewhat weak approximations to the circuit complexity of a function. Some of these results were proved by exploiting a connection to a notion of time-bounded Kolmogorov complexity (KT) and the corresponding decision problem (MKTP). More recently, a new approach for proving improved hardness results for MKTP was developed, but this approach establishes only hardness of extremely good approximations of the form 1+o(1), and these improved hardness results are not yet known to hold for MCSP. In particular, it is known that MKTP is hard for the complexity class DET under nonuniform AC^0 m-reductions, implying MKTP is not in AC^0[p] for any prime p. It was still open whether similar circuit lower bounds hold for MCSP. One possible avenue for proving a similar hardness result for MCSP would be to improve the hardness of approximation for MKTP beyond 1+o(1) to ω(1), as KT-complexity and circuit size are polynomially related. In this paper, we show that this approach cannot succeed. More specifically, we prove that PARITY does not reduce to the problem of computing superlinear approximations to KT-complexity or circuit size via AC^0-Turing reductions that make O(1) queries. This is significant, since approximating any set in P/poly AC^0-reduces to just one query of a much worse approximation of circuit size or KT-complexity. For weaker approximations, we also prove non-hardness under more powerful reductions. Our non-hardness results are unconditional, in contrast to conditional results presented in earlier work (for more powerful reductions, but for much worse approximations). This highlights obstacles that would have to be overcome by any proof that MKTP or MCSP is hard for NP under AC^0 reductions. It may also be a step toward confirming a conjecture of Murray and Williams that MCSP is not NP-complete under logtime-uniform AC^0 m-reductions.
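    For readers unfamiliar with MCSP, the following toy Python brute force (ours; exponential, so tiny inputs only) makes the problem statement concrete: given the full truth table of a function, find the size of the smallest circuit computing it, here over an AND/OR/NOT gate basis:

from collections import deque

def circuit_size(target, n):
    """Toy brute force for the search version of MCSP: the fewest fanin-2
    AND/OR gates plus NOT gates whose straight-line program computes the
    n-input truth table `target` (an int bitmask over all 2^n assignments).
    BFS over sets of computed truth tables; exponential, tiny n only."""
    mask = (1 << (1 << n)) - 1
    # Truth table of input variable x_i, as a bitmask over assignments a.
    inputs = [sum(1 << a for a in range(1 << n) if (a >> i) & 1)
              for i in range(n)]
    if target in inputs:
        return 0
    start = frozenset(inputs)
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        wires, k = queue.popleft()
        for a in wires:
            for b in wires:
                for g in (a & b, a | b, (~a) & mask):
                    if g == target:
                        return k + 1        # BFS => minimal gate count
                    nxt = wires | {g}
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, k + 1))

# XOR of two variables has truth table 0b0110 (true on assignments 01, 10).
print(circuit_size(0b0110, n=2))  # 4 in this basis, e.g. (x | y) & ~(x & y)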
  4. We study the low-rank phase retrieval problem, where our goal is to recover a $d_1\times d_2$ low-rank matrix from a series of phaseless linear measurements. This is a fourth-order inverse problem, as we are trying to recover factors of a matrix that has been observed only indirectly, through quadratic measurements. We propose a solution to this problem using the recently introduced technique of anchored regression. This approach uses two different types of convex relaxation: we replace the quadratic equality constraints for the phaseless measurements by a search over a polytope, and we enforce the rank constraint through nuclear norm regularization. The result is a convex program in the space of $d_1 \times d_2$ matrices. We analyze two specific scenarios. In the first, the target matrix is rank-$1$, and the observations are structured to correspond to a phaseless blind deconvolution. In the second, the target matrix has general rank, and we observe the magnitudes of the inner products against a series of independent Gaussian random matrices. In each of these problems, we show that anchored regression returns an accurate estimate from a near-optimal number of measurements, given that we have access to an anchor matrix of sufficient quality. We also show how to create such an anchor in the phaseless blind deconvolution problem from an optimal number of measurements, and we present a partial result in this direction for the general-rank problem.
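    Below is a small cvxpy sketch of the kind of convex program described (our notation and parameter choices, not the paper's; the anchor here is synthesized by perturbing the ground truth) for the Gaussian-measurement scenario with a rank-1 target:

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
d1, d2, n = 8, 6, 200
X_true = np.outer(rng.standard_normal(d1), rng.standard_normal(d2))  # rank-1 target
B = rng.standard_normal((n, d1, d2))            # independent Gaussian measurement matrices
y = np.abs(np.einsum("nij,ij->n", B, X_true))   # phaseless measurements |<B_i, X>|
A = X_true + 0.3 * rng.standard_normal((d1, d2))  # imperfect but usable anchor
lam = 0.1                                         # nuclear-norm weight

# Anchored regression: maximize correlation with the anchor inside the
# polytope {X : |<B_i, X>| <= y_i}, with a nuclear-norm penalty standing
# in for the rank constraint.
X = cp.Variable((d1, d2))
inner = cp.hstack([cp.sum(cp.multiply(B[i], X)) for i in range(n)])
prob = cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, X)) - lam * cp.normNuc(X)),
                  [cp.abs(inner) <= y])
prob.solve()
err = np.linalg.norm(X.value - X_true) / np.linalg.norm(X_true)
print(f"relative error: {err:.3f}")  # should be small when the anchor is good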
  5. Benchmarks that closely match the behavior of production workloads are crucial for designing and provisioning computer systems. However, current approaches fall short: First, open-source benchmarks use public datasets that cause behavior different from production workloads. Second, black-box workload cloning techniques generate synthetic code that imitates the target workload, but the resulting program fails to capture most workload characteristics, such as microarchitectural bottlenecks or time-varying behavior. Generating code that mimics a complex application is an extremely hard problem. Instead, we propose a different and easier approach to benchmark synthesis. Our key insight is that, for many production workloads, the program is publicly available or there is a reasonably similar open-source program. In this case, generating the right dataset is sufficient to produce an accurate benchmark. Based on this observation, we present Datamime, a profile-guided approach to generate representative benchmarks for production workloads. Datamime uses the performance profiles of a target workload to generate a dataset that, when used by a benchmark program, behaves very similarly to the target workload in terms of its microarchitectural characteristics. We evaluate Datamime on several datacenter workloads. Datamime generates synthetic benchmarks that closely match the microarchitectural features of these workloads, with a mean absolute percentage error of 3.2% on IPC. Microarchitectural behavior stays close across processor types. Finally, time-varying behaviors are also replicated, making these benchmarks useful for, e.g., characterizing and optimizing tail latency.
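    The profile-guided loop can be caricatured in a few lines of Python (entirely illustrative: the real system measures hardware-counter profiles of a real program, whereas here a synthetic response surface stands in for the benchmark run, and all knob names are made up):

import numpy as np

rng = np.random.default_rng(1)

def run_benchmark(knobs):
    """Stand-in for 'generate a dataset with these knobs (e.g. key skew,
    value size), run the public program on it, and record a profile
    vector (IPC, L2 MPKI, branch MPKI)'. Here: a synthetic surface."""
    skew, value_kb = knobs
    return np.array([2.0 - 0.8 * skew, 5.0 * value_kb, 1.0 + skew * value_kb])

def mape(a, b):
    """Mean absolute percentage error between two profile vectors."""
    return float(np.mean(np.abs(a - b) / np.abs(b)) * 100)

# Profile measured on the production workload (here: hidden "true" knobs).
target = run_benchmark((0.7, 1.3))

# Random search over the knob space, keeping the best-matching dataset.
best_knobs, best_err = None, float("inf")
for _ in range(2000):
    cand = (rng.uniform(0.0, 1.0), rng.uniform(0.1, 4.0))
    err = mape(run_benchmark(cand), target)
    if err < best_err:
        best_knobs, best_err = cand, err
print(best_knobs, f"{best_err:.2f}% MAPE")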