skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: LESS: A Matrix Split and Balance Algorithm for Parallel Circuit (Optical) or Hybrid Data Center Switching and More
The research problem of how to use a high-speed circuit switch, typically an optical switch, to most effectively boost the switching capacity of a datacenter network, has been extensively studied. In this work, we focus on a different but related research problem that arises when multiple (say $$s$$) parallel circuit switches are used: How to best split a switching workload $$D$$ into sub-workloads $$D_1, D_2, ..., D_s$$, and give them to the $$s$$ switches as their respective workloads, so that the overall makespan of the parallel switching system is minimized? Computing such an optimal split is unfortunately NP-hard, since the circuit/optical switch incurs a nontrivial reconfiguration delay when the switch configuration has to change. In this work, we formulate a weaker form of this problem: How to minimize the total number of nonzero entries in $$D_1, D_2, ..., D_s$$ (so that the overall reconfiguration cost can be kept low), under the constraint that every row or column sum of $$D$$ (which corresponds to the workload imposed on a sending or receiving rack respectively) is evenly split? Although this weaker problem is still NP-hard, we are able to design LESS, an approximation algorithm that has a low approximation ratio of only $$1+\epsilon$$ in practice and a low computational complexity of only $O(m^2)$, where $$m = \|D\|_0$$ is the number of nonzero entries in $$D$$. Our simulation studies show that LESS results in excellent overall makespan performances under realistic datacenter traffic workloads and parameter settings.  more » « less
Award ID(s):
1909048 1717947
PAR ID:
10167954
Author(s) / Creator(s):
;  ;
Date Published:
Journal Name:
IEEE/ACM 12th International Conference on Utility and Cloud Computing (UCC'19), December 2--5, 2019, Auckland, New Zealand
Page Range / eLocation ID:
187 to 197
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The cumulative pebbling complexity of a directed acyclic graph G is defined as cc(G) = min_P ∑_i |P_i|, where the minimum is taken over all legal (parallel) black pebblings of G and |P_i| denotes the number of pebbles on the graph during round i. Intuitively, cc(G) captures the amortized Space-Time complexity of pebbling m copies of G in parallel. The cumulative pebbling complexity of a graph G is of particular interest in the field of cryptography as cc(G) is tightly related to the amortized Area-Time complexity of the Data-Independent Memory-Hard Function (iMHF) f_{G,H} [Joël Alwen and Vladimir Serbinenko, 2015] defined using a constant indegree directed acyclic graph (DAG) G and a random oracle H(⋅). A secure iMHF should have amortized Space-Time complexity as high as possible, e.g., to deter brute-force password attacker who wants to find x such that f_{G,H}(x) = h. Thus, to analyze the (in)security of a candidate iMHF f_{G,H}, it is crucial to estimate the value cc(G) but currently, upper and lower bounds for leading iMHF candidates differ by several orders of magnitude. Blocki and Zhou recently showed that it is NP-Hard to compute cc(G), but their techniques do not even rule out an efficient (1+ε)-approximation algorithm for any constant ε>0. We show that for any constant c > 0, it is Unique Games hard to approximate cc(G) to within a factor of c. Along the way, we show the hardness of approximation of the DAG Vertex Deletion problem on DAGs of constant indegree. Namely, we show that for any k,ε >0 and given a DAG G with N nodes and constant indegree, it is Unique Games hard to distinguish between the case that G is (e_1, d_1)-reducible with e_1=N^{1/(1+2 ε)}/k and d_1=k N^{2 ε/(1+2 ε)}, and the case that G is (e_2, d_2)-depth-robust with e_2 = (1-ε)k e_1 and d_2= 0.9 N^{(1+ε)/(1+2 ε)}, which may be of independent interest. Our result generalizes a result of Svensson who proved an analogous result for DAGs with indegree 𝒪(N). 
    more » « less
  2. Reconfigurable datacenter networks use fast optical circuit switches to provide high bandwidths at low cost, therefore emerging as a compelling alternative to packet switching. These switches offer micro- and nano-second reconfiguration, and reacting to demand at this time scale is infeasible. Proposed designs have therefore largely been oblivious, supporting arbitrary traffic patterns. However, this imposes a fundamental latency-throughput tradeoff that significantly limits the benefits of these switches. In this paper, we illustrate the feasibility of semi-oblivious reconfigurable datacenter networks that periodically adapt to large-scale structural patterns in traffic. We argue that such patterns are predictable in modern datacenters, that optimizing for them can provide latency-throughput scaling superior to oblivious designs, and that existing fast circuit-switched technologies support coarse-grained flexibility to adapt to these patterns. 
    more » « less
  3. Abstract In this paper, we show that $$\lambda (z_1) -\lambda (z_2)$$, $$\lambda (z_1)$$, and $$1-\lambda (z_1)$$ are all Borcherds products on $$X(2) \times X(2)$$. We then use the big CM value formula of Bruinier, Kudla, and Yang to give explicit factorization formulas for the norms of $$\lambda (\frac{d+\sqrt d}2)$$, $$1-\lambda (\frac{d+\sqrt d}2)$$, and $$\lambda (\frac{d_1+\sqrt{d_1}}2) -\lambda (\frac{d_2+\sqrt{d_2}}2)$$, with the latter under the condition $$(d_1, d_2)=1$$. Finally, we use these results to show that $$\lambda (\frac{d+\sqrt d}2)$$ is always an algebraic integer and can be easily used to construct units in the ray class field of $${\mathbb{Q}}(\sqrt{d})$$ of modulus $$2$$. In the process, we also give explicit formulas for a whole family of local Whittaker functions, which are of independent interest. 
    more » « less
  4. Benchmarks that closely match the behavior of production workloads are crucial to design and provision computer systems. However, current approaches fall short: First, open-source benchmarks use public datasets that cause different behavior from production workloads. Second, blackbox workload cloning techniques generate synthetic code that imitates the target workload, but the resulting program fails to capture most workload characteristics, such as microarchitectural bottlenecks or time-varying behavior. Generating code that mimics a complex application is an extremely hard problem. Instead, we propose a different and easier approach to benchmark synthesis. Our key insight is that, for many production workloads, the program is publicly available or there is a reasonably similar open-source program. In this case, generating the right dataset is sufficient to produce an accurate benchmark. Based on this observation, we present Datamime, a profile-guided approach to generate representative benchmarks for production workloads. Datamime uses the performance profiles of a target workload to generate a dataset that, when used by a benchmark program, behaves very similarly to the target workload in terms of its microarchitectural characteristics. We evaluate Datamime on several datacenter workloads. Datamime generates synthetic benchmarks that closely match the microarchitectural features of these workloads, with a mean absolute percentage error of 3.2% on IPC. Microarchitectural behavior stays close across processor types. Finally, time-varying behaviors are also replicated, making these benchmarks useful to e.g. characterize and optimize tail latency 
    more » « less
  5. Despite advances in network security, attacks targeting mission critical systems and applications remain a significant problem for network and datacenter providers. Existing telemetry platforms detect volumetric attacks at terabit scales using approximation techniques and coarse grain analysis. However, the prevalence of low and slow attacks that require very little bandwidth, makes flow-state tracking critical to overall attack mitigation. Traffic queries deployed on network switches are often limited by hardware constraints, preventing them from carrying out flow tracking features required to detect stealthy attacks. Such attacks can go undetected in the midst of high traffic volumes. We design SmartWatch, a novel flow state tracking and flow logging system at line rate, using SmartNICs to optimize performance and simultaneously detect a number of stealthy attacks. SmartWatch leverages advances in switch based network telemetry platforms to process the bulk of the traffic and only forward suspicious traffic subsets to the SmartNIC. The programmable network switches perform coarse-grained traffic analysis while the SmartNIC conducts the finer-grained analysis which involves additional processing of the packet as a 'bump-in-the-wire'. A control loop between the SmartNIC and programmable switch tunes the queries performed in the switch to direct the most appropriate traffic subset to the SmartNIC. SmartWatch's cooperative monitoring approach yields 2.39 times better detection rate compared to existing platforms deployed on programmable switches. SmartWatch can detect covert timing channels and perform website fingerprinting more efficiently compared to standalone programmable switch solutions, relieving switch memory and control-plane processor resources. Compared to host-based approaches, SmartWatch can reduce the packet processing latency by 72.32%. 
    more » « less