skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on August 4, 2026

Title: MDTP—An Adaptive Multi-Source Data Transfer Protocol
Scientific data volume is growing, and the need for faster transfers is increasing. The community has used parallel transfer methods with multi-threaded and multi-source downloads to reduce transfer times. In multi-source transfers, a client downloads data from several replicated servers in parallel. Tools such as Aria2 and BitTorrent support this approach and show improved performance. This work introduces the Multi-Source Data Transfer Protocol, MDTP, which improves multi-source transfer performance further. MDTP divides a file request into smaller chunk requests and assigns the chunks across multiple servers. The system adapts chunk sizes based on each server’s performance and selects them so each round of requests finishes at roughly the same time. The chunk-size allocation problem is formulated as a variant of bin packing, where adaptive chunking fills the capacity “bins’’ of each server efficiently. Evaluation shows that MDTP reduces transfer time by 10–22% compared to Aria2. Comparisons with static chunking and BitTorrent show even larger gains. MDTP also distributes load proportionally across all replicas instead of relying only on the fastest one, which increases throughput. MDTP maintains high throughput even when latency increases or bandwidth to the fastest server drops.  more » « less
Award ID(s):
2430341 2126148 2019012
PAR ID:
10647698
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
1 to 9
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Replication is essential for fault-tolerance. However, in in-memory systems, it is a source of high overhead. Remote direct memory access (RDMA) is attractive to create redundant copies of data, since it is low-latency and has no CPU overhead at the target. However, existing approaches still result in redundant data copying and active receivers. To ensure atomic data transfers, receivers check and apply only fully received messages. Tailwind is a zero-copy recovery-log replication protocol for scale-out in-memory databases. Tailwind is the first replication protocol that eliminates all CPU-driven data copying and fully bypasses target server CPUs, thus leaving backups idle. Tailwind ensures all writes are atomic by leveraging a protocol that detects incomplete RDMA transfers. Tailwind substantially improves replication throughput and response latency compared with conventional RPC-based replication. In symmetric systems where servers both serve requests and act as replicas, Tailwind also improves normal-case throughput by freeing server CPU resources for request processing. We implemented and evaluated Tailwind on RAMCloud, a low-latency in-memory storage system. Experiments show Tailwind improves RAMCloud's normal-case request processing throughput by 1.7x. It also cuts down writes median and 99th percentile latencies by 2x and 3x respectively. 
    more » « less
  2. null (Ed.)
    Applications in cloud platforms motivate the study of efficient load balancing under job-server constraints and server heterogeneity. In this paper, we study load balancing on a bipartite graph where left nodes correspond to job types and right nodes correspond to servers, with each edge indicating that a job type can be served by a server. Thus edges represent locality constraints, i.e., an arbitrary job can only be served at servers which contain certain data and/or machine learning (ML) models. Servers in this system can have heterogeneous service rates. In this setting, we investigate the performance of two policies named Join-the-Fastest-of-the-Shortest-Queue (JFSQ) and Join-the-Fastest-of-the-Idle-Queue (JFIQ), which are simple variants of Join-the-Shortest-Queue and Join-the-Idle-Queue, where ties are broken in favor of the fastest servers. Under a "well-connected'' graph condition, we show that JFSQ and JFIQ are asymptotically optimal in the mean response time when the number of servers goes to infinity. In addition to asymptotic optimality, we also obtain upper bounds on the mean response time for finite-size systems. We further show that the well-connectedness condition can be satisfied by a random bipartite graph construction with relatively sparse connectivity. 
    more » « less
  3. Everyone puts things off sometimes. How can we combat this tendency to procrastinate? A well-known technique used by instructors is to break up a large project into more manageable chunks. But how should this be done best? Here we study the process of chunking using the graph-theoretic model of present bias introduced by Kleinberg and Oren [2014]. We first analyze how to optimally chunk single edges within a task graph, given a limited number of chunks. We show that for edges on the shortest path, the optimal chunking makes initial chunks easy and later chunks progressively harder. For edges not on the shortest path, optimal chunking is significantly more complex, but we provide an efficient algorithm that chunks the edge optimally. We then use our optimal edge-chunking algorithm to optimally chunk task graphs. We show that with a linear number of chunks on each edge, the biased agent’s cost can be exponentially lowered, to within a constant factor of the true cheapest path. Finally, we extend our model to the case where a task designer must chunk a graph for multiple types of agents simultaneously. The problem grows significantly more complex with even two types of agents, but we provide optimal graph chunking algorithms for two types. Our work highlights the efficacy of chunking as a means to combat present bias. 
    more » « less
  4. We consider large-scale load balancing systems where processing time distribution of tasks depend on both task and server types. We analyze the system in the asymptotic regime where the number of task and server types tend to infinity proportionally to each other. In such heterogeneous setting, popular policies like Join Fastest Idle Queue (JFIQ), Join Fastest Shortest Queue (JFSQ) are known to perform poorly and they even shrink the stability region. Moreover, to the best of our knowledge, in this setup, finding a scalable policy with provable performance guarantee has been an open question prior to this work. In this paper, we propose and analyze two asymptotically delay-optimal dynamic load balancing approaches: (a) one that efficiently reserves the processing capacity of each server for good tasks and route tasks under the Join Idle Queue policy; and (b) a speed-priority policy that increases the probability of servers processing tasks at a high speed. Introducing a novel analytical framework and using the mean-field method and stochastic coupling arguments, we prove that both policies above achieve asymptotic zero queueing, whereby the probability that a typical task is assigned to an idle server tends to 1 as the system scales. 
    more » « less
  5. Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts on accelerating it by segmenting a file and running the algorithm independently on each segment in parallel come at a cost of substantial degradation of deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC, that enables (almost) full parallelism on chunking of a file without compromising deduplication ratio. Further, SS-CDC exploits instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods which only achieve up to a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio. 
    more » « less