Inspired by prior work suggesting that undetected errors were becoming a problem on the Internet, we set out to create a measurement system to detect errors that the TCP checksum missed. We designed a client-server framework in which servers sent known files to clients; the received data was then compared with the original file to identify undetected errors introduced by the network. We deployed this measurement framework on various public testbeds and, over the course of 9 months, transferred a total of 26 petabytes of data. Scaling the measurement framework to capture a large number of errors proved to be a challenge. This paper focuses on the challenges encountered during the deployment of the measurement system. We also present interim results, which suggest that the error problems seen in prior work may be caused by two distinct processes: (1) errors that slip past TCP and (2) file system failures. The interim results also suggest that the measurement system needs to be adjusted to collect exabytes of measurement data, rather than the petabytes that prior studies predicted.
Looking for Errors TCP Misses
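As a rough illustration of the comparison step described in the abstract above, the sketch below hashes a received file and checks it against the digest of the known original; a mismatch flags corruption that slipped past TCP's checksum. This is our own minimal simplification, not the authors' measurement system, and the function names and chunk size are assumptions.

```python
# Minimal sketch of a client-side integrity check against a known file
# (illustrative only; not the paper's actual client-server framework).
import hashlib

CHUNK = 1 << 20  # read in 1 MiB blocks (arbitrary choice)

def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 and return its hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(CHUNK):
            h.update(block)
    return h.hexdigest()

def transfer_is_clean(received_path: str, reference_digest: str) -> bool:
    """True if the received file matches the known original's digest."""
    return sha256_of(received_path) == reference_digest
```

In the setting the abstract describes, the reference digest would come from the server's known file, and any mismatch would be recorded for later analysis of whether the error slipped past TCP or arose in the file system.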
- PAR ID:
- 10647724
- Publisher / Repository:
- IEEE
- Date Published:
- Page Range / eLocation ID:
- 1 to 6
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Ensuring the integrity of petabyte-scale file transfers is essential for the data gathered from scientific instruments. As packet sizes increase, so does the likelihood of errors, and with it the probability that an error in a packet goes undetected. This paper presents a Multi-Level Error Detection (MLED) framework that leverages in-network resources to reduce the undetected error probability (UEP) in file transmission. MLED is based on a configurable recursive architecture that organizes communication into layers at different levels, decoupling network functions such as error detection, routing, addressing, and security. Each layer L_ij at level i implements a policy P_ij that governs its operation, including the error detection mechanism used, specific to the scope of that layer. MLED can be configured to mimic the error detection mechanisms of existing large-scale file transfer protocols. An analysis of MLED's recursive structure shows that adding levels of error detection reduces the overall UEP. An adversarial error model is designed to introduce errors into files that evade detection by multiple error detection policies. In experiments on the FABRIC testbed, the traditional approach, with transport- and data-link-layer error detection, results in a corrupt file transfer that requires retransmission of the entire file. Using its recursive structure, an implementation of MLED detects and corrects these adversarial errors at intermediate levels inside the network, avoiding file retransmission under non-zero error rates. MLED therefore achieves a 100% gain in goodput over the traditional approach, reaching more than 800 Mbps on a single connection with no appreciable increase in delay.
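To make the layered idea concrete, here is a minimal two-level sketch of our own (not the MLED implementation; the chunk size, checksum choices, and refetch callback are assumptions): a per-chunk CRC32 at a lower level lets a receiver repair individual chunks without retransmitting the whole file, while a whole-file SHA-256 at a higher level catches anything the lower level misses.

```python
# Minimal two-level sketch of layered error detection (illustrative only;
# MLED's real policies, levels, and in-network placement are more general).
import hashlib
import zlib

CHUNK = 64 * 1024  # scope of the lower-level check (arbitrary)

def protect(data: bytes):
    """Sender side: per-chunk CRC32 plus a whole-file SHA-256."""
    chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
    crcs = [zlib.crc32(c) for c in chunks]
    return chunks, crcs, hashlib.sha256(data).hexdigest()

def receive(chunks, crcs, file_digest, refetch):
    """Receiver side: repair bad chunks individually, then verify the file."""
    repaired = []
    for i, (c, crc) in enumerate(zip(chunks, crcs)):
        if zlib.crc32(c) != crc:   # lower level catches the error...
            c = refetch(i)         # ...so only chunk i is resent
        repaired.append(c)
    data = b"".join(repaired)
    ok = hashlib.sha256(data).hexdigest() == file_digest  # higher-level check
    return data, ok
```

The design point this mirrors is that each added level of checking lowers the chance an error survives all levels, while keeping repairs local to the level that caught them.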
-
Configuration changes are among the dominant causes of failures in large-scale software system deployments. Given the velocity of configuration changes, typically hundreds to thousands of times daily in modern cloud systems, checking these changes is critical to prevent failures due to misconfigurations. Recent work has proposed configuration testing (Ctest), a technique that tests configuration changes together with the code that uses the changed configurations. Ctest can automatically generate a large number of ctests that can effectively detect misconfigurations, including those that are hard to detect with traditional techniques. However, running ctests can take a long time to detect misconfigurations. Inspired by traditional test-case prioritization (TCP), which aims to reorder test executions to speed up detection of regression code faults, we propose to apply TCP to reorder ctests to speed up detection of misconfigurations. We extensively evaluate a total of 84 traditional and novel ctest-specific TCP techniques. The experimental results on five widely used cloud projects demonstrate that TCP can substantially speed up misconfiguration detection. Our study provides guidelines for applying TCP to configuration testing in practice.
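As a toy illustration of prioritizing ctests (our own heuristic sketch, not one of the 84 techniques the paper evaluates; the class fields and scoring are assumptions), the code below runs tests that exercise the changed configuration parameters first, breaking ties by past failure count and then by execution time.

```python
# Toy ctest prioritization sketch (illustrative; not the paper's techniques).
from dataclasses import dataclass

@dataclass
class Ctest:
    name: str
    params_read: set          # configuration parameters the ctest exercises
    past_failures: int = 0    # historical failure count
    runtime_s: float = 1.0    # historical execution time in seconds

def prioritize(ctests, changed_params):
    """Most overlap with changed params first, then most failures, then cheapest."""
    def key(t: Ctest):
        overlap = len(t.params_read & changed_params)
        return (-overlap, -t.past_failures, t.runtime_s)
    return sorted(ctests, key=key)

# Hypothetical example: a change to "dfs.replication" promotes the test that reads it.
tests = [
    Ctest("testHeartbeat", {"dfs.heartbeat.interval"}, runtime_s=0.5),
    Ctest("testBlockPlacement", {"dfs.replication"}, past_failures=2, runtime_s=4.0),
]
ordered = prioritize(tests, {"dfs.replication"})
```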
-
Test-case prioritization (TCP) aims to detect regression bugs faster by reordering the tests that are run. While TCP has been studied for over 20 years, it has almost always been evaluated using seeded faults/mutants rather than real test failures. In this work, we study the recent change-aware information retrieval (IR) technique for TCP. Prior work has shown it performing better than traditional coverage-based TCP techniques, but it was only evaluated on a small-scale dataset with a cost-unaware metric based on seeded faults/mutants. We extend the prior work by conducting a much larger and more realistic evaluation as well as proposing enhancements that substantially improve the performance. In particular, we evaluate the original technique on a large-scale, real-world software-evolution dataset with real failures using both cost-aware and cost-unaware metrics under various configurations. We also design and evaluate hybrid techniques combining the IR features, historical test execution time, and test failure frequencies. Our results show that the change-aware IR technique outperforms state-of-the-art coverage-based techniques in this real-world setting, and our hybrid techniques improve even further upon the original IR technique. Moreover, we show that flaky tests have a substantial impact on evaluating change-aware TCP techniques based on real test failures.
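The flavor of such a hybrid can be sketched roughly as follows (a simplification under assumed inputs, not the paper's IR technique): an IR-style relevance score from token overlap between the change text and each test, boosted by historical failure frequency and discounted by execution time to make the ordering cost-aware.

```python
# Rough sketch of a hybrid change-aware prioritization score (assumed inputs;
# the paper's IR technique and hybrids are more sophisticated).
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Crude lexical tokenization of source text or a change diff."""
    return Counter(re.findall(r"[A-Za-z_]\w+", text.lower()))

def ir_score(change_text: str, test_text: str) -> float:
    """Token-overlap relevance between a code change and a test."""
    c, t = tokens(change_text), tokens(test_text)
    return sum(min(c[w], t[w]) for w in c) / (1 + sum(t.values()))

def hybrid_rank(change_text, tests):
    """tests: iterable of (name, test_source, past_failures, runtime_s)."""
    def score(t):
        name, src, fails, runtime = t
        return ir_score(change_text, src) * (1 + fails) / max(runtime, 1e-6)
    return sorted(tests, key=score, reverse=True)
```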
-
Science DMZs are specialized networks that enable large-scale distributed scientific research, providing efficient and guaranteed performance while transferring large amounts of data at high rates. The high-speed performance of a Science DMZ is made viable by data transfer nodes (DTNs), which are therefore a critical point of failure. DTNs are usually monitored with network intrusion detection systems (NIDS). However, NIDS do not consider system performance data, such as network I/O interrupts and context switches, which can also be useful in revealing anomalous system performance potentially arising from external network-based attacks or insider attacks. In this paper, we demonstrate how system performance metrics can be applied toward securing a DTN in a Science DMZ network. Specifically, we evaluate the effectiveness of system performance data in detecting TCP-SYN flood attacks on a DTN using DBSCAN, a density-based clustering algorithm, for anomaly detection. Our results demonstrate that system interrupts and context switches can be used to successfully detect TCP-SYN floods, suggesting that system performance data could be effective in detecting a variety of attacks not easily detected through network monitoring alone.
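A minimal version of that detection step might look like the sketch below (assuming NumPy and scikit-learn are available; the feature choice, scaling, and eps/min_samples values are our assumptions, not the authors' configuration): per-interval counts of interrupts and context switches are clustered with DBSCAN, and intervals left unclustered (label -1) are treated as anomalous, e.g., during a TCP-SYN flood.

```python
# Minimal sketch of DBSCAN-based anomaly detection on system performance
# metrics (illustrative parameters; not the paper's exact setup).
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def find_anomalies(samples: np.ndarray, eps: float = 0.5, min_samples: int = 5):
    """samples: rows of [interrupts_per_sec, context_switches_per_sec]."""
    scaled = StandardScaler().fit_transform(samples)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)
    return np.where(labels == -1)[0]   # indices DBSCAN leaves unclustered

# Hypothetical example: a sudden spike stands out as a noise point.
normal = np.random.default_rng(0).normal([2000, 5000], [100, 200], size=(200, 2))
flood = np.array([[20000, 60000]])     # hypothetical SYN-flood interval
print(find_anomalies(np.vstack([normal, flood])))
```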