Random Access in Nondelimited Variable-length Record Collections for Parallel Reading with Hadoop

Anderson, Jason; Gropp, Christopher; Ngo, Linh; Apon, Amy


The industry-standard Packet CAPture (PCAP) format for storing network packet traces can normally be read only serially because it lacks delimiters, indexing, and blocking. This presents a challenge for parallel analysis of large networks, where packet traces can be many gigabytes in size. In this work we present RAPCAP, a novel method for random access into variable-length record collections such as PCAP that identifies a record boundary within a small number of bytes of the access point. Unlike related heuristic methods, whose nonzero probability of error can limit scalability, the new method offers a correctness guarantee for a well-formed file and does not rely on prior knowledge of the contents. We include a practical implementation of the algorithm with an extension to the Hadoop framework, and a performance comparison to serial ingestion. Finally, we present a number of similar storage types that could use a modified version of RAPCAP for random access.
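To illustrate why a classic PCAP file is normally read serially, the following minimal sketch walks the record headers of a small synthetic capture. It is not the RAPCAP algorithm, which this record does not describe in detail. It assumes the classic little-endian PCAP layout (a 24-byte global header followed by 16-byte per-record headers), and the helper names `build_pcap` and `record_offsets` are hypothetical.

```python
import struct

# Classic PCAP layout (assumed here): a 24-byte global header, then records,
# each with a 16-byte header (ts_sec, ts_usec, incl_len, orig_len).
PCAP_GLOBAL_HEADER = struct.Struct("<IHHiIII")
PCAP_RECORD_HEADER = struct.Struct("<IIII")


def build_pcap(payloads):
    """Build a small little-endian PCAP byte string from synthetic payloads."""
    # magic, version 2.4, thiszone, sigfigs, snaplen, link type (1 = Ethernet)
    out = bytearray(PCAP_GLOBAL_HEADER.pack(0xA1B2C3D4, 2, 4, 0, 0, 65535, 1))
    for i, payload in enumerate(payloads):
        # The record index stands in for a timestamp in this synthetic data.
        out += PCAP_RECORD_HEADER.pack(i, 0, len(payload), len(payload))
        out += payload
    return bytes(out)


def record_offsets(data):
    """Walk the file serially and return the offset of every record header."""
    offsets = []
    pos = PCAP_GLOBAL_HEADER.size
    while pos + PCAP_RECORD_HEADER.size <= len(data):
        _ts_sec, _ts_usec, incl_len, _orig_len = PCAP_RECORD_HEADER.unpack_from(data, pos)
        offsets.append(pos)
        # The next record starts only after this record's variable-length body.
        pos += PCAP_RECORD_HEADER.size + incl_len
    return offsets


if __name__ == "__main__":
    capture = build_pcap([b"a" * 10, b"b" * 3, b"c" * 100])
    print(record_offsets(capture))
```

In this sketch the offset of each record is known only after the length field of every earlier record has been read, and no delimiter marks where a record begins. A reader dropped at an arbitrary byte offset, as happens at the start of a Hadoop input split, therefore has no direct way to find the next boundary.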

Award ID(s): 1642542

PAR ID: 10028362

Author(s) / Creator(s): Anderson, Jason; Gropp, Christopher; Ngo, Linh; Apon, Amy

Date Published: 2017-05-01

Journal Name: The 2nd IFIP/IEEE International Workshop on Analytics for Network and Service Management (AnNet 2017)

Format(s): Medium: X

Sponsoring Org: National Science Foundation

The DOI is not currently available.
