skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Friday, July 12 until 2:00 AM ET on Saturday, July 13 due to maintenance. We apologize for the inconvenience.

Title: Efficient LiDAR point cloud data encoding for scalable data management within the Hadoop eco-system
This paper introduces a novel LiDAR point cloud data encoding solution that is compact, flexible, and fully supports distributed data storage within the Hadoop distributed computing environment. The proposed data encoding solution is developed based on Sequence File and Google Protocol Buffers. Sequence File is a generic splittable binary file format built in the Hadoop framework for storage of arbitrary binary data. The key challenge in adopting the Sequence File format for LiDAR data is in the strategy for effectively encoding the LiDAR data as binary sequences in a way that the data can be represented compactly, while allowing necessary mutation. For that purpose, a data encoding solution, based on Google Protocol Buffers (a language-neutral, cross-platform, extensible data serialisation framework) was developed and evaluated. Since neither of the underlying technologies is sufficient to completely and efficiently represent all necessary point formats for distributed computing, an innovative fusion of them was required to provide a viable data storage solution. This paper presents the details of such a data encoding implementation and rigorously evaluates the efficiency of the proposed data encoding solution. Benchmarking was done against a straightforward, naive text encoding implementation using a high-density aerial LiDAR scan of a portion of Dublin, Ireland. The results demonstrated a 6-times reduction in data volume, a 4-times reduction in database ingestion time, and up to a 5 times reduction in querying time.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
2019 IEEE International Conference on Big Data
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Abstract. The massive amounts of spatio-temporal information often present in LiDAR data sets make their storage, processing, and visualisation computationally demanding. There is an increasing need for systems and tools that support all the spatial and temporal components and the three-dimensional nature of these datasets for effortless retrieval and visualisation. In response to these needs, this paper presents a scalable, distributed database system that is designed explicitly for retrieving and viewing large LiDAR datasets on the web. The ultimate goal of the system is to provide rapid and convenient access to a large repository of LiDAR data hosted in a distributed computing platform. The system is composed of multiple, share-nothing nodes operating in parallel. Namely, each node is autonomous and has a dedicated set of processors and memory. The nodes communicate with each other via an interconnected network. The data management system presented in this paper is implemented based on Apache HBase, a distributed key-value datastore within the Hadoop eco-system. HBase is extended with new data encoding and indexing mechanisms to accommodate both the point cloud and the full waveform components of LiDAR data. The data can be consumed by any desktop or web application that communicates with the data repository using the HTTP protocol. The communication is enabled by a web servlet. In addition to the command line tool used for administration tasks, two web applications are presented to illustrate the types of user-facing applications that can be coupled with the data system. 
    more » « less
  2. With the rapid increase of available digital data, we are searching for a storage media with high density and capability of long-term preservation. Deoxyribonucleic Acid (DNA) storage is identified as such a promising candidate, especially for archival storage systems. However, the encoding density (i.e., how many binary bits can be encoded into one nucleotide) and error handling are two major factors intertwined in DNA storage. Considering encoding density, theoretically, one nucleotide (i.e., A, T, G, or C) can encode two binary bits (upper bound). However, due to biochemical constraints and other necessary information associated with payload, currently the encoding densities of various DNA storage systems are much less than this upper bound. Additionally, all existing studies of DNA encoding schemes are based on static analysis and really lack the awareness of dynamically changed digital patterns. Therefore, the gap between the static encoding and dynamic binary patterns prevents achieving a higher encoding density for DNA storage systems. In this paper, we propose a new Digital Pattern-Aware DNA storage system, called DP-DNA, which can efficiently store digital data in the DNA storage with high encoding density. DP-DNA maintains a set of encoding codes and uses a digital pattern-aware code (DPAC) to analyze the patterns of a binary sequence for a DNA strand and selects an appropriate code for encoding the binary sequence to achieve a high encoding density. An additional encoding field is added to the DNA encoding format, which can distinguish the encoding scheme used for those DNA strands, and thus we can decode DNA data back to its original digital data. Moreover, to further improve the encoding density, a variable-length scheme is proposed to increase the feasibility of the code scheme with a high encoding density. Finally, the experimental results indicate that the proposed DP-DNA achieves up to 103.5% higher encoding densities than prior work. 
    more » « less
  3. Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for col- umn data storage that provides several of these benefits out of the box. However, geospatial data is not readily supported by Parquet. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial geome- tries in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized to efficiently store geospatial coordinates stored in floating-point format. Third, it adds a light-weight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the light-weight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format. 
    more » « less
  4. null (Ed.)
    Distributed file systems present distinctive forensic challenges in comparison to traditional locally mounted file system volume. Storage device media can number in the thousands, and forensic investigations in this setting necessitate a tailored approach to data collection. The Hadoop Distributed File System (HFDS) produces and maintains partially persistent metadata that is pursuant with a logical volume, a file system, and file addresses on the centralized server. Hence, this research investigates the viability of using a residual central server digital artifact to generate a history model of the distributed file system. The history model affords an investigator a high-level perspective of low-level events to narrow investigative process obligations. The model is generated through set-theoretic relations of the file system essential data structure. Graph-theoretic ordering is applied to the events to provide a history model. The research contribution is a rapid reconstruction of the HDFS storage state transitions generating timelines for system events to forensically assess HDFS properties with conceptual similarity to traditional low-level file system forensic tool output. The results of this research provide a prototype tool, DFS3, for rapid and noninvasive data storage state timeline reconstruction in a big data distributed file system. 
    more » « less
  5. The majority of sensitive and personal user data is stored in different Database Management Systems (DBMS). For example, Oracle is frequently used to store corporate data, MySQL serves as the back-end storage for most webstores, and SQLite stores personal data such as SMS messages on a phone or browser bookmarks. Each DBMS manages its own storage (within the operating system), thus databases require their own set of forensic tools. While database carving solutions have been built by multiple research groups, forensic investigators today still lack the tools necessary to analyze DBMS forensic artifacts. The unique nature of database storage and the resulting forensic artifacts require established standards for artifact storage and viewing mechanisms in order for such advanced analysis tools to be developed. In this paper, we present 1) a standard storage format, Database Forensic File Format (DB3F), for database forensic tools output that follows the guidelines established by other (file system) forensic tools, and 2) a view and search toolkit, Database Forensic Toolkit (DF-Toolkit), that enables the analysis of data stored in our database forensic format. Using our prototype implementation, we demonstrate that our toolkit follows the state-of-the-art design used by current forensic tools and offers easy-to-interpret database artifact search capabilities. 
    more » « less