skip to main content


Title: A HIGHLY SCALABLE DATA MANAGEMENT SYSTEM FOR POINT CLOUD AND FULL WAVEFORM LIDAR DATA
Abstract. The massive amounts of spatio-temporal information often present in LiDAR data sets make their storage, processing, and visualisation computationally demanding. There is an increasing need for systems and tools that support all the spatial and temporal components and the three-dimensional nature of these datasets for effortless retrieval and visualisation. In response to these needs, this paper presents a scalable, distributed database system that is designed explicitly for retrieving and viewing large LiDAR datasets on the web. The ultimate goal of the system is to provide rapid and convenient access to a large repository of LiDAR data hosted in a distributed computing platform. The system is composed of multiple, share-nothing nodes operating in parallel. Namely, each node is autonomous and has a dedicated set of processors and memory. The nodes communicate with each other via an interconnected network. The data management system presented in this paper is implemented based on Apache HBase, a distributed key-value datastore within the Hadoop eco-system. HBase is extended with new data encoding and indexing mechanisms to accommodate both the point cloud and the full waveform components of LiDAR data. The data can be consumed by any desktop or web application that communicates with the data repository using the HTTP protocol. The communication is enabled by a web servlet. In addition to the command line tool used for administration tasks, two web applications are presented to illustrate the types of user-facing applications that can be coupled with the data system.  more » « less
Award ID(s):
1826134
PAR ID:
10205031
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
ISPRS - International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences
Volume:
XLIII-B4-2020
ISSN:
2194-9034
Page Range / eLocation ID:
507 to 512
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A new approach to web application development is presented, in which an application is constructed by configuring and composing concepts drawn from a catalog developed by experts. A concept is a self-contained, reusable increment of functionality. Each concept includes both front-end and back-end functionality, and exports a collection of components—full-stack GUI elements, backed by application logic and database storage. To build an app, the developer imports concepts from the catalog, tunes them to fit the application’s particular needs via configuration variables, and links concept components together to create pages. Components of different concepts may be executed independently, or bound together declaratively with dataflows and synchronization. The instantiation, configuration, linking and binding of components is all expressed in a simple template language that extends HTML. The approach has been implemented in a platform called Déjà Vu, which we outline and compare to conventional web application architectures. We describe a case study in which a collection of applications previously built as team projects for a web programming course were replicated in Déjà Vu. Preliminary results validate our hypothesis, suggesting that a variety of non-trivial applications can be built from a repository of generic concepts. 
    more » « less
  2. null (Ed.)
    The ever rising volume of geospatial data is undeniable. So is the need to explore and analyze these datasets. However, these datasets vary widely in their size, coverage, and accuracy. Therefore, users need to assess these aspects of the data to choose the right dataset to use in their analysis. Unfortunately, all the publicly available repositories for geospatial datasets provide a list of datasets with some information about them with no way to explore the datasets beforehand. Through this demonstration, we propose the repository, UCR-Star, that is capable of hosting hundreds of thousands of geospatial datasets that a user can explore visually to judge their quality before even downloading them. This demo provides a deeper dive into the core engine behind UCR-Star. It provides a web interface geared towards database researchers to understand how the index internally works. It provides a comparison interface where the attendees can see side-by-side how two versions of the system work with the ability to customize each of them separately. Finally, the interface reports the response time of the indexes for a quantitative comparison. 
    more » « less
  3. In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer’s discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a direct acyclic graph (DAG) representing computations whose outputs should persist in the memory. Our experiment results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark decreasing the total work to recompute RDDs by 12%. 
    more » « less
  4. An increasing number of distributed applications operate by dispatching function invocations across the nodes of a distributed system. To operate correctly, the code and data dependencies of the function must be distributed along with the invocations in some way. When translating applications to work on large scale distributed systems, managing these dependencies becomes challenging: delivery must be scalable to thousands of nodes; the dependencies must be consistent across the system; and the method must be usable by an unprivileged developer. As a solution, in this paper we present PONCHO, which is a lightweight Python based toolkit which allows users to discover, package, and deploy dependencies as an integral part of distributed applications. PONCHO encapsulates a set of commands to be executed within an environment. PONCHO offers a lightweight solution to create and manage environments increasing the portability of scientific applications as well as reproducibility. In this paper, we evaluate PONCHO with real-world applications in the fields of physics, computational chemistry, and hyperparameter optimization, We observe the challenges that arise when creating and distributing an environment and measure the overheads that emerge as a result. 
    more » « less
  5. Database-backed web applications manipulate large amounts of persistent data, and such applications often contain constraints that restrict data length, data value, and other data properties. Such constraints are critical in ensuring the reliability and usability of these applications. In this paper, we present a comprehensive study on where data constraints are expressed, what they are about, how often they evolve, and how their violations are handled. The results show that developers struggle with maintaining consistent data constraints and checking them across different components and versions of their web applications, leading to various problems. Guided by our study, we developed checking tools and API enhancements that can automatically detect such problems and improve the quality of such applications. 
    more » « less