

Title: Radix+: High-throughput georeferencing and data ingestion over voluminous and fast-evolving phenotyping sensor data
Summary: Remote sensing of plant traits and their environment facilitates non-invasive, high-throughput monitoring of plants' physiological characteristics. However, the voluminous observational data generated by such autonomous sensor networks overwhelm scientific users who must analyze them. Providing a scalable and effective analysis environment requires storage and analytics support for high-throughput data ingestion that preserves spatiotemporal and sensor-specific characteristics. The framework should also let modelers and scientists run their analytics while coping with the fast, continuously evolving nature of the dataset. In this paper, we present Radix+, a high-throughput distributed data storage system that supports scalable georeferencing and interactive, query-based spatiotemporal analytics with trackable data integrity. We include empirical evaluations performed on a commodity machine cluster with up to 1 TB of data. Our benchmarks demonstrate subsecond latency for the majority of our evaluated queries and improvements in data ingestion rates over systems such as GeoMesa.
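The abstract does not spell out Radix+'s internal key layout, so the following is only a minimal sketch of the general approach such spatiotemporal stores commonly take: encode each observation's location as a geohash, prefix it to the timestamp and sensor identifier to form a sortable composite key, and answer spatial queries as prefix range scans. The names `make_key` and `prefix_scan` are hypothetical, not part of Radix+.

```python
import bisect
from datetime import datetime, timezone

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash(lat, lon, precision=8):
    """Standard base-32 geohash: interleave longitude/latitude bisection bits."""
    lat_rng, lon_rng = [-90.0, 90.0], [-180.0, 180.0]
    code, ch, bit, even = [], 0, 0, True
    while len(code) < precision:
        rng, val = (lon_rng, lon) if even else (lat_rng, lat)
        mid = (rng[0] + rng[1]) / 2.0
        if val >= mid:
            ch = (ch << 1) | 1
            rng[0] = mid
        else:
            ch <<= 1
            rng[1] = mid
        even, bit = not even, bit + 1
        if bit == 5:                       # 5 bits -> one base-32 character
            code.append(_BASE32[ch])
            ch, bit = 0, 0
    return "".join(code)

def make_key(lat, lon, ts, sensor_id):
    """Hypothetical composite row key: geohash | timestamp | sensor id (sorts by space, then time)."""
    return f"{geohash(lat, lon)}|{ts.strftime('%Y%m%d%H%M%S')}|{sensor_id}"

def prefix_scan(sorted_keys, prefix):
    """Range scan over a sorted key list: return every key sharing a geohash prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\xff")
    return sorted_keys[lo:hi]

# Example: two phenotyping observations a few hundred meters apart.
keys = sorted([
    make_key(40.57, -105.08, datetime(2021, 6, 1, 12, 0, tzinfo=timezone.utc), "thermal-07"),
    make_key(40.58, -105.09, datetime(2021, 6, 1, 12, 5, tzinfo=timezone.utc), "thermal-07"),
])
print(prefix_scan(keys, geohash(40.57, -105.08, precision=4)))  # both fall in the same coarse cell
```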
Award ID(s):
1931363
NSF-PAR ID:
10448754
Author(s) / Creator(s):
Date Published:
Journal Name:
Concurrency and Computation: Practice and Experience
Volume:
35
Issue:
8
ISSN:
1532-0626
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The development and validation studies of new multisensory biomarkers and sensor-triggered interventions require collecting raw sensor data with associated labels in the natural field environment. Unlike platforms for traditional mHealth apps, a software platform for such studies must not only support high-rate data ingestion, but also share raw high-rate sensor data with researchers while supporting high-rate sense-analyze-act functionality in real time. We present mCerebrum, a realization of such a platform, which supports high-rate data collection from multiple sensors with real-time assessment of data quality. A scalable storage architecture (with near-optimal performance) ensures quick response despite rapidly growing data volume. Micro-batching and efficient sharing of data among multiple source and sink apps allow computations to be reused, enabling real-time computation of multiple biomarkers without saturating the CPU or memory. Finally, a reconfigurable, burden- and context-aware scheduler manages all prompts to participants. With a modular design currently spanning 23+ apps, mCerebrum provides a comprehensive ecosystem of system services and utility apps. The design of mCerebrum has evolved during its concurrent use in scientific field studies at ten sites spanning 106,806 person-days. Evaluations show that, compared with other platforms, mCerebrum's architecture and design choices support 1.5 times higher data rates and 4.3 times higher storage throughput while incurring 8.4 times lower CPU usage.
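mCerebrum's storage layer is only summarized above; the sketch below is a hedged illustration of the micro-batching idea it mentions: buffer high-rate samples in memory and flush them as one write once a size or age threshold is reached, so storage cost scales with batches rather than individual samples. The `MicroBatcher` class and its thresholds are illustrative, not mCerebrum's actual API.

```python
import time

class MicroBatcher:
    """Illustrative micro-batch buffer: accumulate samples, flush them as one write."""

    def __init__(self, flush_fn, max_samples=500, max_age_s=1.0):
        self.flush_fn = flush_fn           # called with a list of samples per flush
        self.max_samples = max_samples
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_ts = None

    def add(self, sample):
        if not self.buffer:
            self.first_ts = time.monotonic()
        self.buffer.append(sample)
        too_full = len(self.buffer) >= self.max_samples
        too_old = time.monotonic() - self.first_ts >= self.max_age_s
        if too_full or too_old:
            self.flush()

    def flush(self):
        if self.buffer:
            self.flush_fn(self.buffer)     # one storage write instead of hundreds
            self.buffer = []

# Example: a 64 Hz accelerometer stream lands on storage in ~500-sample chunks.
batches = []
batcher = MicroBatcher(flush_fn=batches.append)
for i in range(1200):
    batcher.add((i / 64.0, 0.01 * i))      # (timestamp, value) pairs
batcher.flush()
print(len(batches), "batched writes for 1200 samples")
```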
  2. Cloud computing has become a major approach to reproducing computational experiments. Yet two main difficulties remain in reproducing batch-based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, a.k.a. the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history queries, and 5) simple reproduction of existing executions in the same environment or a different one. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show that our toolkit achieves good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
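The toolkit's actual adapter interface is not reproduced here; the sketch below only illustrates the adapter design pattern the abstract cites: the pipeline logic is written against one abstract interface, and each cloud gets its own adapter, so switching providers does not touch the pipeline code. Class and method names are hypothetical, and the provider-specific bodies are stubs where real SDK calls (e.g., boto3 or the Azure SDK) would go.

```python
from abc import ABC, abstractmethod

class CloudAdapter(ABC):
    """Provider-agnostic interface the analytics pipeline is written against."""

    @abstractmethod
    def provision_cluster(self, nodes: int) -> str: ...
    @abstractmethod
    def submit_job(self, cluster_id: str, image: str, command: str) -> str: ...
    @abstractmethod
    def terminate(self, cluster_id: str) -> None: ...

class AWSAdapter(CloudAdapter):
    def provision_cluster(self, nodes):
        # A real implementation would call the AWS SDK (boto3) here; stubbed for illustration.
        return f"aws-cluster-{nodes}"
    def submit_job(self, cluster_id, image, command):
        return f"aws-job-on-{cluster_id}"
    def terminate(self, cluster_id):
        pass

class AzureAdapter(CloudAdapter):
    def provision_cluster(self, nodes):
        # A real implementation would call the Azure SDK here; stubbed for illustration.
        return f"azure-cluster-{nodes}"
    def submit_job(self, cluster_id, image, command):
        return f"azure-job-on-{cluster_id}"
    def terminate(self, cluster_id):
        pass

def run_experiment(cloud: CloudAdapter, image: str, command: str):
    """End-to-end run: provision, execute, terminate - identical on every provider."""
    cluster = cloud.provision_cluster(nodes=4)
    try:
        return cloud.submit_job(cluster, image, command)
    finally:
        cloud.terminate(cluster)

print(run_experiment(AWSAdapter(), "spark:3.3", "spark-submit job.py"))
print(run_experiment(AzureAdapter(), "spark:3.3", "spark-submit job.py"))
```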
  3. We describe ENRICHDB, a new DBMS technology designed for emerging domains (e.g., sensor-driven smart spaces and social media analytics) that require incoming data to be enriched using expensive functions prior to its usage. To support online processing, such enrichment is performed today outside of DBMSs, as a static data processing workflow prior to ingestion into a DBMS. Such a strategy can introduce a significant delay between the time data arrives and the time it is enriched and ingested into the DBMS, especially when the enrichment complexity is high. Enriching at ingestion can also waste resources if applications do not use or require all of the data to be enriched. ENRICHDB's design represents a significant departure from the above: we explore seamless integration of data enrichment throughout the data processing pipeline - at ingestion, triggered by events in the background, and progressively during query processing. The cornerstone of ENRICHDB is a powerful enrichment data and query model that encapsulates enrichment as an operator inside a DBMS, enabling it to co-optimize enrichment with query processing. This paper describes this data model and provides a summary of the system implementation.
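ENRICHDB's operator internals are described in the paper itself; as a hedged sketch of the progressive-enrichment idea, the generator below enriches tuples lazily while a query is evaluated, skips tuples that cheap filters eliminate, and memoizes results so the same expensive function never runs twice. The function and column names are illustrative, not ENRICHDB's API.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def expensive_enrich(image_id: str) -> str:
    """Stand-in for a costly enrichment function (e.g., an ML classifier)."""
    return "occupied" if hash(image_id) % 2 else "empty"

def enriching_scan(rows, predicate):
    """Query-time enrichment operator: only enrich rows the query actually touches."""
    for row in rows:
        if not predicate(row):             # cheap filters run first
            continue
        if "occupancy" not in row:         # enrich lazily, once, at query time
            row["occupancy"] = expensive_enrich(row["image_id"])
        yield row

raw = [{"image_id": f"cam{i}", "room": "2065" if i % 3 else "2059"} for i in range(9)]
hits = list(enriching_scan(raw, predicate=lambda r: r["room"] == "2065"))
print(len(hits), "rows enriched out of", len(raw), "ingested")
```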
  4. The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted at the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor in which established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and the associated analytics challenges. Albeit not fully developed yet, a CBDA mathematical framework will enable the study of the ergodic properties and asymptotics of the specific statistical inference approaches used within CBDA. We implemented the high-throughput CBDA method in pure R as well as within a graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer's disease. The CBDA approach may be customized to provide a generic representation of complex multimodal datasets and to support stable scientific inference for large, incomplete, and multi-source datasets.
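The CBDA implementation described above is in R; purely as a hedged illustration of the core loop (subsample cases and features with replacement, score each subsample, and aggregate how often each feature is selected), here is a small numpy version. The selection rule (top-k absolute correlation with the outcome) is a placeholder for whatever model-based or model-free learner CBDA actually plugs in.

```python
import numpy as np

rng = np.random.default_rng(0)

def cbda_feature_frequencies(X, y, iters=200, case_frac=0.3, feat_frac=0.3, top_k=5):
    """Count how often each feature lands in the top-k of a subsample-level ranking."""
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(iters):
        rows = rng.choice(n, size=int(case_frac * n), replace=True)   # cases, with replacement
        cols = rng.choice(p, size=int(feat_frac * p), replace=True)   # features, with replacement
        Xs, ys = X[np.ix_(rows, cols)], y[rows]
        # Placeholder learner: rank sampled features by |correlation| with the outcome.
        scores = np.abs([np.corrcoef(Xs[:, j], ys)[0, 1] for j in range(Xs.shape[1])])
        for j in np.argsort(-scores)[:top_k]:
            counts[cols[j]] += 1
    return counts / iters                  # selection frequency per original feature

# Synthetic check: only features 0-2 actually drive the outcome.
X = rng.normal(size=(400, 50))
y = X[:, 0] + 0.8 * X[:, 1] - 0.5 * X[:, 2] + rng.normal(scale=0.5, size=400)
freq = cbda_feature_frequencies(X, y)
print("top features by selection frequency:", np.argsort(-freq)[:5])
```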
  5. The adoption of big data analytics in healthcare applications is challenging not only because of the huge volume of data being analyzed, but also because of the heterogeneity and sensitivity of the data. Effective and efficient analysis and visualization of secure patient health records are needed, e.g., to find new trends in disease management, determine risk factors for diseases, and support personalized medicine. In this paper, we propose a novel community cloud architecture to help clinicians and researchers gain easier and broader access to data sets from multiple sources while ensuring that data providers' security compliance is not compromised. Our cloud-based system design, configured with cloudlet principles, ensures high-speed application processing and sufficiently scalable data analytics while adhering to security standards (e.g., HIPAA, NIST). Through a case study, we show how our community cloud architecture can be implemented along with best practices in an ophthalmology setting involving health big data (i.e., the Health Facts database, I2B2, Millennium) hosted in a campus cloud infrastructure featuring virtual desktop thin clients and appropriate Data Classification Levels in storage.