The development and validation of new multisensory biomarkers and sensor-triggered interventions require collecting raw sensor data with associated labels in the natural field environment. Unlike platforms for traditional mHealth apps, a software platform for such studies must not only support high-rate data ingestion but also share raw high-rate sensor data with researchers, while supporting high-rate sense-analyze-act functionality in real time. We present mCerebrum, a realization of such a platform, which supports high-rate data collection from multiple sensors with real-time assessment of data quality. A scalable storage architecture (with near-optimal performance) ensures quick response despite rapidly growing data volume. Micro-batching and efficient sharing of data among multiple source and sink apps allow reuse of computations, enabling real-time computation of multiple biomarkers without saturating the CPU or memory. Finally, it has a reconfigurable, burden- and context-aware scheduler that manages all prompts to participants. With a modular design currently spanning 23+ apps, mCerebrum provides a comprehensive ecosystem of system services and utility apps. The design of mCerebrum has evolved during its concurrent use in scientific field studies at ten sites spanning 106,806 person-days. Evaluations show that, compared with other platforms, mCerebrum's architecture and design choices support 1.5 times higher data rates and 4.3 times higher storage throughput, while causing 8.4 times lower CPU usage.
                    
                            
Radix+: High‐throughput georeferencing and data ingestion over voluminous and fast‐evolving phenotyping sensor data
                        
                    
    
Summary: Remote sensing of plant traits and their environment facilitates non‐invasive, high‐throughput monitoring of the plant's physiological characteristics. However, the voluminous observational data generated by such autonomous sensor networks overwhelms scientific users when they have to analyze it. To provide a scalable and effective analysis environment, there is a need for storage and analytics that support high‐throughput data ingestion while preserving spatiotemporal and sensor‐specific characteristics. The framework should also enable modelers and scientists to run their analytics while coping with the fast and continuously evolving nature of the dataset. In this paper, we present Radix+, a high‐throughput distributed data storage system that supports scalable georeferencing and interactive query‐based spatiotemporal analytics with trackable data integrity. We include empirical evaluations performed on a commodity machine cluster with up to 1 TB of data. Our benchmarks demonstrate subsecond latency for the majority of our evaluated queries and an improvement in data ingestion rate over systems such as GeoMesa.
        
    
- Award ID(s): 1931363
- PAR ID: 10448754
- Date Published:
- Journal Name: Concurrency and Computation: Practice and Experience
- Volume: 35
- Issue: 8
- ISSN: 1532-0626
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Cloud computing has become a major approach to help reproduce computational experiments. Yet there are still two main difficulties in reproducing batch-based big data analytics (including descriptive and predictive analytics) in the cloud. The first is how to automate end-to-end scalable execution of analytics, including distributed environment provisioning, analytics pipeline description, parallel execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, a.k.a. the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show that our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
- We describe ENRICHDB, a new DBMS technology designed for emerging domains (e.g., sensor-driven smart spaces and social media analytics) that require incoming data to be enriched using expensive functions prior to its usage. To support online processing, such enrichment is today performed outside of DBMSs, as a static data processing workflow prior to ingestion into a DBMS. Such a strategy could result in a significant delay between the time data arrives and the time it is enriched and ingested into the DBMS, especially when the enrichment complexity is high. Also, enriching at ingestion could waste resources if applications do not use or require all data to be enriched. ENRICHDB's design represents a significant departure from the above: we explore seamless integration of data enrichment throughout the data processing pipeline, at ingestion, triggered by events in the background, and progressively during query processing. The cornerstone of ENRICHDB is a powerful enrichment data and query model that encapsulates enrichment as an operator inside a DBMS, enabling it to co-optimize enrichment with query processing. This paper describes this data model and provides a summary of the system implementation.
- Classification of construction resource states using sensor data analytics has implications for improving informed decision-making for safety and productivity. However, training on sensor data analytics in construction education faces challenges owing to the complexity of analytical processes and the large stream of raw data involved. This research presents the development and user evaluation of ActionSens, a block-based end-user programming platform for training students from construction-related disciplines to classify resources using sensor data analytics. ActionSens was designed for construction students to perform sensor data analytics such as activity recognition in construction. ActionSens was compared to the traditional tools used for sensor data analytics (i.e., combining Excel and MATLAB) in terms of usability, workload, visual attention, and processing time, using the System Usability Scale, NASA Task Load Index, eye-tracking, and qualitative feedback. Twenty students participated, performing data analytics tasks with both approaches. ActionSens exhibited a better user experience than the conventional tools, with higher usability scores and lower cognitive workload. This was evident in participants' interaction behavior, which showed optimized allocation of attentional resources across key tasks. The study contributes to knowledge by illustrating how the integration of construction domain information into block-based programming environments can equip students with the necessary skills for sensor data analytics. The development of ActionSens contributes to the Learning-for-Use framework by employing graphical and interactive programming objects to foster procedural knowledge for addressing challenges in sensor data analytics. The formative evaluation provides insights into how students engage with the programming environment and assesses the impact of the environment on their cognitive load.
- The theoretical foundations of Big Data Science are not yet fully developed. This study proposes a new scalable framework for Big Data representation, high-throughput analytics (variable selection and noise reduction), and model-free inference. Specifically, we explore the core principles of distribution-free and model-agnostic methods for scientific inference based on Big Data sets. Compressive Big Data analytics (CBDA) iteratively generates random (sub)samples from a big and complex dataset. This subsampling with replacement is conducted on the feature and case levels and results in samples that are not necessarily consistent or congruent across iterations. The approach relies on an ensemble predictor where established model-based or model-free inference techniques are iteratively applied to preprocessed and harmonized samples. Repeating the subsampling and prediction steps many times yields derived likelihoods, probabilities, or parameter estimates, which can be used to assess the algorithm's reliability and the accuracy of findings via bootstrapping methods, or to extract important features via controlled variable selection. CBDA provides a scalable algorithm for addressing some of the challenges associated with handling complex, incongruent, incomplete, and multi-source data and analytics. Although not yet fully developed, a CBDA mathematical framework will enable the study of the ergodic properties and the asymptotics of the specific statistical inference approaches via CBDA. We implemented the high-throughput CBDA method using pure R as well as via the graphical pipeline environment. To validate the technique, we used several simulated datasets as well as a real neuroimaging-genetics case study of Alzheimer’s disease. The CBDA approach may be customized to provide generic representation of complex multimodal datasets and to provide stable scientific inference for large, incomplete, and multisource datasets.
 An official website of the United States government