Title: Subscribing to big data at scale
Abstract: Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, application developers need either to heavily customize an existing passive Big Data system or to glue one together with systems like streaming engines and pub-sub services. Either choice requires significant effort and incurs additional overhead. In this paper, we present the BAD (Big Active Data) system as an end-to-end, out-of-the-box solution for this challenge. It is designed to preserve the merits of passive Big Data systems and introduces new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system’s performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a “glued” system.
Award ID(s):
1924694 1925610 1838222 1901379
PAR ID:
10370774
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Distributed and Parallel Databases
Volume:
40
Issue:
2-3
ISSN:
0926-8782
Page Range / eLocation ID:
p. 475-520
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Traditional big data infrastructures are passive in nature: they answer user requests to process and return data. In many applications, however, users need not only to analyze data but also to subscribe to and actively receive data of interest, based on their subscriptions. Their interest may include the incoming data's content as well as its relationships to other data. Moreover, data delivered to subscribers may need to be enriched with additional relevant and actionable information. To address this Big Active Data (BAD) challenge, we have advocated the need for building scalable BAD systems that continuously and reliably capture big data while enabling timely and automatic delivery of relevant and possibly enriched information to a large pool of subscribers. In this demo we showcase how to build an end-to-end active application using a BAD system and a standard email broker for data delivery. This includes enabling users to register their interests with the BAD system, ingesting and monitoring data, producing customized results, and delivering them to the appropriate subscribers. Through this example we demonstrate that, if a BAD approach is taken, even complex active data applications can be created easily and scale to many users, considerably limiting the effort required of application developers.
  2. In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing Big Data at scale, developers would have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Based on that, we introduce a new mechanism for enabling the sharing of Big Data at scale declaratively so that developers can easily create and provide data sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.
  3. As radio spectrum becomes increasingly scarce, coexistence and bidirectional sharing between active and passive systems become a crucial target. In the past, spectrum regulations conferred on radio astronomy a status on par with active services, thereby protecting its extreme sensitivity against any harmful interference. However, passive systems are likely to lose exclusive allocations as capacity constraints for active systems increase. The resulting increase in ambient radio frequency noise from various terrestrial and non-terrestrial emitters can only be mitigated with informed collaboration between active and passive users. While coexistence using time-division spectrum access has been proposed in the past, a more dynamic approach following the CBRS sharing principle promises greater spectral occupancy and efficiency, enabled by a spectrum access system capable of constantly monitoring the ambient RF environment. Instead of simply minimizing the potential for any “harmful” interference to passive users, the goal is to use good engineering to enable sharing between active and passive users. To this end, this research created a Software Defined Radio (SDR)-based testbed at the Hat Creek Radio Observatory to quantitatively characterize the radio-frequency environment and flag potential sources of radio frequency interference in the vicinity of the Allen Telescope Array. Sensor validation was carried out via data analysis of I/Q data collected in well-characterized RF bands. Results so far from ground and drone-based surveys are consistent with the expected sources of interference, based on both the deployment of static RF transmitters in the Hat Creek/Redding area and the interference detected in telescope observations themselves.
  4. Active restoration often aims to accelerate ecosystem recovery. However, active restoration may not be worthwhile if its effects are overwhelmed by changes that occur passively. Moreover, it can be challenging to separate the effects of passive processes, such as dispersal and natural succession, from active restoration efforts. We assess the 24‐year impact of actively restoring a Minnesota old‐field grassland via seed addition of native tallgrass prairie species. We compared the abundance of four functional plant groups in actively restored plots against abundances in three reference classes: (1) unrestored plots undergoing passive recovery within the same old field, (2) passively recovering plots in two nearby old fields of similar age and (3) a chronosequence of 21 old fields within the same landscape. Active restoration led to a higher abundance of native grasses and forbs in the 36 m² treatment plots. Seed addition was more effective if the original vegetation was first removed using herbicide, burning and tilling. However, long‐term conclusions about the efficacy of active restoration varied widely depending on the choice of reference class. In our small‐scale restoration experiment, native abundance was similarly high in both the actively restored and reference plots after 24 years, suggesting either (1) passive recovery or (2) local dispersal of native species from nearby treatment plots (i.e. cross‐contamination). In contrast, a comparison with two nearby reference fields suggested active restoration resulted in much higher native abundance relative to passive recovery. A smaller, positive effect was detected when we compared actively restored plots to the chronosequence of old fields.
    In the chronosequence, many passively recovering old fields had transitioned to native grass dominance naturally, although active restoration appeared to increase native forb abundance. Synthesis and applications: Our findings highlight the importance of using scale‐appropriate references for assessing the efficacy of and need for active restoration. Comparing actively restored plots with the surrounding landscape, we found that active restoration and passive recovery led to similar plant communities after 24 years. Because local dispersal from actively restored sites can contaminate nearby references, caution should be exercised when evaluating long‐term restoration projects using only small‐scale experiments.
  5. The popularity of JSON as a data interchange format has resulted in a large number of datasets available for processing. Users would like to analyze this data using SQL queries, but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schemas makes it challenging to parse arbitrary files in a modern distributed system while producing records with a unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL, which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter operations down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers.