Title: Bridging BAD Islands: Declarative Data Sharing at Scale
In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To share Big Data at scale, developers currently have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data-sharing services requires a significant amount of work from developers. In our prior work, we developed a Big Active Data (BAD) system that enables Big Data subscriptions and analytics with millions of subscribers. Building on that work, we introduce a new mechanism for sharing Big Data at scale declaratively, so that developers can easily create and provide data-sharing services using declarative statements and can benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data-sharing data flow among multiple systems, and present a prototype system with experimental results.
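To make the declarative approach concrete, below is a minimal sketch of how a developer might issue such data-sharing statements to a BAD-style service over HTTP. The endpoint URL, dataset and function names, and the exact statement syntax are illustrative assumptions loosely modeled on the channel-and-broker style described in the BAD literature, not the paper's exact interface.

```python
# Minimal sketch: issuing declarative data-sharing statements to a
# BAD-style service over HTTP. Endpoint, names, and statement syntax
# are illustrative assumptions, not the paper's exact API.
import requests

BAD_ENDPOINT = "http://localhost:19002/query/service"  # assumed endpoint

def execute(statement: str) -> dict:
    """POST one declarative statement and return the JSON response."""
    resp = requests.post(BAD_ENDPOINT, data={"statement": statement})
    resp.raise_for_status()
    return resp.json()

# A broker represents the receiving side (e.g., another organization's system).
execute('CREATE BROKER partnerBroker AT "http://partner.example.org/notify";')

# A channel periodically evaluates a parameterized function and pushes new
# results to subscribed brokers, instead of waiting to be queried.
execute("""
  CREATE REPETITIVE CHANNEL recentReportsChannel
  USING recentReportsNear@1 PERIOD duration("PT10S");
""")

# Subscribing connects a broker (and its end users) to the channel.
execute('SUBSCRIBE TO recentReportsChannel ("Irvine") ON partnerBroker;')
```

With statements like these, the scalable evaluation, result matching, and delivery are left to the underlying BAD infrastructure rather than to hand-written server programs.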
Award ID(s):
1924694 1925610
PAR ID:
10293600
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
2002 to 2011
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, application developers need either to heavily customize an existing passive Big Data system or to glue one together with systems like Streaming Engines and Pub-sub services. Either choice requires significant effort and incurs additional overhead. In this paper, we present the BAD (Big Active Data) system as an end-to-end, out-of-the-box solution for this challenge. It is designed to preserve the merits of passive Big Data systems and introduces new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system's performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a "glued" system.
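As a rough illustration of the "glued" alternative this abstract warns about, the sketch below shows the kind of polling-and-forwarding loop a developer would otherwise have to write and operate. All names (poll_new_records, publish, the 10-second cadence) are hypothetical placeholders, not part of the paper.

```python
# Sketch of a hand-built "glued" active service: poll a passive store,
# then push matches to a pub-sub layer. Everything here is a placeholder.
import time

def poll_new_records(since: float) -> list[dict]:
    """Placeholder for a timestamp-filtered query against a passive store."""
    return []

def publish(topic: str, record: dict) -> None:
    """Placeholder for handing a record to a pub-sub service."""
    print(f"[{topic}] {record}")

def glue_loop() -> None:
    last_seen = time.time()
    while True:
        for record in poll_new_records(last_seen):
            # Matching records to subscriber interests, enriching them, and
            # retrying failed deliveries all become the developer's problem.
            publish("interested-users", record)
        last_seen = time.time()
        time.sleep(10)  # polling interval trades latency against load
```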
  2. Traditional big data infrastructures are passive in nature, passively answering user requests to process and return data. In many applications, however, users not only need to analyze data, but also to subscribe to and actively receive data of interest, based on their subscriptions. Their interest may include the incoming data's content as well as its relationships to other data. Moreover, data delivered to subscribers may need to be enriched with additional relevant and actionable information. To address this Big Active Data (BAD) challenge, we have advocated the need for building scalable BAD systems that continuously and reliably capture big data while enabling timely and automatic delivery of relevant and possibly enriched information to a large pool of subscribers. In this demo we showcase how to build an end-to-end active application using a BAD system and a standard email broker for data delivery. This includes enabling users to register their interests with the BAD system, ingesting and monitoring data, and producing customized results and delivering them to the appropriate subscribers. Through this example we demonstrate that even complex active data applications can be created easily and scale to many users, considerably limiting the effort of application developers, if a BAD approach is taken.
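The delivery step of such a demo might look like the following sketch, which hands an enriched, subscriber-specific result to a standard email broker. The SMTP host, addresses, and message content are illustrative assumptions; the demo's actual broker configuration is not specified in the abstract.

```python
# Minimal sketch of email-broker delivery for a BAD-style application.
# Host, addresses, and content are assumed for illustration.
import smtplib
from email.message import EmailMessage

def deliver_result(subscriber_email: str, enriched_result: str) -> None:
    msg = EmailMessage()
    msg["Subject"] = "New data matching your subscription"
    msg["From"] = "bad-service@example.org"
    msg["To"] = subscriber_email
    msg.set_content(enriched_result)
    # Hand the customized result to an email broker assumed to run locally.
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(msg)

deliver_result("alice@example.org", "3 new reports near Irvine in the last 10s")
```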
  3. Configuration space complexity makes big-data software systems hard to configure well. Consider Hadoop: with over nine hundred parameters, developers often just use the default configurations provided with Hadoop distributions, and the opportunity costs in lost performance are significant. Popular learning-based approaches to auto-tuning software do not scale well for big-data systems because of the high cost of collecting training data. We present a new method, based on a combination of Evolutionary Markov Chain Monte Carlo (EMCMC) sampling and cost reduction techniques, to find better-performing configurations for big data systems. For cost reduction, we developed and experimentally tested and validated two approaches: using scaled-up big data jobs as proxies for the objective function for larger jobs, and using a dynamic job similarity measure to infer that results obtained for one kind of big data problem will work well for similar problems. Our experimental results suggest that our approach can improve the performance of big data systems significantly and frugally, and that it outperforms competing approaches based on random sampling, basic genetic algorithms (GA), and predictive model learning.
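A toy sketch of the EMCMC idea follows: evolve a population of configurations with mutation and crossover moves, accept proposals with a Metropolis-style rule, and evaluate each configuration on a cheap proxy job rather than the full workload. The parameter space and cost function here are invented for illustration; the paper's actual setting involves hundreds of Hadoop parameters.

```python
# Toy EMCMC sketch: population-based MCMC over a configuration space,
# with an invented parameter space and a placeholder proxy cost.
import math
import random

PARAM_RANGES = {  # hypothetical stand-ins for Hadoop-style parameters
    "map_tasks": (1, 64),
    "sort_mb": (32, 1024),
    "compress": (0, 1),
}

def random_config() -> dict:
    return {k: random.randint(lo, hi) for k, (lo, hi) in PARAM_RANGES.items()}

def mutate(cfg: dict) -> dict:
    child = dict(cfg)
    k = random.choice(list(PARAM_RANGES))
    lo, hi = PARAM_RANGES[k]
    child[k] = random.randint(lo, hi)
    return child

def proxy_cost(cfg: dict) -> float:
    """Placeholder for timing a scaled proxy job under this configuration."""
    return abs(cfg["map_tasks"] - 16) + abs(cfg["sort_mb"] - 256) / 32

def emcmc(steps: int = 200, pop_size: int = 8, temp: float = 1.0) -> dict:
    population = [random_config() for _ in range(pop_size)]
    costs = [proxy_cost(c) for c in population]
    for _ in range(steps):
        i = random.randrange(pop_size)
        if random.random() < 0.3:
            # Evolutionary move: crossover with another chain in the population.
            j = random.randrange(pop_size)
            candidate = {k: random.choice([population[i][k], population[j][k]])
                         for k in PARAM_RANGES}
        else:
            candidate = mutate(population[i])
        c_cost = proxy_cost(candidate)
        # Metropolis acceptance: always take improvements, sometimes accept
        # worse configurations to keep exploring the space.
        if c_cost < costs[i] or random.random() < math.exp((costs[i] - c_cost) / temp):
            population[i], costs[i] = candidate, c_cost
    best = min(range(pop_size), key=costs.__getitem__)
    return population[best]

print(emcmc())
```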
  4. VREs (virtual research environments) are well suited to supporting many aspects of FAIR because they provide a workspace for collaboration and for sharing data, simulations, and workflows. The FAIR for VRE Working Group has worked on a checklist for measuring FAIRness in science gateways. This list considers which target group is addressed (developers or users) and the granularity involved, such as VREs as software frameworks, services, APIs, workflows, data, and simulations. We assume not only that VREs as software frameworks are FAIR, but also that they are FAIR-enabling for the digital objects they contain. The objective of this session is to explore how to recognize and incentivize providers, developers, and users who are actively working towards FAIRness of digital objects. The idea for this session is to address this via badges. It probably makes sense to split the badges across the four principles: Findable, Accessible, Interoperable, and Reusable. Many open questions remain beyond this granularity, such as how to create badges, who awards them, and what the rules are for how long a badge remains valid.
  5. Large-scale traffic simulations are necessary for the planning, design, and operation of city-scale transportation systems. These simulations enable novel and complex transportation technologies and services, such as optimizing traffic control systems, supporting on-demand transit, and redesigning regional transit systems for better energy efficiency and emissions. For a city-wide simulation model, big data from multiple sources, such as Open Street Map (OSM), traffic surveys, geo-location traces, vehicular traffic data, and transit details, are integrated to create a unique and accurate representation. However, in order to accurately identify the model structure and obtain reliable simulation results, these traffic simulation models must be thoroughly calibrated and validated against real-world data. This paper presents a novel calibration approach for a city-scale traffic simulation model based on limited real-world speed data. The simulation model runs a realistic microscopic and mesoscopic traffic simulation of Chattanooga, TN (US) for a 24-hour period and includes various transport modes such as transit buses, passenger cars, and trucks. Our approach utilizes 2160 real-world speed data points, performs a sensitivity analysis of the simulation model with respect to its input parameters, and applies a genetic algorithm to optimize the model for calibration. The experimental results demonstrate the effectiveness of our approach for calibrating large-scale traffic networks using only real-world speed data.
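The sketch below illustrates the general shape of GA-based calibration as described in this abstract: search over simulation input parameters to minimize the error between simulated and observed speeds. The run_simulation function, the parameter names, and the observed data are placeholders; the paper calibrates an actual Chattanooga model against 2160 real-world speed measurements.

```python
# Toy GA calibration sketch: tune simulation parameters to minimize RMSE
# against observed speeds. Simulator, parameters, and data are stand-ins.
import random

OBSERVED_SPEEDS = [random.uniform(20, 60) for _ in range(24)]  # stand-in data

def run_simulation(params: dict) -> list[float]:
    """Placeholder for a 24-hour traffic simulation returning hourly speeds."""
    return [params["free_flow_speed"] * (1 - params["demand_scale"] * 0.3)
            for _ in range(24)]

def rmse(sim: list[float], obs: list[float]) -> float:
    return (sum((s - o) ** 2 for s, o in zip(sim, obs)) / len(obs)) ** 0.5

def random_params() -> dict:
    return {"free_flow_speed": random.uniform(30, 80),
            "demand_scale": random.uniform(0.5, 1.5)}

def calibrate(generations: int = 50, pop_size: int = 20) -> dict:
    pop = [random_params() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: rmse(run_simulation(p), OBSERVED_SPEEDS))
        survivors = pop[: pop_size // 2]            # selection: keep best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            child = {k: random.choice([a[k], b[k]]) for k in a}  # crossover
            if random.random() < 0.2:                            # mutation
                k = random.choice(list(child))
                child[k] *= random.uniform(0.9, 1.1)
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda p: rmse(run_simulation(p), OBSERVED_SPEEDS))

print(calibrate())
```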