

Title: Building an end-to-end BAD application
Traditional big data infrastructures are passive in nature, passively answering user requests to process and return data. In many applications, however, users not only need to analyze data but also to subscribe to and actively receive data of interest, based on their subscriptions. Their interest may include the incoming data's content as well as its relationships to other data. Moreover, data delivered to subscribers may need to be enriched with additional relevant and actionable information. To address this Big Active Data (BAD) challenge, we have advocated the need for building scalable BAD systems that continuously and reliably capture big data while enabling timely and automatic delivery of relevant and possibly enriched information to a large pool of subscribers. In this demo we showcase how to build an end-to-end active application using a BAD system and a standard email broker for data delivery. This includes enabling users to register their interests with the BAD system, ingesting and monitoring data, and producing customized results and delivering them to the appropriate subscribers. Through this example we demonstrate that, with a BAD approach, even complex active data applications can be created easily and scaled to many users, considerably reducing the effort required of application developers.
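The delivery path sketched in the abstract can be approximated with very little glue code. Below is a minimal, illustrative sketch (not the demo's actual implementation) of such an email broker: an HTTP endpoint that accepts channel results pushed by the BAD system and relays them to a subscriber's email address. The endpoint path, the payload fields (subscriber_email, results), and the SMTP settings are assumptions made for illustration.

    # email_broker.py -- hedged sketch of a broker that relays BAD channel
    # results to subscribers via email. The payload shape and SMTP host are
    # illustrative assumptions, not the demo paper's actual interface.
    import json
    import smtplib
    from email.message import EmailMessage
    from http.server import BaseHTTPRequestHandler, HTTPServer

    SMTP_HOST = "localhost"  # assumed local SMTP relay

    class BrokerHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the JSON body pushed by the BAD system's channel.
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length))

            # Assumed payload: {"subscriber_email": ..., "results": [...]}
            msg = EmailMessage()
            msg["From"] = "bad-broker@example.org"
            msg["To"] = payload["subscriber_email"]
            msg["Subject"] = "New results for your subscription"
            msg.set_content(json.dumps(payload["results"], indent=2))

            with smtplib.SMTP(SMTP_HOST) as smtp:
                smtp.send_message(msg)

            self.send_response(200)
            self.end_headers()

    if __name__ == "__main__":
        # The BAD system would be configured to push channel results here.
        HTTPServer(("", 8080), BrokerHandler).serve_forever()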
Award ID(s):
1924694 1925610 1831615 1838222 1901379
NSF-PAR ID:
10293598
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the 15th ACM International Conference on Distributed and Event-based Systems
Page Range / eLocation ID:
184 to 187
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.

    Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus on passively answering queries from users, rather than actively collecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, application developers need either to heavily customize an existing passive Big Data system or to glue one together with systems like Streaming Engines and Pub-sub services. Either choice requires significant effort and incurs additional overhead. In this paper, we present the BAD (Big Active Data) system as an end-to-end, out-of-the-box solution for this challenge. It is designed to preserve the merits of passive Big Data systems and introduces new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system's performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a "glued" system.
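    The BAD work builds on Apache AsterixDB, so the "passive" half of the story is an ordinary on-demand query against its HTTP query service. A minimal sketch follows, assuming a local AsterixDB instance on its default query port; the dataverse and dataset names are hypothetical.

    # passive_query.py -- sketch of the "passive" half: an on-demand query
    # against AsterixDB's HTTP query service (default port 19002). The
    # dataverse and dataset names are hypothetical.
    import requests

    ASTERIXDB = "http://localhost:19002/query/service"

    def run_query(statement: str) -> dict:
        # AsterixDB accepts SQL++ statements as form-encoded POST data.
        resp = requests.post(ASTERIXDB, data={"statement": statement})
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        # A passive request: results come back once, only when asked.
        # An active (BAD) channel would instead push new matches to a broker.
        result = run_query(
            "SELECT VALUE t FROM TweetDataverse.Tweets t "
            "WHERE t.text LIKE '%emergency%' LIMIT 5;"
        )
        print(result["results"])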

     
  2.
    In many Big Data applications today, information needs to be actively shared between systems managed by different organizations. To enable sharing of Big Data at scale, developers currently have to create dedicated server programs and glue together multiple Big Data systems for scalability. Developing and managing such glued data sharing services requires significant effort from developers. In our prior work, we developed a Big Active Data (BAD) system for enabling Big Data subscriptions and analytics with millions of subscribers. Building on that work, we introduce a new mechanism for sharing Big Data at scale declaratively, so that developers can easily create and provide data sharing services using declarative statements and benefit from an underlying scalable infrastructure. We show our implementation on top of the BAD system, explain the data sharing data flow among multiple systems, and present a prototype system with experimental results.
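    For contrast, here is what the hand-written "glue" that this declarative mechanism is meant to replace might look like: a relay that polls one organization's system for new records and forwards them to a partner's ingestion endpoint. All endpoints and payload fields here are invented for illustration.

    # glue_relay.py -- hypothetical hand-written "glue" for cross-organization
    # data sharing, the kind of service the declarative BAD mechanism is
    # meant to replace. Endpoints and payload shapes are invented.
    import time
    import requests

    SOURCE = "http://org-a.example.org/api/new-records"   # hypothetical
    PARTNER = "http://org-b.example.org/api/ingest"       # hypothetical

    def relay_forever(poll_seconds: float = 5.0) -> None:
        last_seen = 0
        while True:
            # Poll the source system for records newer than the last batch.
            resp = requests.get(SOURCE, params={"since": last_seen})
            resp.raise_for_status()
            records = resp.json()
            if records:
                # Forward the batch to the partner organization's system.
                requests.post(PARTNER, json=records).raise_for_status()
                last_seen = max(r["timestamp"] for r in records)
            time.sleep(poll_seconds)

    if __name__ == "__main__":
        relay_forever()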
  3. Billions of devices in the Internet of Things (IoT) are inter-connected over the internet and communicate with each other or with end users. IoT devices communicate through messaging bots, which are important in IoT systems for automating and better managing workflows. IoT devices are usually spread across many applications and can capture or generate a substantial influx of big data. Integrating IoT with cloud computing to handle and manage this big data requires considerable security measures to prevent cyber attackers from making adversarial use of such large amounts of data. An attacker can simply utilize the messaging bots to perform malicious activities on a number of devices, so bots pose serious cybersecurity hazards for IoT devices. Hence, it is important to detect the presence of malicious bots in the network. In this paper, we propose an evidence-theory-based approach for malicious bot detection. Evidence Theory, a.k.a. Dempster-Shafer Theory (DST), is a probabilistic reasoning tool with the ability to handle uncertainty, i.e., to reason even in the absence of complete evidence. It can be applied efficiently to identify a bot, especially when bots have dynamic or polymorphic behavior. A key characteristic of DST is that the detection system may not need any prior information about malicious signatures and profiles. In this work, we propose to analyze network flow characteristics to extract key evidence of bot traces. We then quantify these pieces of evidence using the Apriori algorithm and apply DST to detect the presence of bots.
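    To make the DST step concrete, here is a minimal sketch of Dempster's rule of combination over a two-hypothesis frame {bot, benign}. The mass values are invented for illustration; this is not the paper's actual feature pipeline.

    # dempster.py -- minimal Dempster-Shafer combination over the frame
    # {bot, benign}. Focal elements are frozensets; THETA (the full frame)
    # carries the uncommitted, "uncertain" mass. Values are illustrative.
    from itertools import product

    BOT = frozenset({"bot"})
    BENIGN = frozenset({"benign"})
    THETA = frozenset({"bot", "benign"})

    def combine(m1: dict, m2: dict) -> dict:
        """Dempster's rule: m(A) is proportional to the sum of
        m1(B) * m2(C) over all B, C with B & C == A."""
        combined: dict = {}
        conflict = 0.0
        for (b, mb), (c, mc) in product(m1.items(), m2.items()):
            a = b & c
            if a:
                combined[a] = combined.get(a, 0.0) + mb * mc
            else:
                conflict += mb * mc  # mass falling on the empty set
        # Normalize by the non-conflicting mass (1 - K).
        return {a: v / (1.0 - conflict) for a, v in combined.items()}

    if __name__ == "__main__":
        # Two pieces of network-flow evidence (mass values invented):
        flow_evidence = {BOT: 0.6, BENIGN: 0.1, THETA: 0.3}
        timing_evidence = {BOT: 0.5, BENIGN: 0.2, THETA: 0.3}
        belief = combine(flow_evidence, timing_evidence)
        print({tuple(sorted(k)): round(v, 3) for k, v in belief.items()})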
  4. Modern Internet of Things (IoT) applications, from contextual sensing to voice assistants, rely on ML-based training and serving systems that use pre-trained models to render predictions. However, real-world IoT environments are diverse, with a rich variety of IoT sensors, and need ML models to be personalized for each setting using relatively little training data. Most existing general-purpose ML systems are optimized for specific, dedicated hardware resources and do not adapt to changing resources or differing IoT application requirements. To address this gap, we propose MLIoT, an end-to-end machine learning system tailored to supporting the entire lifecycle of IoT applications. MLIoT adapts to different IoT data sources, IoT tasks, and compute resources by automatically training, optimizing, and serving models based on expressive application-specific policies. MLIoT also adapts to changes in IoT environments or compute resources by enabling re-training and updating the served models on the fly while maintaining accuracy and performance. Our evaluation across a set of benchmarks shows that MLIoT can handle multiple IoT tasks, each with individual requirements, in a scalable manner while maintaining high accuracy and performance. We compare MLIoT with two state-of-the-art hand-tuned systems and a commercial ML system, showing that MLIoT improves accuracy by 50% to 75% while reducing or maintaining latency.
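    As an illustration of the kind of policy-driven model selection such a system performs (the abstract does not specify MLIoT's API, so the Policy fields and candidate models below are invented), a sketch:

    # policy_select.py -- illustrative sketch of policy-driven model
    # selection for an IoT task, in the spirit of the MLIoT description
    # above. Not MLIoT's real API; all names and numbers are invented.
    from dataclasses import dataclass

    @dataclass
    class Candidate:
        name: str
        accuracy: float      # measured on the task's validation data
        latency_ms: float    # measured on the current hardware

    @dataclass
    class Policy:
        min_accuracy: float
        max_latency_ms: float

    def select_model(candidates: list, policy: Policy) -> Candidate:
        """Pick the most accurate candidate that satisfies the policy;
        fall back to the fastest one if none qualifies (a real system
        would trigger re-training at this point)."""
        ok = [c for c in candidates
              if c.accuracy >= policy.min_accuracy
              and c.latency_ms <= policy.max_latency_ms]
        if ok:
            return max(ok, key=lambda c: c.accuracy)
        return min(candidates, key=lambda c: c.latency_ms)

    if __name__ == "__main__":
        models = [
            Candidate("tiny-cnn", accuracy=0.88, latency_ms=12.0),
            Candidate("resnet-lite", accuracy=0.94, latency_ms=45.0),
            Candidate("full-resnet", accuracy=0.97, latency_ms=180.0),
        ]
        print(select_model(models, Policy(min_accuracy=0.9, max_latency_ms=60)))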
  5. Reddy, S.; Winter, J.S.; Padmanabhan, S. (Eds.)
    AI applications are poised to transform health care, revolutionizing benefits for individuals, communities, and health-care systems. As the articles in this special issue aptly illustrate, AI innovations in healthcare are maturing from early success in medical imaging and robotic process automation, promising a broad range of new applications. This is evidenced by the rapid deployment of AI to address critical challenges related to the COVID-19 pandemic, including disease diagnosis and monitoring, drug discovery, and vaccine development. At the heart of these innovations is the health data required for deep learning applications. Rapid accumulation of data, along with improved data quality, data sharing, and standardization, enables the development of deep learning algorithms in many healthcare applications.

    One of the great challenges for healthcare AI is effective governance of these data: ensuring thoughtful aggregation and appropriate access to fuel innovation and improve patient outcomes and healthcare system efficiency while protecting the privacy and security of data subjects. Yet the literature on data governance has rarely looked beyond important pragmatic issues related to privacy and security. Less consideration has been given to unexpected or undesirable outcomes of AI in healthcare, such as clinician deskilling, algorithmic bias, the "regulatory vacuum", and lack of public engagement. Amidst growing calls for ethical governance of algorithms, Reddy et al. developed a governance model for AI in healthcare delivery, focusing on principles of fairness, accountability, and transparency (FAT), and trustworthiness, and calling for wider discussion. Winter and Davidson emphasize the need to identify the underlying values of healthcare data and its use, noting the many competing interests and goals for use of health data, such as healthcare system efficiency and reform, patient and community health, intellectual property development, and monetization.

    Beyond the important considerations of privacy and security, governance must consider who will benefit from healthcare AI, and who will not. Whose values drive health AI innovation and use? How can we ensure that innovations are not limited to the wealthiest individuals or nations? As large technology companies begin to partner with health care systems, and as personally generated health data (PGHD) (e.g., fitness trackers, continuous glucose monitors, health information searches on the Internet) proliferate, who has oversight of these complex technical systems, which are essentially a black box?

    To tackle these complex and important issues, it is important to acknowledge that we have entered a new technical, organizational, and policy environment due to linked data, big data analytics, and AI. Data governance is no longer the responsibility of a single organization. Rather, multiple networked entities play a role, and responsibilities may be blurred. This also raises many concerns related to data localization and jurisdiction: who is responsible for data governance? In this emerging environment, data may no longer be effectively governed through traditional policy models or instruments.