skip to main content

Title: When Directory Design Meets Data Explosion: Rethinking Query Performance for IoT
As IoT services scale up from single homes to smart cities, directories and mapping services are needed to manage potentially millions of devices. However, directory service providers will likely struggle to accommodate the increasing number of IoT devices, made more challenging by their heterogeneous metadata and the large volume of queries. One of the critical challenges, the high heterogeneity of IoT, is being addressed by a working standard of W3C, which formalizes a physical or virtual device as a formatted Thing Description (TD).We propose a local directory service architecture with a series of design requirements. With a focus on query performance, we build a proof-of-concept system to store metadata of IoT devices as TDs in terms of the working standard. A Raspberry Pi is configured to investigate the query performance of relational database and non-relational database as the classic choices for internal directories. Evaluation results demonstrate that compared with relational database, non-relational database can achieve 2.9 times higher resilience on property query and 2.35 times faster processing on spatial query, with mild loss on aggregation query.  more » « less
Award ID(s):
Author(s) / Creator(s):
Date Published:
Journal Name:
2020 International Symposium on Networks, Computers and Communications (ISNCC)
Page Range / eLocation ID:
1 to 6
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose and implement Directory-Based Access Control (DBAC), a flexible and systematic access control approach for geographically distributed multi-administration IoT systems. DBAC designs and relies on a particular module, IoT directory, to store device metadata, manage federated identities, and assist with cross-domain authorization. The directory service decouples IoT access into two phases: discover device information from directories and operate devices through discovered interfaces. DBAC extends attribute-based authorization and retrieves diverse attributes of users, devices, and environments from multi-faceted sources via standard methods, while user privacy is protected. To support resource-constrained devices, DBAC assigns a capability token to each authorized user, and devices only validate tokens to process a request. 
    more » « less
  2. Binder is a publicly accessible online service for executing interactive notebooks based on Git repositories. Binder dynamically builds and deploys containers following a recipe stored in the repository, then gives the user a browser-based notebook interface. The Binder group periodically releases a log of container launches from the public Binder service. Archives of launch records are available here. These records do not include identifiable information like IP addresses, but do give the source repo being launched along with some other metadata. The main content of this dataset is in the binder.sqlite file. This SQLite database includes launch records from 2018-11-03 to 2021-06-06 in the events table, which has the following schema.

    CREATE TABLE events( version INTEGER, timestamp TEXT, provider TEXT, spec TEXT, origin TEXT, ref TEXT, guessed_ref TEXT ); CREATE INDEX idx_timestamp ON events(timestamp);
    • version indicates the version of the record as assigned by Binder. The origin field became available with version 3, and the ref field with version 4. Older records where this information was not recorded will have the corresponding fields set to null.
    • timestamp is the ISO timestamp of the launch
    • provider gives the type of source repo being launched ("GitHub" is by far the most common). The rest of the explanations assume GitHub, other providers may differ.
    • spec gives the particular branch/release/commit being built. It consists of <github-id>/<repo>/<branch>.
    • origin indicates which backend was used. Each has its own storage, compute, etc. so this info might be important for evaluating caching and performance. Note that only recent records include this field. May be null.
    • ref specifies the git commit that was actually used, rather than the named branch referenced by spec. Note that this was not recorded from the beginning, so only the more recent entries include it. May be null.
    • For records where ref is not available, we attempted to clone the named reference given by spec rather than the specific commit (see below). The guessed_ref field records the commit found at the time of cloning. If the branch was updated since the container was launched, this will not be the exact version that was used, and instead will refer to whatever was available at the time (early 2021). Depending on the application, this might still be useful information. Selecting only records with version 4 (or non-null ref) will exclude these guessed commits. May be null.

    The Binder launch dataset identifies the source repos that were used, but doesn't give any indication of their contents. We crawled GitHub to get the actual specification files in the repos which were fed into repo2docker when preparing the notebook environments, as well as filesystem metadata of the repos. Some repos were deleted/made private at some point, and were thus skipped. This is indicated by the absence of any row for the given commit (or absence of both ref and guessed_ref in the events table). The schema is as follows.

    CREATE TABLE spec_files ( ref TEXT NOT NULL PRIMARY KEY, ls TEXT, runtime BLOB, apt BLOB, conda BLOB, pip BLOB, pipfile BLOB, julia BLOB, r BLOB, nix BLOB, docker BLOB, setup BLOB, postbuild BLOB, start BLOB );

    Here ref corresponds to ref and/or guessed_ref from the events table. For each repo, we collected spec files into the following fields (see the repo2docker docs for details on what these are). The records in the database are simply the verbatim file contents, with no parsing or further processing performed.

    • runtime: runtime.txt
    • apt: apt.txt
    • conda: environment.yml
    • pip: requirements.txt
    • pipfile: Pipfile.lock or Pipfile
    • julia: Project.toml or REQUIRE
    • r: install.R
    • nix: default.nix
    • docker: Dockerfile
    • setup:
    • postbuild: postBuild
    • start: start

    The ls field gives a metadata listing of the repo contents (excluding the .git directory). This field is JSON encoded with the following structure based on JSON types:

    • Object: filesystem directory. Keys are file names within it. Values are the contents, which can be regular files, symlinks, or subdirectories.
    • String: symlink. The string value gives the link target.
    • Number: regular file. The number value gives the file size in bytes.
    CREATE TABLE clean_specs ( ref TEXT NOT NULL PRIMARY KEY, conda_channels TEXT, conda_packages TEXT, pip_packages TEXT, apt_packages TEXT );

    The clean_specs table provides parsed and validated specifications for some of the specification files (currently Pip, Conda, and APT packages). Each column gives either a JSON encoded list of package requirements, or null. APT packages have been validated using a regex adapted from the repo2docker source. Pip packages have been parsed and normalized using the Requirement class from the pkg_resources package of setuptools. Conda packages have been parsed and normalized using the conda.models.match_spec.MatchSpec class included with the library form of Conda (distinct from the command line tool). Users might want to use these parsers when working with the package data, as the specifications can become fairly complex.

    The missing table gives the repos that were not accessible, and event_logs records which log files have already been added. These tables are used for updating the dataset and should not be of interest to users.

    more » « less
  3. The research data repository of the Environmental Data Initiative (EDI) is building on over 30 years of data curation research and experience in the National Science Foundation-funded US Long-Term Ecological Research (LTER) Network. It provides mature functionalities, well established workflows, and now publishes all ‘long-tail’ environmental data. High quality scientific metadata are enforced through automatic checks against community developed rules and the Ecological Metadata Language (EML) standard. Although the EDI repository is far along in making its data findable, accessible, interoperable, and reusable (FAIR), representatives from EDI and the LTER are developing best practices for the edge cases in environmental data publishing. One of these is the vast amount of imagery taken in the context of ecological research, ranging from wildlife camera traps to plankton imaging systems to aerial photography. Many images are used in biodiversity research for community analyses (e.g., individual counts, species cover, biovolume, productivity), while others are taken to study animal behavior and landscape-level change. Some examples from the LTER Network include: using photos of a heron colony to measure provisioning rates for chicks (Clarkson and Erwin 2018) or identifying changes in plant cover and functional type through time (Peters et al. 2020). Multi-spectral images are employed to identify prairie species. Underwater photo quads are used to monitor changes in benthic biodiversity (Edmunds 2015). Sosik et al. (2020) used a continuous Imaging FlowCytobot to identify and measure phyto- and microzooplankton. Cameras at McMurdo Dry Valleys assess snow and ice cover on Antarctic lakes allowing estimation of primary production (Myers 2019). It has been standard practice to publish numerical data extracted from images in EDI; however, the supporting imagery generally has not been made publicly available. Our goal in developing best practices for documenting and archiving these images is for them to be discovered and re-used. Our examples demonstrate several issues. The research questions, and hence, the image subjects are variable. Images frequently come in logical sets of time series. The size of such sets can be large and only some images may be contributed to a dedicated specialized repository. Finally, these images are taken in a larger monitoring context where many other environmental data are collected at the same time and location. Currently, a typical approach to publishing image data in EDI are packages containing compressed (ZIP or tar) files with the images, a directory manifest with additional image-specific metadata, and a package-level EML metadata file. Images in the compressed archive may be organized within directories with filenames corresponding to treatments, locations, time periods, individuals, or other grouping attributes. Additionally, the directory manifest table has columns for each attribute. Package-level metadata include standard coverage elements (e.g., date, time, location) and sampling methods. This approach of archiving logical ‘sets’ of images reduces the effort of providing metadata for each image when most information would be repeated, but at the expense of not making every image individually searchable. The latter may be overcome if the provided manifest contains standard metadata that would allow searching and automatic integration with other images. 
    more » « less
  4. We are storing and querying datasets with the private information of individuals at an unprecedented scale in settings ranging from IoT devices in smart homes to mining enormous collections of click trails for targeted advertising. Here, the privacy of the people described in these datasets is usually addressed as an afterthought, engineered on top of a DBMS optimized for performance. At best, these systems support security or managing access to sensitive data. This status quo has brought us a plethora of data breaches in the news. In response, governments are stepping in to enact privacy regulations such as the EU’s GDPR. We posit that there is an urgent need for trustworthy database system that offer end-to-end privacy guarantees for their records with user interfaces that closely resemble that of a relational database. As we shall see, these guarantees inform everything in the database’s design from how we store data to what query results we make available to untrusted clients. In this position paper we first define trustworthy database systems and put their research challenges in the context of relevant tools and techniques from the security community. We then use this backdrop to walk through the “life of a query” in a trustworthy database system. We start with the query parsing and follow the query’s path as the system plans, optimizes, and executes it. We highlight how we will need to rethink each step to make it efficient, robust, and usable for database clients. 
    more » « less
  5. Recent advances in Augmented Reality (AR) devices and their maturity as a technology offers new modalities for interaction between learners and their learning environments. Such capabilities are particularly important for learning that involves hands-on activities where there is a compelling need to: (a) make connections between knowledge-elements that have been taught at different times, (b) apply principles and theoretical knowledge in a concrete experimental setting, (c) understand the limitations of what can be studied via models and via experiments, (d) cope with increasing shortages in teaching-support staff and instructional material at the intersection of disciplines, and (e) improve student engagement in their learning. AR devices that are integrated into training and education systems can be effectively used to deliver just-in-time informatics to augment physical workspaces and learning environments with virtual artifacts. We present a system that demonstrates a solution to a critical registration problem and enables a multi-disciplinary team to develop the pedagogical content without the need for extensive coding. The most popular approach for developing AR applications is to develop a game using a standard game engine such as UNITY or UNREAL. These engines offer a powerful environment for developing a large variety of games and an exhaustive library of digital assets. In contrast, the framework we offer supports a limited range of human environment interactions that are suitable and effective for training and education. Our system offers four important capabilities – annotation, navigation, guidance, and operator safety. These capabilities are presented and described in detail. The above framework motivates a change of focus – from game development to AR content development. While game development is an intensive activity that involves extensive programming, AR content development is a multi-disciplinary activity that requires contributions from a large team of graphics designers, content creators, domain experts, pedagogy experts, and learning evaluators. We have demonstrated that such a multi-disciplinary team of experts working with our framework can use popular content creation tools to design and develop the virtual artifacts required for the AR system. These artifacts can be archived in a standard relational database and hosted on robust cloud-based backend systems for scale up. The AR content creators can own their content and Non-fungible Tokens to sequence the presentations either to improve pedagogical novelty or to personalize the learning. 
    more » « less