<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Tapis-CHORDS Integration: Time-Series Data Support in Science Gateway Infrastructure</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019 October</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10143800</idno>
					<idno type="doi"></idno>
					<title level='j'>Conference: Science Gateways 2019 At: San Diego, CA</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>SB Cleveland</author><author>A Jamthe</author><author>S Padhy</author><author>J Powell</author><author>J Stubbs</author><author>MD Daniels</author><author>SA Pierce</author><author>GA Jacobs</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The explosion of IoT devices and sensors in recent years has led to a demand for efficiently storing, processing and analyzing time-series data. Geoscience researchers use time-series data stores such as Hydroserver, VOEIS and CHORDS. Many of these tools require a great deal of infrastructure to deploy and expertise to manage and scale. The Tapis platform as a service (formerly known as Agave) supports researchers so that they are not responsible for the infrastructure and can focus on the science. The University of Hawaii (UH) and the Texas Advanced Computing Center (TACC) have collaborated to develop a new API integration that combines Tapis with the CHORDS time-series data service to support projects at both institutions for storing, annotating and querying time-series data. This new Streams API leverages the strengths of both the Tapis platform and the CHORDS service to enable capabilities for supporting time-series data streams that are not available in either tool alone. These new capabilities may be leveraged by Tapis-powered science gateways that need to handle spatially indexed time-series datasets for their researchers, as they have been at UH and TACC.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>The explosion of IoT devices and sensors in recent years has led to a demand for support of storing, processing and analyzing time-series data. Further, as more ML and AI systems are developed and come online, the need for reinforcement data and datasets that can be used for training, as well as input for driving events, is increasing. A number of technologies have sprung up for storing large amounts of log data (Elasticsearch, Splunk, Apache Flume) and for processing data streams (Apache Kafka, Apache Storm, Apache Flink). Geoscience researchers have been using time-series data stores such as the Hydroserver [1], the Virtual Observatory and Ecological Informatics System (VOEIS) <ref type="bibr">[2]</ref> and the Cloud-Hosted Real-time Data Service (CHORDS) <ref type="bibr">[3]</ref> for years. Many of these tools can be powerful but also require a great deal of infrastructure overhead to deploy and expertise to manage and scale.</p><p>Presented at Gateways 2019, San Diego, CA, USA, September 23-25, 2019. <ref type="url">https://osf.io/meetings/gateways2019/</ref> To support streaming data for science, UH and TACC have developed APIs and infrastructure, the Streams API, to enable the integration of streaming data workflows, storage and retrieval with temporal and spatial support. These features pave the way for automated data stream processing workflows upon ingestion, and they integrate the metadata for organization and robust data curation needed for dissemination and for discovery and re-use by the wider research community. In this paper, we present our work to integrate existing CHORDS services to support streaming sensor data in Tapis [4], <ref type="bibr">[5]</ref>. We start with an overview of Tapis metadata/data features and CHORDS and look at the synergy that can be enabled. 
We then present our new proof of concept for the Streams (Tapis-CHORDS integration) API and discuss the performance, challenges and opportunities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. BACKGROUND</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Tapis</head><p>Tapis is an open-source platform-as-a-service for hybrid cloud computing, data management and reproducible science. Tapis uses standards-based technologies and community-promoted best practices to enable users to run code, manage data, collaborate meaningfully, and integrate anywhere. Tapis is in production as the middleware that powers a number of community science gateways. Tapis is a multitenant, cloud-native distributed system. All services within the platform run as Docker containers, orchestrated as a single microservice architecture. The platform can be viewed as three logical tiers: platform services, science APIs, and support services. Platform services provide identity and API management, client registration, tenant admin services, and documentation. Science APIs contain the primary Science as a Service (ScaaS) functionality used to power science gateways, such as data and job management, app publishing, and notifications. This tier is further divided into a frontend collection of loosely coupled microservices and backend service workers. The microservices serve user HTTP requests to the REST APIs, and the service workers handle processing of asynchronous requests such as data movement and app publishing. Support services include databases, message queues, caches, object stores, service discovery, notification relays, and websocket proxies. A tenant represents a group of users, applications, and entitlements. Each instance of Tapis can support one or more tenants. Individual tiers and microservices can be replicated, configured, and scaled in support of a specific tenant, or shared to better leverage a single, consolidated resource footprint.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. CHORDS</head><p>CHORDS is a real-time data services infrastructure that provides an easy-to-use system for acquiring, navigating and distributing real-time data streams via cloud services and the Internet. CHORDS can lower the barrier to these services for small instrument teams, employ data and metadata formats that adhere to community-accepted standards, and broaden access to real-time data for the geosciences community. The CHORDS Portal is a Ruby on Rails web application and database that accepts real-time data from distributed instruments and serves the measurements to anyone on the Internet. The data streams are pushed to and pulled from the CHORDS Portal InfluxDB store using simple HTTP requests. Management tools allow users to monitor remote instruments, ensure correct operation, and maximize data collection. A rolling archive enables scientists and analysts to easily fetch the data in real time, delivered directly to browsers, programs and mobile apps.</p></div>
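<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the push side of this HTTP exchange, the following Python builds a GET URL that carries one timestamped set of sensor readings. The host name, the measurements/url_create path, the parameter names (instrument_id, at, key) and the variable short names (temp, rh) are assumptions made for illustration and should be checked against the documentation of the portal in use. The sketch only constructs the URL and does not send a request.</p>
<eg><![CDATA[
```python
from datetime import datetime, timezone
from urllib.parse import urlencode


def build_measurement_url(portal_base, instrument_id, values, at, api_key=None):
    """Return a GET URL that encodes one timestamped set of sensor readings.

    The path and the parameter names used here are assumptions for
    illustration, not a confirmed description of any portal's API.
    """
    params = {"instrument_id": instrument_id}
    # One query parameter per measured variable, e.g. {"temp": 21.4}.
    params.update(values)
    # Normalize the timestamp to UTC and format it as ISO 8601.
    params["at"] = at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Only add an access key when the caller supplies one.
    if api_key is not None:
        params["key"] = api_key
    return portal_base.rstrip("/") + "/measurements/url_create?" + urlencode(params)


# Hypothetical portal host and readings.
url = build_measurement_url(
    "http://chords.example.org",
    instrument_id=1,
    values={"temp": 21.4, "rh": 63.0},
    at=datetime(2019, 9, 23, 12, 0, 0, tzinfo=timezone.utc),
)
print(url)
```
]]></eg>
<p>Under these assumptions, every reading travels in the query string, so a constrained field device only needs to be able to issue a plain HTTP GET.</p></div>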
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Motivation</head><p>The need to integrate existing CHORDS services to support streaming sensor data in Tapis emerged at the 2018 workshop for the EarthCube Research Coordination Network for Intelligent Systems for Geosciences (IS-GEO RCN), to address use cases from domain scientists in data science, geoscience and informatics, particularly at UH and TACC. UH data scientists want the ability to develop reproducible, novel ML/AI models that leverage advanced CI to analyze big data and streaming data and to produce secondary data and knowledge products. Similarly, the Planet Texas 2050 group at TACC wishes to leverage spatially dense and ever-increasing temporal datasets generated by Arduino-based microcontrollers as ground-truth inputs for integrated models of water-land-atmosphere-urban systems. Deployments collect hydrological and atmospheric measurements to feed models describing various processes related to flooding and aquifer recharge. The Tapis-CHORDS integration emerged as a common solution to both institutional use cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. EARLY ADOPTERS/USE CASES</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Ike Wai Project</head><p>The Ike Wai Gateway is a water science gateway developed at the University of Hawaii (UH); Ike means knowledge and Wai means water in Hawaiian <ref type="bibr">[7]</ref>. The gateway supports research in hydrology and water management, providing tools to address questions of water sustainability in Hawaii. The gateway provides centralized web-based user interfaces and APIs supporting multi-domain data management, computation, analysis and visualization tools to support reproducible science, modeling, data discovery and decision support for the Hawaii EPSCoR Ike Wai research team and the wider Hawaii hydrology community. By leveraging the Tapis platform, UH has constructed a gateway that ties data and advanced computing resources together to support diverse research domains including microbiology, geochemistry, geophysics, economics and humanities, coupled with computational and modeling workflows delivered in a user-friendly web interface and REST APIs (Fig <ref type="figure">1</ref>). The Ike Wai project has deployed a number of sensors for measuring wells, rainfall and submarine groundwater discharge. Researchers desire the ability to access some of the data in real time and in a serialized, queryable manner. Current support for these data is via annotated log files and spreadsheets. Extracting a particular measurement type, for instance chloride concentration, requires manually performing ETL workflows to create a dataset that spans multiple spatial locations. The ability to subset and combine data across distributed spatial locations is a must for groundwater modeling and investigations.</p></div>
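<div xmlns="http://www.tei-c.org/ns/1.0"><p>The kind of query described here, pulling one measurement type across several sites, can be sketched in a few lines of Python. This is an illustration of the idea only, not the gateway's implementation: the record layout, the site names, the coordinates and every value below are synthetic and hypothetical.</p>
<eg><![CDATA[
```python
from datetime import datetime

# Synthetic records for illustration only; sites, coordinates and values are made up.
records = [
    {"site": "well_a", "lat": 21.30, "lon": -157.85, "variable": "chloride",
     "time": datetime(2019, 1, 1, 0, 0), "value": 110.0},
    {"site": "well_a", "lat": 21.30, "lon": -157.85, "variable": "water_level",
     "time": datetime(2019, 1, 1, 0, 0), "value": 3.2},
    {"site": "well_b", "lat": 21.45, "lon": -157.95, "variable": "chloride",
     "time": datetime(2019, 1, 1, 6, 0), "value": 95.0},
    {"site": "well_c", "lat": 19.70, "lon": -155.09, "variable": "chloride",
     "time": datetime(2019, 1, 1, 6, 0), "value": 130.0},
    {"site": "well_b", "lat": 21.45, "lon": -157.95, "variable": "chloride",
     "time": datetime(2019, 2, 1, 0, 0), "value": 97.0},
]


def subset(records, variable, bbox, start, end):
    """Select readings of one variable inside a lat/lon box and a time window.

    bbox is (min_lat, min_lon, max_lat, max_lon); the window is [start, end).
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    return [
        r for r in records
        if r["variable"] == variable
        and min_lat <= r["lat"] <= max_lat
        and min_lon <= r["lon"] <= max_lon
        and start <= r["time"] < end
    ]


# One day of chloride readings from every site inside a hypothetical bounding box.
selected = subset(
    records,
    variable="chloride",
    bbox=(21.2, -158.3, 21.8, -157.6),
    start=datetime(2019, 1, 1),
    end=datetime(2019, 1, 2),
)
for r in selected:
    print(r["site"], r["time"].isoformat(), r["value"])
```
]]></eg>
<p>With log files and spreadsheets, each of these three filters (variable, location, time) is a manual step; a queryable store would let them be expressed as a single request.</p></div>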
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Planet Texas 2050</head><p>By the year 2050, the population of the state of Texas is forecast to increase from 28 million to nearly 55 million residents. As a result, the effects of present utilization on the sustainability of natural resources (water, energy, and land use) must be modeled and made available to policymakers. The Planet Texas 2050 (PT2050) project is designed to address the knowledge and information needed to inform and support resilient responses in the face of identified vulnerabilities. The DataX Science Gateway, built on Tapis, is in development as part of the PT2050 initiative to provide a platform through which scientists, data analysts, and policymakers collaborate to generate cross-disciplinary environmental models. The scientists and analysts creating the hybridized models will have unique access to datasets, workflow generation tools, and collaborators historically partitioned across disciplines. The DataX Gateway enables data ingestion, data transformation and the composition of integrated models. Core capabilities within the data portal include tools for assimilating disparate datasets, pre-processing data sources for inclusion in integrated models, and sharing through the community, with access to large-scale resources including storage and computational capabilities at the Texas Advanced Computing Center. Generally, integrated models use static datasets. The purpose of this research was to explore a method by which real-time, in-situ environmental edge monitoring systems could stream data into backend models for processing. The real-time data serves as a ground-truth source of information for models and expands the spectrum of possible use cases the DataX Gateway could support.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. IMPLEMENTATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Tapis Metadata Changes</head><p>Tapis's Metadata service utilizes MongoDB as the database for storing metadata JSON documents in a Mongo collection</p></div>		</body>
		</text>
</TEI>
