Title: A Scalable Cloud-Based Analysis Platform for Survey Astronomy
Research in astronomy is undergoing a major paradigm shift, transformed by the advent of large, automated sky surveys into a data-rich field where multi-TB to PB-sized spatio-temporal data sets are commonplace. For example, the Legacy Survey of Space and Time (LSST) is about to begin delivering observations of >10^10 objects, including a database with >4 x 10^13 rows of time series data. This volume presents a challenge: how should a domain scientist with little experience in data management or distributed computing access data and perform analyses at PB scale? We present a possible solution to this problem, built on (adapted) industry-standard tools and made accessible through web gateways. We have (i) developed the Astronomy eXtensions for Spark (AXS), a series of astronomy-specific modifications to Apache Spark that allow astronomers to tap into its computational scalability; (ii) deployed datasets in AXS-queriable format in Amazon S3, leveraging its I/O scalability; (iii) developed a deployment of Spark on Kubernetes with auto-scaling configurations requiring no end-user interaction; and (iv) provided a web-accessible Jupyter notebook front end via JupyterHub, including a rich library of pre-installed common astronomical software (accessible at http://hub.dirac.institute). We use this system to enable the analysis of data from the Zwicky Transient Facility, presently the closest precursor survey to the LSST, and discuss initial results. To our knowledge, this is the first application of cloud-based scalable analytics to astronomical datasets approaching LSST scale. The code is available at https://github.com/astronomy-commons.
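AXS's scalability rests on partitioning catalogs into fixed-height declination zones (the Zones algorithm), so that a cross-match only compares spatially adjacent partitions instead of the whole catalog. A minimal pure-Python sketch of the zone bookkeeping (an illustration of the idea, not the actual AXS code; the one-arcminute default height is an assumption):

```python
import math

ZONE_HEIGHT_DEG = 1.0 / 60.0  # assumed default: one-arcminute zones

def zone_id(dec_deg: float, zone_height: float = ZONE_HEIGHT_DEG) -> int:
    """Map a declination to its zone index, as in the Zones algorithm.

    Rows sharing a zone (and nearby right ascension) land in the same
    partition, so a cross-match only compares a few neighboring zones.
    """
    return int(math.floor((dec_deg + 90.0) / zone_height))

def neighbor_zones(dec_deg: float, radius_deg: float,
                   zone_height: float = ZONE_HEIGHT_DEG) -> range:
    """Zones that a cone of the given radius around dec_deg can touch."""
    lo = zone_id(dec_deg - radius_deg, zone_height)
    hi = zone_id(dec_deg + radius_deg, zone_height)
    return range(lo, hi + 1)
```

With this layout, a match within a small radius touches only a handful of zones, which is what lets Spark parallelize the join across partitions.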
Award ID(s):
2003196
NSF-PAR ID:
10282752
Journal Name:
Gateways 2020
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    We present a scalable, cloud-based science platform solution designed to enable next-to-the-data analyses of terabyte-scale astronomical tabular data sets. The platform is built on Amazon Web Services (over Kubernetes and S3 abstraction layers), utilizes Apache Spark and the Astronomy eXtensions for Spark (AXS) for parallel data analysis and manipulation, and provides the familiar JupyterHub web-accessible front end for user access. We outline the architecture of the analysis platform, provide implementation details and the rationale for (and against) technology choices, verify scalability through strong and weak scaling tests, and demonstrate usability through an example science analysis of data from the Zwicky Transient Facility’s 1Bn+ light-curve catalog. Furthermore, we show how this system enables an end user to iteratively build analyses (in Python) that transparently scale processing with no need for end-user interaction. The system is designed to be deployable by astronomers with moderate cloud engineering knowledge, or (ideally) IT groups. Over the past three years, it has been utilized to build science platforms for the DiRAC Institute, the ZTF partnership, the LSST Solar System Science Collaboration, and the LSST Interdisciplinary Network for Collaboration and Computing, as well as for numerous short-term events (with over 100 simultaneous users). The deployment scripts, source code, cost calculators, and a live demo instance are accessible at:

    http://hub.astronomycommons.org/
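For illustration, a user notebook on such a platform might start a Spark session that reads a Parquet catalog straight from S3. This is a hedged configuration sketch (the bucket path and column names are hypothetical, and anonymous-read access is an assumption), not the platform's exact startup code:

```python
# Sketch of a user session on the platform; assumes pyspark is installed.
# The bucket path and column names below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ztf-lightcurves")
    # Read public data from S3 anonymously via the Hadoop S3A connector.
    .config("spark.hadoop.fs.s3a.aws.credentials.provider",
            "org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider")
    .getOrCreate()
)

# Load a Parquet catalog directly from object storage and filter it;
# Spark scans only the columns and row groups the query needs.
df = spark.read.parquet("s3a://example-bucket/ztf/lightcurves/")
bright = df.where(df["mag"] < 17.0).select("objectid", "mjd", "mag")
bright.show(5)
```

The same code runs unchanged whether the cluster has one executor or hundreds, which is what makes the transparent auto-scaling described above possible.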

     
  2.
    The Legacy Survey of Space and Time (LSST), operated by the Vera C. Rubin Observatory, is a 10-year astronomical survey, due to start operations in 2022, that will image half the sky every three nights. LSST will produce ~20 TB of raw data per night, which will be calibrated and analyzed in almost real time. Given the volume of LSST data, the traditional subset-download-process paradigm of data reprocessing faces significant challenges. We describe here the first steps towards a gateway for astronomical science that would enable astronomers to analyze images and catalogs at scale. In this first step, we focus on executing the Rubin LSST Science Pipelines, a collection of image and catalog processing algorithms, on Amazon Web Services (AWS). We describe our initial impressions of the performance, scalability, and cost of deploying such a system in the cloud.
  3. New international academic collaborations are being created at a fast pace, generating data sets on the order of terabytes in size each day. Often these data sets need to be moved in real time to a central location to be processed and then shared. In the field of astronomy, building data processing facilities in remote locations is not always feasible, creating the need for a high-bandwidth network infrastructure to transport these data sets over very long distances. This network infrastructure normally relies on multiple networks operated by multiple organizations or projects. Creating an end-to-end path involving multiple network operators, technologies, and interconnections often adds conditions that make the real-time movement of big data sets challenging. The Large Synoptic Survey Telescope (LSST) is an example of an astronomical application imposing new challenges on multi-domain network provisioning activities. The network for LSST is challenging for a number of reasons: (1) with the telescope in Chile and the archiving facility in the USA, the network has a high propagation delay, which affects the performance of traditional transport protocols; (2) the path is composed of multiple network operators, which means that the different network operating teams involved must coordinate technologies and protocols to support all parallel data transfers in an efficient way; (3) the large amount of data produced (12.7 GB/image) and the small interval available to transfer this data (5 seconds) to the archiving facility require special Quality of Service (QoS) policies; (4) because network events happen, the network needs to be prepared to be adjusted for "rainy days," when some data types will be prioritized over others. To guarantee that data transfers happen within the required interval, each network operator in the path needs to apply QoS policies to each of its network links. These policies need to be coordinated end-to-end and, in the case where the network is affected by parallel events, all policies might need to be dynamically reconfigured in real time to accommodate specific QoS policies for rainy days. Reconfiguring QoS policies is a very complex activity with current network protocols and technologies, sometimes requiring human intervention. This presentation aims to share the efforts to guarantee an efficient network configuration capable of handling LSST data transfers on sunny and rainy days across multiple network operators from South to North America.
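The figures quoted above (12.7 GB per image, 5 seconds per transfer) fix the sustained rate the end-to-end path must deliver; a quick back-of-the-envelope check (assuming decimal gigabytes):

```python
image_size_gb = 12.7   # GB per image (from the abstract)
interval_s = 5.0       # seconds available to move each image

# Convert gigabytes per interval into gigabits per second.
rate_gbps = image_size_gb * 8 / interval_s
print(f"required sustained rate ≈ {rate_gbps:.1f} Gbps")  # ≈ 20.3 Gbps
```

Any QoS policy along the path must therefore reserve on the order of 20 Gbps end-to-end, before protocol overhead, which is why per-link coordination among all operators is unavoidable.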
  4. Abstract

    Light echoes (LEs) are the reflections of astrophysical transients off interstellar dust. They are fascinating astronomical phenomena that enable studies of the scattering dust as well as of the original transients. LEs, however, are rare and extremely difficult to detect, as they appear as faint, diffuse, time-evolving features. The detection of LEs still largely relies on human inspection of images, a method unfeasible in the era of large synoptic surveys. The Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) will generate an unprecedented amount of astronomical imaging data at high spatial resolution, with exquisite image quality, and over tens of thousands of square degrees of sky: an ideal survey for LEs. However, the Rubin data processing pipelines are optimized for the detection of point sources and will entirely miss LEs. Over the past several years, artificial intelligence (AI) object-detection frameworks have achieved and surpassed real-time, human-level performance. In this work, we leverage a data set from the Asteroid Terrestrial-impact Last Alert System telescope to test a popular AI object-detection framework, You Only Look Once (YOLO), developed by the computer-vision community, to demonstrate the potential of AI for the detection of LEs in astronomical images. We find that an AI framework can reach human-level performance even with a size- and quality-limited data set. We explore and highlight challenges, including class imbalance and label incompleteness, and map out the work required to build an end-to-end pipeline for the automated detection and study of LEs in high-throughput astronomical surveys.
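One standard mitigation for the class imbalance noted above is inverse-frequency loss weighting, so that the rare class (light echoes) contributes proportionally more to the training loss. A minimal sketch with hypothetical label counts (illustrative only, not the counts or method from this work):

```python
from collections import Counter

# Hypothetical label counts: light echoes are rare relative to background.
counts = Counter({"light_echo": 120, "background": 8000})

total = sum(counts.values())
n_classes = len(counts)

# Inverse-frequency weights: weight = total / (n_classes * class_count),
# so the rare class gets a proportionally larger loss multiplier.
weights = {cls: total / (n_classes * n) for cls, n in counts.items()}
```

With these counts the rare class is up-weighted by roughly 34x while the abundant class is down-weighted to about 0.5x, balancing their total contribution to the loss.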

     
  5. Abstract

    We present the Local Volume Complete Cluster Survey (LoVoCCS; we pronounce it as “low-vox” or “law-vox,” with stress on the second syllable), an NSF National Optical-Infrared Astronomy Research Laboratory survey program that uses the Dark Energy Camera to map the dark matter distribution and galaxy population in 107 nearby (0.03 < z < 0.12), X-ray luminous (L_X,500 [0.1–2.4 keV] > 10^44 erg s^−1) galaxy clusters that are not obscured by the Milky Way. The survey will reach Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST) Year 1–2 depth (for galaxies: r = 24.5, i = 24.0 at signal-to-noise ratio (S/N) > 20; u = 24.7, g = 25.3, z = 23.8 at S/N > 10) and conclude in ~2023 (coincident with the beginning of LSST science operations), serving as a zeroth-year template for LSST transient studies. We process the data using the LSST Science Pipelines, which include state-of-the-art algorithms, and analyze the results using our own pipelines; the catalogs and analysis tools will therefore be compatible with the LSST. We demonstrate the use and performance of our pipeline using three X-ray luminous and observation-time-complete LoVoCCS clusters: A3911, A3921, and A85. A3911 and A3921 have not been well studied previously by weak lensing, and for A85 we obtain lensing analysis results similar to those of previous studies. (We mainly use A3911 to show our pipeline and give more examples in the Appendix.)
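The quoted S/N thresholds translate directly into photometric precision through the standard approximation sigma_m ≈ (2.5 / ln 10) / (S/N). A small sketch of the conversion (illustrative, not code from the survey pipeline):

```python
import math

def mag_uncertainty(snr: float) -> float:
    """Approximate photometric uncertainty (mag) for a given S/N,
    via sigma_m ≈ (2.5 / ln 10) / (S/N) ≈ 1.0857 / (S/N)."""
    return 2.5 / math.log(10) / snr

# The quoted depths correspond roughly to:
print(round(mag_uncertainty(20), 3))  # r, i at S/N > 20 -> ≈ 0.054 mag
print(round(mag_uncertainty(10), 3))  # u, g, z at S/N > 10 -> ≈ 0.109 mag
```

So the S/N > 20 bands reach roughly 5% photometry and the S/N > 10 bands roughly 10%, which is the sense in which the survey matches LSST Year 1–2 depth.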

     