

Title: A Data Science Platform to Enable Time-domain Astronomy
Abstract

SkyPortal is an open-source software package designed to discover interesting transients efficiently, manage follow-up, perform characterization, and visualize the results. By enabling fast access to archival and catalog data, crossmatching heterogeneous data streams, and the triggering and monitoring of on-demand observations for further characterization, a SkyPortal-based platform has been operating at scale for >2 yr for the Zwicky Transient Facility Phase II community, with hundreds of users, containing tens of millions of time-domain sources, interacting with dozens of telescopes, and enabling community reporting. While SkyPortal emphasizes rich user experiences across common front-end workflows, recognizing that scientific inquiry is increasingly performed programmatically, SkyPortal also surfaces an extensive and well-documented application programming interface (API) system. From back-end and front-end software to data science analysis tools and visualization frameworks, the SkyPortal design emphasizes the reuse and leveraging of best-in-class approaches, with a strong extensibility ethos. For instance, SkyPortal now leverages ChatGPT large language models to generate and surface source-level human-readable summaries automatically. With the imminent restart of the next generation of gravitational-wave detectors, SkyPortal now also includes dedicated multimessenger features addressing the requirements of rapid multimessenger follow-up: multitelescope management, team/group organizing interfaces, and crossmatching of multimessenger data streams with time-domain optical surveys, with interfaces sufficiently intuitive for newcomers to the field. This paper focuses on the detailed implementations, capabilities, and early science results that establish SkyPortal as a community software package ready to take on the data science challenges and opportunities presented by this next chapter in the multimessenger era.
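To give a flavor of the API the abstract refers to, here is a minimal Python sketch that lists saved sources from a SkyPortal instance. The instance URL and token are placeholders, and the endpoint, parameter, and response-field names used here should be checked against the SkyPortal API documentation.

```python
# Minimal sketch of programmatic access to a SkyPortal instance.
# Assumptions: a running instance at SKYPORTAL_URL and a valid API
# token; endpoint and field names per the public SkyPortal API docs.
import requests

SKYPORTAL_URL = "https://skyportal.example.org"  # placeholder instance
TOKEN = "your-api-token"                         # placeholder token

def get_sources(num_per_page=10):
    """Fetch one page of saved sources from the /api/sources endpoint."""
    response = requests.get(
        f"{SKYPORTAL_URL}/api/sources",
        headers={"Authorization": f"token {TOKEN}"},
        params={"numPerPage": num_per_page},
    )
    response.raise_for_status()
    return response.json()["data"]["sources"]

if __name__ == "__main__":
    for source in get_sources():
        print(source["id"], source.get("ra"), source.get("dec"))
```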

 
NSF-PAR ID:
10436922
Publisher / Repository:
DOI PREFIX: 10.3847
Journal Name:
The Astrophysical Journal Supplement Series
Volume:
267
Issue:
2
ISSN:
0067-0049
Size:
Article No. 31
Sponsoring Org:
National Science Foundation
More Like this
  1. We are now over four decades into digitally managing the names of Earth's species. As the number of federating efforts (i.e., software that brings together previously disparate projects under a common infrastructure, for example TaxonWorks) and aggregating efforts (e.g., the International Plant Names Index, Catalogue of Life (CoL)) increases, there remains an unmet need both for the migration forward of old data and for the production of new, precise, and comprehensive nomenclatural catalogs. Given this context, we provide an overview of how TaxonWorks seeks to contribute to this effort, and where it might evolve in the future.

In TaxonWorks, when we talk about governed names and relationships, we mean it in the sense of the existing international codes of nomenclature (e.g., the International Code of Zoological Nomenclature (ICZN)). More technically, nomenclature is defined as a set of objective assertions that describe the relationships between the names given to biological taxa and the rules that determine how those names are governed. It is critical to note that this is not the same thing as the relationship between a name and a biological entity; rather, nomenclature in TaxonWorks represents the details of the (governed) relationships between names. Rather than thinking of nomenclature as changing (a verb commonly used to express frustration with biological nomenclature), it is useful to think of nomenclature as a set of data points that grows over time. For example, when synonymy happens, we do not erase the past but rather record a new context for the name(s) in question. The biological concept changes, but the nomenclature (names) simply keeps adding up.

Behind the scenes, nomenclature in TaxonWorks is represented by a set of nodes and edges, i.e., a mathematical graph, or network (e.g., Fig. 1). Most names (i.e., nodes in the network) are what TaxonWorks calls "protonyms," monomial epithets that are used to construct, for example, binomial names (not to be confused with "protonym" sensu the ICZN). Protonyms are linked to other protonyms via relationships defined in NOMEN, an ontology that encodes the governed rules of nomenclature. Within the system, all data, nodes and edges alike, can be cited, i.e., linked to a source and therefore anchored in time and tied to authorship, and annotated with a variety of annotation types (e.g., notes, confidence levels, tags). The actual building of the graphs is greatly simplified by multiple user interfaces that allow scientists to review (e.g., Fig. 2), create, filter, and add to (again, not "change") the nomenclatural history.

As in any complex knowledge-representation model, outlying scenarios, or edge cases, emerge, making certain human tasks more complex than others. TaxonWorks is no exception: it has limitations in terms of what can be represented and how. While many complex representations are hidden by simplified user interfaces, some, for example the handling of the ICZN's family-group names, batch-loading of invalid relationships, and comparative syncing against external resources, need more work to simplify the processes presently required to meet catalogers' needs.

The depth at which TaxonWorks can capture nomenclature is only really valuable if it can be used by others. This is facilitated by the application programming interface (API) serving its data (https://api.taxonworks.org), by serving text files, and by exports to standards like the emerging Catalogue of Life Data Package. With reference to real-world problems, we illustrate different ways in which the API can be used, for example integrated into spreadsheets, through command-line scripts, and in the generation of public-facing websites (a sketch follows below). Behind all this effort are an increasing number of people recording help videos, developing documentation, and troubleshooting software and technical issues. Major contributions have come from developers at many skill levels, from high school students to senior software engineers, illustrating that TaxonWorks leads in enabling both technical and domain-based contributions. The health and growth of this community is a key factor in the potential long-term impact of TaxonWorks in the effort to unify the names of Earth's species.
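As a rough illustration of the scripted API use mentioned in the text, here is a minimal Python example querying a TaxonWorks API for taxon names. The endpoint path, parameter names, and response fields below are assumptions made for illustration only; the documented interface at https://api.taxonworks.org is authoritative.

```python
# Illustrative sketch of scripted access to a TaxonWorks API.
# The endpoint path, parameters, and response fields are assumptions;
# see https://api.taxonworks.org for the documented interface.
import requests

BASE_URL = "https://sandbox.taxonworks.org/api/v1"  # hypothetical instance
TOKEN = "your-user-token"                           # placeholder
PROJECT_TOKEN = "your-project-token"                # placeholder

def search_taxon_names(text):
    """Return taxon-name records matching the query text."""
    response = requests.get(
        f"{BASE_URL}/taxon_names",
        params={"name": text, "token": TOKEN, "project_token": PROJECT_TOKEN},
    )
    response.raise_for_status()
    return response.json()

for record in search_taxon_names("Aus bus"):
    print(record.get("cached"), "-", record.get("rank"))
```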
  2.
    The Deep Learning Epilepsy Detection Challenge: design, implementation, and test of a new crowd-sourced AI challenge ecosystem. Isabell Kiral*, Subhrajit Roy*, Todd Mummert*, Alan Braz*, Jason Tsay, Jianbin Tang, Umar Asif, Thomas Schaffter, Eren Mehmet, The IBM Epilepsy Consortium◊, Joseph Picone, Iyad Obeid, Bruno De Assis Marques, Stefan Maetschke, Rania Khalaf†, Michal Rosen-Zvi†, Gustavo Stolovitzky†, Mahtab Mirmomeni†, Stefan Harrer†. * These authors contributed equally to this work. † Corresponding authors: rkhalaf@us.ibm.com, rosen@il.ibm.com, gustavo@us.ibm.com, mahtabm@au1.ibm.com, sharrer@au.ibm.com. ◊ Members of the IBM Epilepsy Consortium are listed in the Acknowledgements section. J. Picone and I. Obeid are with Temple University, USA. T. Schaffter is with Sage Bionetworks, USA. E. Mehmet is with the University of Illinois at Urbana-Champaign, USA. All other authors are with IBM Research in the USA, Israel, and Australia.

Introduction
This decade has seen an ever-growing number of scientific fields benefitting from advances in machine learning technology and tooling. More recently, this trend has reached the medical domain, with applications ranging from cancer diagnosis [1] to the development of brain-machine interfaces [2]. While Kaggle has pioneered the crowd-sourcing of machine learning challenges to incentivise data scientists from around the world to advance algorithm and model design, the increasing complexity of problem statements demands that participants be expert data scientists, deeply knowledgeable in at least one other scientific domain, and competent software engineers with access to large compute resources. People who match this description are few and far between, unfortunately leading to a shrinking pool of possible participants and a loss of experts dedicating their time to solving important problems. Participation is restricted even further in the context of any challenge run on confidential use cases or with sensitive data. Recently, we designed and ran a deep learning challenge to crowd-source the development of an automated labelling system for brain recordings, aiming to advance epilepsy research. A focus of this challenge, run internally at IBM, was the development of a platform that lowers the barrier of entry and therefore mitigates the risk of excluding interested parties from participating.

The challenge: enabling wide participation
With the goal of running a challenge that mobilises the largest possible pool of participants from IBM (globally), we designed a use case around previous work in epileptic seizure prediction [3]. In this "Deep Learning Epilepsy Detection Challenge", participants were asked to develop an automatic labelling system to reduce the time a clinician would need to diagnose patients with epilepsy. Labelled training and blind validation data for the challenge were generously provided by Temple University Hospital (TUH) [4]. TUH also devised a novel scoring metric for the detection of seizures that was used as the basis for algorithm evaluation [5]. In order to provide an experience with a low barrier of entry, we designed a generalisable challenge platform under the following principles: 1. No participant should need in-depth knowledge of the specific domain (i.e., no participant should need to be a neuroscientist or epileptologist). 2. No participant should need to be an expert data scientist. 3. No participant should need more than basic programming knowledge (i.e., no participant should need to learn how to process fringe data formats and stream data efficiently). 4. No participant should need to provide their own computing resources. In addition to the above, our platform should further guide participants through the entire process from sign-up to model submission, facilitate collaboration, and provide instant feedback to the participants through data visualisation and intermediate online leaderboards.

The platform
The architecture of the platform that was designed and developed is shown in Figure 1. The entire system consists of a number of interacting components. (1) A web portal serves as the entry point to challenge participation, providing challenge information, such as timelines and challenge rules, and scientific background. The portal also facilitated the formation of teams and provided participants with an intermediate leaderboard of submitted results and a final leaderboard at the end of the challenge. (2) IBM Watson Studio [6] is the umbrella term for a number of services offered by IBM. Upon creation of a user account through the web portal, an IBM Watson Studio account was automatically created for each participant, giving users access to IBM's Data Science Experience (DSX), the analytics engine Watson Machine Learning (WML), and IBM's Cloud Object Storage (COS) [7], all of which are described in more detail in further sections. (3) The user interface and starter kit were hosted on IBM's Data Science Experience platform (DSX) and formed the main component for designing and testing models during the challenge. DSX allows for real-time collaboration on shared notebooks between team members. A starter kit in the form of a Python notebook, supporting the popular deep learning libraries TensorFlow [8] and PyTorch [9], was provided to all teams to guide them through the challenge process. Upon instantiation, the starter kit loaded the necessary Python libraries and custom functions for the invisible integration with COS and WML. In dedicated spots in the notebook, participants could write custom pre-processing code, machine learning models, and post-processing algorithms (a schematic sketch of this hook structure follows below). The starter kit provided instant feedback about participants' custom routines through data visualisations. Using the notebook only, teams were able to run the code on WML, making use of a compute cluster of IBM's resources. The starter kit also enabled submission of the final code to a data storage area to which only the challenge team had access. (4) Watson Machine Learning provided access to shared compute resources (GPUs). Code was bundled up automatically in the starter kit and deployed to and run on WML. WML in turn had access to shared storage, from which it requested recorded data and to which it stored the participants' code and trained models. (5) IBM's Cloud Object Storage held the data for this challenge. Using the starter kit, participants could investigate their results as well as data samples in order to better design custom algorithms. (6) Utility functions were loaded into the starter kit at instantiation. This set of functions included code to pre-process data into a more common format, to optimise streaming through the use of the NutsFlow and NutsML libraries [10], and to provide seamless access to all the IBM services used. Not captured in the diagram is the final code evaluation, which was conducted automatically as soon as code was submitted through the starter kit, minimising the burden on the challenge organising team.
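To make the starter-kit workflow concrete, the following is a schematic Python sketch of the three participant hooks described above (pre-processing, model, post-processing). All function names and the trivial stand-in model are our own invention, and the real starter kit's COS/WML integration is only stubbed.

```python
# Schematic sketch of a starter-kit-style notebook: participants fill in
# the three hooks; the surrounding orchestration (data access, remote
# execution, submission) is only stubbed. All names are illustrative.
import numpy as np

def preprocess(raw_eeg: np.ndarray, sample_rate: int) -> np.ndarray:
    """Participant hook: e.g., filter and window the EEG signal."""
    window = sample_rate * 2  # 2-second windows, an arbitrary choice
    n_windows = raw_eeg.shape[-1] // window
    return raw_eeg[..., : n_windows * window].reshape(-1, window)

def train_model(features: np.ndarray, labels: np.ndarray):
    """Participant hook: train any model (TensorFlow, PyTorch, ...)."""
    threshold = features.std()  # trivial stand-in for a real model
    return lambda x: (x.std(axis=-1) > threshold).astype(int)

def postprocess(window_predictions: np.ndarray) -> np.ndarray:
    """Participant hook: e.g., smooth per-window seizure predictions."""
    return np.convolve(window_predictions, np.ones(3) / 3, mode="same") > 0.5

# Orchestration stub: in the real platform this ran on WML against
# TUH data stored in COS, not on local arrays.
if __name__ == "__main__":
    eeg = np.random.randn(1, 256 * 60)  # one minute of fake EEG
    windows = preprocess(eeg, sample_rate=256)
    model = train_model(windows, labels=np.zeros(len(windows)))
    print(postprocess(model(windows)))
```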
Figure 1: High-level architecture of the challenge platform.

Measuring success
The competitive phase of the "Deep Learning Epilepsy Detection Challenge" ran for 6 months. Twenty-five teams, comprising a total of 87 scientists and software engineers from 14 global locations, participated. All participants made use of the starter kit we provided and ran algorithms on IBM's WML infrastructure. Seven teams persisted until the end of the challenge and submitted final solutions. The best-performing solutions reached seizure detection performances that could reduce a hundredfold the time epileptologists need to annotate continuous EEG recordings. Thus, we expect the developed algorithms to aid in the diagnosis of epilepsy by significantly shortening manual labelling time. Detailed results are currently in preparation for publication. Equally important to solving the scientific challenge, however, was to understand whether we managed to encourage participation from non-expert data scientists.

Figure 2: Primary occupation as reported by challenge participants.

Of the 40 participants for whom we have occupational information, 23 reported Data Science or AI as their main job description, 11 reported being Software Engineers, and 2 had expertise in Neuroscience. Figure 2 shows that participants had a variety of specialisations, including some in no way related to data science, software engineering, or neuroscience. No participant had deep knowledge and experience in all three of data science, software engineering, and neuroscience.

Conclusion
Given the growing complexity of data science problems and increasing dataset sizes, solving these problems requires enabling collaboration between people with different expertise, with a focus on inclusiveness and a low barrier of entry. We designed, implemented, and tested a challenge platform to address exactly this. Using our platform, we ran a deep-learning challenge for epileptic seizure detection. Eighty-seven IBM employees from several business units, including but not limited to IBM Research, with a variety of skills including sales and design, participated in this highly technical challenge.
  3. Abstract

    The discovery of the electromagnetic counterpart to the binary neutron star (NS) merger GW170817 has opened the era of gravitational-wave multimessenger astronomy. Rapid identification of the optical/infrared kilonova enabled a precise localization of the source, which paved the way to deep multiwavelength follow-up and the myriad related science results. Fully exploiting this new territory of exploration requires the acquisition of electromagnetic data from samples of NS mergers and other gravitational-wave sources. After GW170817, the frontier is now to map the diversity of kilonova properties, provide more stringent constraints on the Hubble constant, and enable new tests of fundamental physics. The Vera C. Rubin Observatory's Legacy Survey of Space and Time can play a key role in this field in the 2020s, when an improved network of gravitational-wave detectors is expected to reach a sensitivity that will enable the discovery of a high rate of merger events involving NSs (∼tens per year) out to distances of several hundred megaparsecs. We design comprehensive target-of-opportunity observing strategies for follow-up of gravitational-wave triggers that will make the Rubin Observatory the premier instrument for discovery and early characterization of NS and other compact-object mergers, as well as of as-yet-unknown classes of gravitational-wave events.
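One concrete building block of such target-of-opportunity strategies is selecting pointings by the gravitational-wave localization probability they enclose. Here is a minimal Python sketch of that step using a HEALPix skymap, the standard format distributed with GW alerts; the field coordinates, field radius, and toy map are invented for illustration and are not the strategies designed in the paper.

```python
# Minimal sketch: rank candidate pointings by enclosed GW localization
# probability. Uses the standard HEALPix skymap format distributed with
# GW alerts; the field list below is invented for illustration.
import healpy as hp
import numpy as np

def field_probability(prob_map, ra_deg, dec_deg, radius_deg):
    """Sum skymap probability within a circular field of view."""
    nside = hp.get_nside(prob_map)
    center = hp.ang2vec(ra_deg, dec_deg, lonlat=True)
    pixels = hp.query_disc(nside, center, np.radians(radius_deg))
    return prob_map[pixels].sum()

# prob_map = hp.read_map("bayestar.fits.gz")  # a real alert skymap
prob_map = np.full(hp.nside2npix(64), 1.0)
prob_map /= prob_map.sum()  # toy uniform map so the sketch runs standalone

fields = [(30.0, -15.0), (31.5, -14.0), (200.0, 40.0)]  # invented (RA, Dec)
ranked = sorted(
    fields,
    key=lambda f: field_probability(prob_map, *f, radius_deg=1.75),
    reverse=True,
)
print(ranked)
```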

     
  4. Abstract

    The modern study of astrophysical transients has been transformed by an exponentially growing volume of data. Within the last decade, the transient discovery rate has increased by a factor of ∼20, with associated survey data, archival data, and metadata also increasing with the number of discoveries. To manage the data at this increased rate, we require new tools. Here we present YSE-PZ, a transient survey management platform that ingests multiple live streams of transient discovery alerts, identifies the host galaxies of those transients, downloads coincident archival data, and retrieves photometry and spectra from ongoing surveys. YSE-PZ also presents a user with a range of tools to make and support timely and informed transient follow-up decisions. Those subsequent observations enhance transient science and can reveal physics only accessible with rapid follow-up observations. Rather than automating out human interaction, YSE-PZ focuses on accelerating and enhancing human decision making, a role we describe as empowering the human-in-the-loop. Finally, YSE-PZ is built to be flexibly used and deployed; YSE-PZ can support multiple, simultaneous, and independent transient collaborations through group-level data permissions, allowing a user to view the data associated with the union of all groups in which they are a member. YSE-PZ can be used as a local instance installed via Docker or deployed as a service hosted in the cloud. We provide YSE-PZ as an open-source tool for the community.
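The group-level permission model described above, where a user sees the union of the data across all of their groups, can be illustrated with a short framework-free Python sketch. All class, field, and group names below are invented for illustration and do not reflect YSE-PZ's actual schema.

```python
# Pure-Python sketch of group-scoped visibility: a user sees the union
# of transients shared with any group they belong to. Names invented.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Transient:
    name: str
    groups: frozenset  # collaborations the transient is shared with

@dataclass
class User:
    name: str
    groups: set = field(default_factory=set)

def visible_transients(user: User, catalog: list[Transient]) -> list[Transient]:
    """Return transients shared with at least one of the user's groups."""
    return [t for t in catalog if t.groups & user.groups]

catalog = [
    Transient("2023abc", frozenset({"GroupA", "GroupB"})),  # invented groups
    Transient("2023xyz", frozenset({"GroupC"})),
]
alice = User("alice", {"GroupA"})
print([t.name for t in visible_transients(alice, catalog)])  # ['2023abc']
```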

     
  5. Abstract

    We present upgraded infrastructure for Searches After Gravitational waves Using ARizona Observatories (SAGUARO) during LIGO, Virgo, and KAGRA’s fourth gravitational-wave (GW) observing run (O4). These upgrades implement many of the lessons we learned after a comprehensive analysis of potential electromagnetic counterparts to the GWs discovered during the previous observing run. We have developed a new web-based target and observation manager (TOM) that allows us to coordinate sky surveys, vet potential counterparts, and trigger follow-up observations from one centralized portal. The TOM includes software that aggregates all publicly available information on the light curves and possible host galaxies of targets, allowing us to rule out potential contaminants like active galactic nuclei, variable stars, solar system objects, and preexisting supernovae, as well as to assess the viability of any plausible counterparts. We have also upgraded our image-subtraction pipeline by assembling deeper reference images and training a new neural-network-based real–bogus classifier. These infrastructure upgrades will aid coordination by enabling the prompt reporting of observations, discoveries, and analysis to the GW follow-up community, and put SAGUARO in an advantageous position to discover kilonovae in the remainder of O4 and beyond. Many elements of our open-source software stack have broad utility beyond multimessenger astronomy, and will be particularly relevant in the “big data” era of transient discoveries by the Vera C. Rubin Observatory.
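A real/bogus classifier of the kind mentioned above typically scores small triplets of science, reference, and difference-image cutouts. The following is a minimal PyTorch sketch with an invented architecture, cutout size, and channel convention, meant only to illustrate the idea; it is not SAGUARO's actual network.

```python
# Minimal sketch of a CNN real/bogus classifier for difference-image
# cutouts. Architecture, cutout size, and channels are invented for
# illustration; this is not SAGUARO's actual network.
import torch
import torch.nn as nn

class RealBogusNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # science/ref/diff
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Return the probability that each cutout is a real transient."""
        h = self.features(x).flatten(1)
        return torch.sigmoid(self.head(h))

model = RealBogusNet()
cutouts = torch.randn(8, 3, 63, 63)  # batch of fake triplet cutouts
scores = model(cutouts)              # values in (0, 1); threshold to vet
print(scores.squeeze(1))
```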

     