

Title: Data Reduction Integrated Python Protocol for the Arecibo Pisces-Perseus Supercluster Survey (DRIPP for APPSS)
Developments in open-source high-level programming languages enable undergraduate students to make vital contributions to modern astronomical surveys. The Arecibo Pisces-Perseus Supercluster Survey (APPSS) currently uses data analysis software written in Interactive Data Language (IDL). We discuss the conversion of this software to the Python programming language, using freely available standard libraries, and the conversion of the data to the Single-Dish FITS (SDFITS) standard. The Data Reduction Integrated Python Protocol (DRIPP) provides user-guided data reduction with an interface similar to that of the former IDL software. Converting to DRIPP would give researchers more accessible data processing capabilities for APPSS (or any similar radio spectral survey). This work has been supported by NSF AST-1637339.
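As a rough illustration of what working with SDFITS products in Python can look like, the sketch below opens a single-dish FITS file with astropy. The file name is hypothetical, and the assumption that spectra live in a DATA column of the first binary-table extension follows the general SDFITS convention rather than any specific DRIPP output.

    # A minimal sketch, assuming astropy is installed and the file follows the
    # SDFITS convention of a binary table with a DATA column; the file name is
    # hypothetical and actual APPSS/DRIPP products may differ in detail.
    from astropy.io import fits

    with fits.open("appss_spectrum.sdfits") as hdul:     # hypothetical file name
        sdfits_table = hdul[1].data                      # first binary-table extension
        print(hdul[1].columns.names)                     # inspect the available columns
        spectrum = sdfits_table["DATA"][0]               # one row per recorded spectrum
        print(spectrum.shape)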
Award ID(s):
1637339
NSF-PAR ID:
10168217
Author(s) / Creator(s):
Date Published:
Journal Name:
American Astronomical Society meeting
Volume:
235
ISSN:
2152-887X
Page Range / eLocation ID:
279.11
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The Arecibo Pisces-Perseus Supercluster Survey (APPSS) is an observing project undertaken by the Undergraduate ALFALFA Team that aims to detect HI in galaxies in the Pisces-Perseus neighborhood and to analyze the dynamics and properties of those galaxies. The galaxies targeted in APPSS are suspected from their optical properties (color, morphology, surface brightness) to lie in the Pisces-Perseus Supercluster (PPS) but are below the detection threshold of the ALFALFA blind HI survey. Here we present results for galaxies targeted in a strip across the PPS region in declination from 30° to 32°. This region lies along the main filament of the supercluster and includes objects such as the Pisces Cluster. The data were recorded with the L-Band Wide receiver of the Arecibo Observatory. Data reduction was performed using IDL routines developed for APPSS. After baselining the spectra and sifting out radio interference, we fit either a Gaussian or a two-horned profile to the 21-centimeter line of each galaxy to measure the HI line flux density, velocity, and velocity width. From these parameters we calculate distances, hydrogen gas masses, and rotational velocities. As expected, the galaxies analyzed in this slice of declination have consistently lower masses than the ALFALFA detections, thus extending the sampling of galaxies within the PPS. The combined ALFALFA and APPSS HI line detections will be used for future applications of the Baryonic Tully-Fisher Relation in this region. This research has been supported by NSF grant NSF/AST-1714828 to M.P. Haynes and by the Brinson Foundation for the Arecibo Pisces-Perseus Supercluster Survey (APPSS).
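For reference, the sketch below shows how the quantities mentioned above are conventionally derived from a fitted 21-cm profile; the Hubble constant and the input numbers are illustrative placeholders, not APPSS measurements.

    # A minimal sketch of the standard relations behind the quantities above;
    # H0 and the example inputs are illustrative, not APPSS values.
    import math

    H0 = 70.0                                     # km/s/Mpc, assumed Hubble constant

    def hi_mass(cz_kms, flux_jy_kms, h0=H0):
        """HI mass in solar masses from recessional velocity and integrated flux."""
        d_mpc = cz_kms / h0                       # crude Hubble-flow distance in Mpc
        return 2.356e5 * d_mpc**2 * flux_jy_kms   # standard single-dish HI mass relation

    def rotation_velocity(w50_kms, inclination_deg):
        """Inclination-corrected rotational velocity from the HI line width."""
        return w50_kms / (2.0 * math.sin(math.radians(inclination_deg)))

    print(f"M_HI  ~ {hi_mass(5000.0, 1.2):.2e} Msun")       # cz = 5000 km/s, S = 1.2 Jy km/s
    print(f"v_rot ~ {rotation_velocity(200.0, 60.0):.0f} km/s")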
  2. Testing is an integral but often neglected part of the software development process. Classical test generation tools such as EvoSuite generate behavioral test suites by optimizing for coverage, but tend to produce tests that are hard to understand. Language models trained on code can generate code that is highly similar to that written by humans, but current models are trained to generate each file separately, as is standard practice in natural language processing, and thus fail to consider the code-under-test context when producing a test file. In this work, we propose the Aligned Code And Tests Language Model (CAT-LM), a GPT-style language model with 2.7 billion parameters, trained on a corpus of Python and Java projects. We utilize a novel pretraining signal that explicitly considers the mapping between code and test files when available. We also drastically increase the maximum sequence length of inputs to 8,192 tokens, 4x more than typical code generation models, to ensure that the code context is available to the model when generating test code. We analyze its usefulness for realistic applications, showing that sampling with filtering (e.g., by compilability, coverage) allows it to efficiently produce tests that achieve coverage similar to ones written by developers while resembling their writing style. By utilizing the code context, CAT-LM generates more valid tests than even much larger language models trained with more data (CodeGen 16B and StarCoder) and substantially outperforms a recent test-specific model (TeCo) at test completion. Overall, our work highlights the importance of incorporating software-specific insights when training language models for code and paves the way to more powerful automated test generation.
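To make the "sampling with filtering" idea concrete, the sketch below keeps only generated test candidates that at least parse as valid Python. The candidate strings and helper name are hypothetical; this is an illustration of the filtering step, not CAT-LM's actual pipeline.

    # A minimal sketch of filtering sampled test candidates by a cheap
    # syntactic check; the candidates and helper are hypothetical.
    import ast

    def parses(source: str) -> bool:
        """Cheap syntactic filter: does the candidate compile to an AST?"""
        try:
            ast.parse(source)
            return True
        except SyntaxError:
            return False

    candidates = [
        "def test_add():\n    assert add(2, 3) == 5\n",    # syntactically valid
        "def test_add(:\n    assert add(2, 3) == 5\n",     # syntax error, filtered out
    ]

    kept = [c for c in candidates if parses(c)]
    print(f"kept {len(kept)} of {len(candidates)} candidates")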
  3. PmagPy Online: Jupyter Notebooks, the PmagPy Software Package and the Magnetics Information Consortium (MagIC) Database
Lisa Tauxe$^1$, Rupert Minnett$^2$, Nick Jarboe$^1$, Catherine Constable$^1$, Anthony Koppers$^2$, Lori Jonestrask$^1$, Nick Swanson-Hysell$^3$
$^1$Scripps Institution of Oceanography, United States of America; $^2$Oregon State University; $^3$University of California, Berkeley; ltauxe@ucsd.edu
The Magnetics Information Consortium (MagIC), hosted at http://earthref.org/MagIC, is a database that serves as a Findable, Accessible, Interoperable, Reusable (FAIR) archive for paleomagnetic and rock magnetic data. It has a flexible, comprehensive data model that can accommodate most kinds of paleomagnetic data. The PmagPy software package is a cross-platform, open-source set of tools written in Python for the analysis of paleomagnetic data that serves as one interface to MagIC, accommodating various levels of user expertise. It is available through github.com/PmagPy. Because PmagPy requires installation of Python, several non-standard Python modules, and the PmagPy software package itself, there is a speed bump for many practitioners beginning to use the software. In order to make the software and MagIC more accessible to the broad spectrum of scientists interested in paleo and rock magnetism, we have prepared a set of Jupyter notebooks, hosted on jupyterhub.earthref.org, which serve several purposes: 1) a complete course in Python for Earth scientists, 2) a set of notebooks that introduce PmagPy (pulling the software package from the GitHub repository) and illustrate how it can be used to create data products and figures for typical papers, and 3) notebooks that show how to prepare data from the laboratory for upload into the MagIC database. The latter satisfies NSF expectations for data archiving and, for example, the AGU publication data archiving requirements.
Getting started: To use the PmagPy notebooks online, go to the website at https://jupyterhub.earthref.org/. Create an EarthRef account using your ORCID and log on. [This allows you to keep files in a private workspace.] Open the PmagPy Online - Setup notebook and execute the two cells. Then click on File > Open and click on the PmagPy_Online folder. Open the PmagPy_online notebook and work through the examples. There are other notebooks that are useful for the working paleomagnetist. Alternatively, you can install Python and the PmagPy software package on your computer (see https://earthref.org/PmagPy/cookbook for instructions). Follow the instructions for "Full PmagPy install and update" through section 1.4 (Quickstart with PmagPy notebooks). This notebook is in the collection of PmagPy notebooks.
Overview of MagIC: The Magnetics Information Consortium (MagIC), hosted at http://earthref.org/MagIC, is a database that serves as a Findable, Accessible, Interoperable, Reusable (FAIR) archive for paleomagnetic and rock magnetic data. Its data model is fully described here: https://www2.earthref.org/MagIC/data-models/3.0. Each contribution is associated with a publication via the DOI. There are nine data tables, including:
contribution: metadata of the associated publication.
locations: metadata for locations, which are groups of sites (e.g., stratigraphic section, region, etc.).
sites: metadata and derived data at the site level (units with a common expectation).
samples: metadata and derived data at the sample level.
specimens: metadata and derived data at the specimen level.
criteria: criteria by which data are deemed acceptable.
ages: ages and metadata for sites/samples/specimens.
images: associated images and plots.
Overview of PmagPy: The functionality of PmagPy is demonstrated within notebooks in the PmagPy repository. PmagPy_online.ipynb serves as an introduction to PmagPy and MagIC (this conference); it highlights the link between PmagPy and the Findable, Accessible, Interoperable, Reusable (FAIR) database maintained by the Magnetics Information Consortium (MagIC) at https://earthref.org/MagIC. Other notebooks of interest are: PmagPy_calculations.ipynb, which demonstrates many of the PmagPy calculation functions such as those that rotate directions, return statistical parameters, and simulate data from specified distributions; PmagPy_plots_analysis.ipynb, which demonstrates PmagPy functions that can be used to visualize data as well as those that conduct statistical tests with associated visualizations; and PmagPy_MagIC.ipynb, which demonstrates how PmagPy can be used to read and write data to and from the MagIC database format, including conversion from many individual lab measurement file formats. Please also see our YouTube channel with more presentations from the 2020 MagIC workshop: https://www.youtube.com/playlist?list=PLirL2unikKCgUkHQ3m8nT29tMCJNBj4kj
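In the spirit of the PmagPy_calculations notebook, the sketch below simulates a set of directions and computes their Fisher mean from plain Python. The function names and arguments are recalled from memory and should be treated as assumptions to check against the notebooks themselves.

    # A minimal sketch of calling PmagPy directly from Python; the ipmag
    # function names and argument lists below are assumed, not verified.
    import pmagpy.ipmag as ipmag

    # Simulate 50 directions drawn from a Fisher distribution (assumed signature)
    directions = ipmag.fishrot(k=30, n=50, dec=200, inc=45, di_block=True)

    # Compute the Fisher mean of those directions (assumed signature)
    mean = ipmag.fisher_mean(di_block=directions)
    print(mean["dec"], mean["inc"], mean["alpha95"])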
  4. A great part of software development involves conceptualizing or communicating the underlying procedures and logic that need to be expressed in programs. One major difficulty of programming is turning concept into code, especially when dealing with the APIs of unfamiliar libraries. Recently, there has been a proliferation of machine learning methods for code generation and retrieval from natural language queries, but these have primarily been evaluated purely on retrieval accuracy or on the overlap of generated code with developer-written code, and the actual effect of these methods on the developer workflow has received surprisingly little study. In this article, we perform the first comprehensive investigation of the promise and challenges of using such technology inside the PyCharm IDE, asking, "At the current state of technology, does it improve developer productivity or accuracy, how does it affect the developer experience, and what are the remaining gaps and challenges?" To facilitate the study, we first develop a plugin for the PyCharm IDE that implements a hybrid of code generation and code retrieval functionality, and we orchestrate virtual environments to enable collection of many user events (e.g., web browsing, keystrokes, fine-grained code edits). We ask developers with various backgrounds to complete 14 Python programming tasks of 7 varieties, ranging from basic file manipulation to machine learning or data visualization, with or without the help of the plugin. While qualitative surveys of developer experience are largely positive, quantitative results with regard to increased productivity, code quality, or program correctness are inconclusive. Further analysis identifies several pain points that could improve the effectiveness of future machine learning-based code generation/retrieval developer assistants and demonstrates when developers prefer code generation over code retrieval and vice versa. We release all data and software to pave the road for future empirical studies on this topic, as well as development of better code generation models.
  5. Abstract Analysis description languages are declarative interfaces for HEP data analysis that allow users to avoid writing event loops, simplify code, and enable performance improvements to be decoupled from analysis development. One example is FuncADL, inspired by functional programming and developed using Python as a host language. FuncADL borrows concepts from database query languages to isolate the interface from the underlying physical and logical schemas. The same query can be used to select data from different sources and formats and with different execution mechanisms. FuncADL is one of the tools being developed by IRIS-HEP for highly scalable physics analysis for the LHC and HL-LHC. FuncADL is demonstrated by implementing example analysis tasks designed by HSF and IRIS-HEP. Another language example is ADL, which expresses the physics content of an analysis in a standard and unambiguous way, independent of computing frameworks. In ADL, analyses are described in human-readable text files composed of blocks with a keyword-expression structure. Two infrastructures are available to render ADL executable: CutLang, a runtime interpreter written in C++; and adl2tnm, a transpiler converting ADL into C++ or Python code. ADL/CutLang are already used in several physics studies and educational projects, and are adapted for use with LHC Open Data. 
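To illustrate the declarative query style described above, the sketch below is a toy, self-contained mock of Select/Where chaining over events. It is not the func_adl package API, only a demonstration of how such an interface separates the query from its execution.

    # A toy mock of the functional Select/Where query style; the EventStream
    # class and the mock events are hypothetical, not the FuncADL library.
    class EventStream:
        def __init__(self, events):
            self._events = list(events)

        def Where(self, predicate):
            return EventStream(e for e in self._events if predicate(e))

        def Select(self, projection):
            return EventStream(projection(e) for e in self._events)

        def value(self):
            return self._events

    # Example: keep only jets above a pT threshold in each mock event
    events = [{"jet_pt": [12.0, 45.3, 88.1]}, {"jet_pt": [5.5]}]
    result = (EventStream(events)
              .Select(lambda e: [pt for pt in e["jet_pt"] if pt > 30.0])
              .value())
    print(result)   # [[45.3, 88.1], []]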