- Award ID(s):
- 1640575
- NSF-PAR ID:
- 10298544
- Date Published:
- Journal Name:
- 33rd International Conference on Scientific and Statistical Database Management
- Page Range / eLocation ID:
- 109 to 120
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
null (Ed.)Datasets are often derived by manipulating raw data with statistical software packages. The derivation of a dataset must be recorded in terms of both the raw input and the manipulations applied to it. Statistics packages typically provide limited help in documenting provenance for the resulting derived data. At best, the operations performed by the statistical package are described in a script. Disparate representations make these scripts hard to understand for users. To address these challenges, we created Continuous Capture of Metadata (C2Metadata), a system to capture data transformations in scripts for statistical packages and represent it as metadata in a standard format that is easy to understand. We do so by devising a Structured Data Transformation Algebra (SDTA), which uses a small set of algebraic operators to express a large fraction of data manipulation performed in practice. We then implement SDTA, inspired by relational algebra, in a data transformation specification language we call SDTL. In this demonstration, we showcase C2Metadata’s capture of data transformations from a pool of sample transformation scripts in at least two languages: SPSS®and Stata®(SAS®and R are under development), for social science data in a large academic repository. We will allow the audience to explore C2Metadata using a web-based interface, visualize the intermediate steps and trace the provenance and changes of data at different levels for better understanding of the process.more » « less
-
null (Ed.)Structured Data Transformation Language (SDTL) provides structured, machine actionable representations of data transformation commands found in statistical analysis software. The Continuous Capture of Metadata for Statistical Data Project (C2Metadata) created SDTL as part of an automated system that captures provenance metadata from data transformation scripts and adds variable derivations to standard metadata files. SDTL also has potential for auditing scripts and for translating scripts between languages. SDTL is expressed in a set of JSON schemas, which are machine actionable and easily serialized to other formats. Statistical software languages have a number of special features that have been carried into SDTL. We explain how SDTL handles differences among statistical languages and complex operations, such as merging files and reshaping data tables from “wide” to “long”.more » « less
-
A native cross-platform mobile app has multiple platform-specific implementations. Typically, an app is developed for one platform and then ported to the remaining ones. Translating an app from one language (e.g., Java) to another (e.g., Swift) by hand is tedious and error-prone, while automated translators either require manually defined translation rules or focus on translating APIs. To automate the translation of native cross-platform apps, we present J2SINFERER, a novel approach that iteratively infers syntactic transformation rules and API mappings from Java to Swift. Given a software corpus in both languages, J2SLNFERER first identifies the syntactically equivalent code based on braces and string similarity. For each pair of similar code segments, J2SLNFERER then creates syntax trees of both languages, leveraging the minimalist domain knowledge of language correspondence (e.g., operators and markers) to iteratively align syntax tree nodes, and to infer both syntax and API mapping rules. J2SLNFERER represents inferred rules as string templates, stored in a database, to translate code from Java to Swift. We evaluated J2SLNFERER with four applications, using one part of the data to infer translation rules, and the other part to apply the rules. With 76% in-project accuracy and 65% cross-project accuracy, J2SLNFERER outperforms in accuracy j2swift, a state-of-the-art Java-to-Swift conversion tool. As native cross-platform mobile apps grow in popularity, J2SLNFERER can shorten their time to market by automating the tedious and error prone task of source-to-source translation.more » « less
-
C is an unsafe language. Researchers have been developing tools to port C to safer languages such as Rust, Checked C, or Go. Existing tools, however, resort to preprocessing the source file first, then porting the resulting code, leaving barely recognizable code that loses macro abstractions. To preserve macro usage, porting tools need analyses that understand macro behavior to port to equivalent constructs. But macro semantics differ from typical functions, precluding simple syntactic transformations to port them. We introduce the first comprehensive framework for analyzing the portability of macro usage. We decompose macro behavior into 26 fine-grained properties and implement a program analysis tool, called Maki, that identifies them in real-world code with 94% accuracy. We apply Maki to 21 programs containing a total of 86,199 macro definitions. We found that real-world macros are much more portable than previously known. More than a third (37%) are easy-to-port, and Maki provides hints for porting more complicated macros. We find, on average, 2x more easy-to-port macros and up to 7x more in the best case compared to prior work. Guided by Maki's output, we found and hand-ported macros in four real-world programs. We submitted patches to Linux maintainers that transform eleven macros, nine of which have been accepted.more » « less
-
Abstract AQME, automated quantum mechanical environments, is a free and open‐source Python package for the rapid deployment of automated workflows using cheminformatics and quantum chemistry. AQME workflows integrate tasks performed across multiple computational chemistry packages and data formats, preserving all computational protocols, data, and metadata for machine and human users to access and reuse. AQME has a modular structure of independent modules that can be implemented in any sequence, allowing the users to use all or only the desired parts of the program. The code has been developed for researchers with basic familiarity with the Python programming language. The CSEARCH module interfaces to molecular mechanics and semi‐empirical QM (SQM) conformer generation tools (e.g., RDKit and Conformer–Rotamer Ensemble Sampling Tool, CREST) starting from various initial structure formats. The CMIN module enables geometry refinement with SQM and neural network potentials, such as ANI. The QPREP module interfaces with multiple QM programs, such as Gaussian, ORCA, and PySCF. The QCORR module processes QM results, storing structural, energetic, and property data while also enabling automated error handling (i.e., convergence errors, wrong number of imaginary frequencies, isomerization, etc.) and job resubmission. The QDESCP module provides easy access to QM ensemble‐averaged molecular descriptors and computed properties, such as NMR spectra. Overall, AQME provides automated, transparent, and reproducible workflows to produce, analyze and archive computational chemistry results. SMILES inputs can be used, and many aspects of tedious human manipulation can be avoided. Installation and execution on Windows, macOS, and Linux platforms have been tested, and the code has been developed to support access through Jupyter Notebooks, the command line, and job submission (e.g., Slurm) scripts. Examples of pre‐configured workflows are available in various formats, and hands‐on video tutorials illustrate their use.
This article is categorized under:
Data Science > Chemoinformatics
Data Science > Computer Algorithms and Programming
Software > Quantum Chemistry