Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.
- Free, publicly-accessible full text available July 17, 2025
- De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Eds.). The OSG-operated Open Science Pool (OSPool) is an HTCondor-based virtual cluster that aggregates resources from compute clusters provided by several organizations. Most of the resources are not owned by OSG, so demand-based dynamic provisioning is important for maximizing usage without incurring excessive waste. OSG has long relied on GlideinWMS for most of its resource provisioning needs, but GlideinWMS is limited to resources that provide a Grid-compliant Compute Entrypoint. To work around this limitation, the OSG Software Team developed a glidein container that resource providers can use to contribute directly to the OSPool. The problem with that approach is that it is not demand-driven, relegating it to backfill scenarios only. To address this limitation, a demand-driven direct provisioner of Kubernetes resources has been developed and successfully used on the NRP. The setup still relies on the OSG-maintained backfill container image but automates the provisioning matchmaking and the subsequent requests. The provisioner has also been extended to support Lancium, a green computing cloud provider with a Kubernetes-like proprietary interface. The provisioner logic has been intentionally kept very simple, making this extension a low-cost project. Both NRP and Lancium resources have been provisioned exclusively through this mechanism for many months.
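To illustrate the provisioning pattern described in this entry, the following is a minimal sketch of a demand-driven glidein request against a Kubernetes cluster such as the NRP, written with the official `kubernetes` Python client. The image tag, namespace, resource requests, and demand cap are illustrative placeholders, not the provisioner's actual configuration.

```python
# Minimal sketch of a demand-driven glidein provisioner for a Kubernetes
# cluster such as the NRP. Uses the official `kubernetes` Python client;
# the image name, namespace, and resource requests are placeholders.
from kubernetes import client, config

GLIDEIN_IMAGE = "hub.opensciencegrid.org/osg-htc/ospool-ep:latest"  # placeholder tag
NAMESPACE = "osg-glideins"  # hypothetical namespace


def submit_glideins(idle_jobs: int, max_glideins: int = 10) -> None:
    """Request one glidein pod per idle OSPool job, up to a cap."""
    config.load_kube_config()  # or load_incluster_config() when running in-cluster
    batch = client.BatchV1Api()
    for _ in range(min(idle_jobs, max_glideins)):
        container = client.V1Container(
            name="glidein",
            image=GLIDEIN_IMAGE,
            resources=client.V1ResourceRequirements(
                requests={"cpu": "8", "memory": "32Gi"}),
        )
        job = client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(generate_name="ospool-glidein-"),
            spec=client.V1JobSpec(
                backoff_limit=0,
                template=client.V1PodTemplateSpec(
                    spec=client.V1PodSpec(containers=[container],
                                          restart_policy="Never"))),
        )
        batch.create_namespaced_job(namespace=NAMESPACE, body=job)
```

In the actual provisioner, the idle-job count would come from matchmaking against the OSPool collector, and glidein lifetime and cleanup would also have to be managed; the sketch only shows the request side.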
- De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Eds.). Creating new materials, discovering new drugs, and simulating systems are essential processes for research and innovation and require substantial computational power. While many applications can be split into many smaller independent tasks, some cannot and may take hours or weeks to run to completion. To better manage those longer-running jobs, it would be desirable to stop them at any arbitrary point in time and later continue the computation on another compute resource; this is usually referred to as checkpointing. While some applications can manage checkpointing programmatically, it would be preferable if the batch scheduling system could do so independently. This paper evaluates the feasibility of using CRIU (Checkpoint Restore in Userspace), an open-source tool for GNU/Linux environments, with emphasis on the OSG's OSPool HTCondor setup. CRIU checkpoints the process state into a disk image and can deal with both open files and established network connections seamlessly. Furthermore, it can checkpoint both traditional Linux processes and containerized workloads. The functionality appears adequate for many of the scenarios supported in the OSPool; however, some limitations prevent it from being usable in all circumstances.
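As a rough illustration of what CRIU provides, the sketch below drives the `criu` command line from Python to dump a running process tree to an image directory and restore it later. The paths and PID are illustrative, CRIU typically needs root or equivalent capabilities, and a production HTCondor integration would invoke this through the batch system rather than an ad hoc script.

```python
# Minimal sketch of checkpointing and restoring a process with the CRIU CLI.
# Assumes `criu` is installed and the caller has the required privileges.
import subprocess
from pathlib import Path


def checkpoint(pid: int, image_dir: str) -> None:
    Path(image_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(
        ["criu", "dump",
         "-t", str(pid),          # process tree to checkpoint
         "-D", image_dir,         # directory for the checkpoint image files
         "--shell-job",           # the job was started from a shell/terminal
         "--tcp-established",     # also capture established TCP connections
         "--leave-running"],      # keep the original process alive after dumping
        check=True)


def restore(image_dir: str) -> None:
    subprocess.run(
        ["criu", "restore",
         "-D", image_dir,
         "--shell-job",
         "--tcp-established"],
        check=True)
```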
- De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Eds.). The IceCube Neutrino Observatory is a cubic-kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process, which requires the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results as the modeling of those effects becomes more detailed. IceCube's largest systematic effect comes from the optical properties of the ice in which the detector is embedded. Over the last few years there have been considerable improvements in the understanding of the ice, which require a significant processing campaign to update the simulation. IceCube normally stores the results in a central storage system at the University of Wisconsin–Madison, but that system ran out of disk space in 2022. The Prototype National Research Platform (PNRP) project therefore offered to provide both GPU compute and storage capacity to IceCube in support of this activity. Storage access was provided via XRootD-based OSDF Origins, a first for IceCube computing. We report on the overall experience using PNRP resources, covering both successes and pain points.
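As an illustration of the XRootD-based access path mentioned above, the sketch below reads a file from an origin using the XRootD Python bindings (pyxrootd). The origin hostname and file path are placeholders, and the real IceCube workflows additionally handle authentication and OSDF redirection and caching rather than contacting a single origin directly.

```python
# Minimal sketch of reading a file from an XRootD-based origin with the
# XRootD Python bindings. Hostname and path are placeholders.
from XRootD import client
from XRootD.client.flags import OpenFlags

URL = "root://osdf-origin.example.edu:1094//icecube/sim/example_file.i3.zst"

with client.File() as f:
    status, _ = f.open(URL, OpenFlags.READ)
    if not status.ok:
        raise RuntimeError(f"open failed: {status.message}")
    status, stat_info = f.stat()
    print(f"size: {stat_info.size} bytes")
    status, data = f.read(offset=0, size=1024 * 1024)  # read the first MiB
    print(f"read {len(data)} bytes")
```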
- De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Eds.). Due to the increased network traffic expected during the HL-LHC era, the T2 sites in the USA will be required to have 400 Gbps of available bandwidth to their storage solutions. With this in mind, we are pursuing a scale test of the XRootD software when used to perform Third Party Copy transfers over the HTTP protocol. Our main objective is to understand possible limitations in the software stack that would prevent reaching the target transfer rate; to that end, we have set up a testbed of multiple XRootD servers at both UCSD and Caltech, connected through a dedicated link capable of 400 Gbps end to end. Building upon our experience deploying containerized XRootD servers, we use Kubernetes to easily deploy and test different configurations of the testbed. In this work we present our experience performing these tests and the lessons learned.
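The transfer mode being exercised can be sketched as follows: in pull-mode HTTP third-party copy, the client sends a WebDAV COPY request to the destination endpoint with a `Source` header, and the destination server fetches the file itself while streaming back periodic performance markers. The sketch below uses the `requests` library; the hostnames, paths, bearer token, and CA directory are placeholders, and real transfers also negotiate checksums and token scopes.

```python
# Minimal sketch of a pull-mode HTTP third-party-copy (TPC) request.
# All endpoints and credentials below are placeholders.
import requests

DEST = "https://xrootd-dest.example.edu:1094/store/test/file.bin"
SOURCE = "https://xrootd-src.example.edu:1094/store/test/file.bin"
TOKEN = "..."  # placeholder bearer token

resp = requests.request(
    "COPY",
    DEST,
    headers={
        "Source": SOURCE,                                  # pull-mode TPC
        "Overwrite": "T",                                  # allow overwriting the destination
        "Authorization": f"Bearer {TOKEN}",                # authorization at the destination
        "TransferHeaderAuthorization": f"Bearer {TOKEN}",  # forwarded to the source
    },
    verify="/etc/grid-security/certificates",              # grid CA directory (site-specific)
    stream=True,
)
print(resp.status_code)
for line in resp.iter_lines():
    # the response body streams performance markers while the transfer runs
    print(line.decode(errors="replace"))
```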
- Microarchitecture: A useful tool to organize machines in heterogeneous shared computing environments. De Vita, R.; Espinal, X.; Laycock, P.; Shadura, O. (Eds.). The x86_64 instruction set architecture is not a single, consistent, compatible interface for executing computer programs. Since the initial release in 1999, every new generation has added new instructions, some of which were later removed. Most of these new instructions are intended to improve the performance of programs that explicitly take advantage of them. However, running such a program on an older CPU without the appropriate support results in a Linux SIGILL signal, which is difficult for end users to diagnose. On the other hand, compiling scientific code for the least-common-denominator ISA can leave significant performance on the table. High-throughput systems, which contain a very large number of machines, cannot require a single CPU version across hundreds of thousands of machines operating in dozens of sites. The OSG Open Science Pool alone consists of more than 20 different, subtly incompatible x86_64 implementations. In 2020, Intel, AMD, and Red Hat proposed new terminology that partitions these dozens of microarchitectures into a strict hierarchy of four levels. The HTCondor Software Suite and the OSG now have first-class support for these microarchitecture levels. This paper discusses the advantages for users and future work around microarchitecture support.
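The four-level hierarchy (x86-64-v1 through x86-64-v4) can be illustrated with a simplified classifier that inspects the CPU flags reported in /proc/cpuinfo. The flag sets below are abbreviated subsets of the official level definitions, so this is a sketch rather than an authoritative test.

```python
# Simplified sketch of classifying the local CPU into one of the four
# x86-64 microarchitecture levels by inspecting /proc/cpuinfo flags.
# The flag sets are abbreviated; see the 2020 Intel/AMD/Red Hat psABI
# levels for the full definitions.
LEVEL_FLAGS = {
    "x86-64-v2": {"cx16", "lahf_lm", "popcnt", "pni", "sse4_1", "sse4_2", "ssse3"},
    "x86-64-v3": {"avx", "avx2", "bmi1", "bmi2", "f16c", "fma", "movbe", "xsave"},
    "x86-64-v4": {"avx512f", "avx512bw", "avx512cd", "avx512dq", "avx512vl"},
}


def microarch_level() -> str:
    """Return the highest x86-64 level whose (abbreviated) flag set is present."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                break
        else:
            return "unknown"
    level = "x86-64-v1"
    for name in ("x86-64-v2", "x86-64-v3", "x86-64-v4"):
        if LEVEL_FLAGS[name] <= flags:
            level = name
        else:
            break
    return level


if __name__ == "__main__":
    print(microarch_level())
```

In a pool like the OSPool, such a classification can then be advertised as a machine attribute so that jobs can express requirements on a minimum microarchitecture level rather than on individual CPU models.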