

Title: Data and code for "Large language models design sequence-defined macromolecules via evolutionary optimization"
Code and data for "Large language models design sequence-defined macromolecules via evolutionary optimization". This repository contains the code and data files for the manuscript; it is a snapshot of the repository, frozen at the time of submission. Code: LLM scripts, other algorithms, postprocessing, and visualization. Data files: prompts, models, embeddings, and LLM responses.
Award ID(s):
2401663
PAR ID:
10581848
Author(s) / Creator(s):
Publisher / Repository:
Zenodo
Date Published:
Subject(s) / Keyword(s):
Polymer sciences; Machine learning; Artificial intelligence; Molecular and chemical physics
Format(s):
Medium: X
Right(s):
Creative Commons Attribution 4.0 International
Sponsoring Org:
National Science Foundation
More Like this
  1. The repository link contains a README that gives an overview of the files along with the structure of the data. Additionally, for LLaMA and GPT-2, the files follow the human_{llm_name}{i}.jsonl naming format, where {llm_name} is the name of the LLM and {i} is the partition index; the partitions can be concatenated to form the full dataset for that LLM.
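A minimal sketch of the concatenation step described above. The function name `load_partitions` and the directory argument are illustrative, not part of the repository; only the human_{llm_name}{i}.jsonl naming convention comes from the description. A simple `cat human_gpt2*.jsonl > gpt2.jsonl` would also work at the shell.

```python
import json
from pathlib import Path

def load_partitions(llm_name, data_dir="."):
    """Concatenate the human_{llm_name}{i}.jsonl partitions into one record list."""
    records = []
    # Assumes single-digit (or zero-padded) partition indices, so that a
    # lexicographic sort of the filenames matches the numeric order.
    for path in sorted(Path(data_dir).glob(f"human_{llm_name}*.jsonl")):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    records.append(json.loads(line))
    return records
```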
  2. Large Language Models (LLMs) have demonstrated significant potential across various applications, but their use as AI copilots in complex and specialized tasks is often hindered by AI hallucinations, where models generate outputs that seem plausible but are incorrect. To address this challenge, we develop AutoFEA, an intelligent system that integrates LLMs with Finite Element Analysis (FEA) to automate the generation of FEA input files. Our approach features a novel planning method and a graph convolutional network (GCN)-Transformer Link Prediction retrieval model, which enhances the accuracy and reliability of the generated simulations. The AutoFEA system proceeds through five key steps: dataset preparation, step-by-step planning, GCN-Transformer Link Prediction retrieval, LLM-driven code generation, and simulation using CalculiX. In this workflow, the GCN-Transformer model predicts and retrieves relevant example codes based on relationships between different steps in the FEA process, guiding the LLM in generating accurate simulation codes. We validate AutoFEA using a specialized dataset of 512 meticulously prepared FEA projects, which provides a robust foundation for training and evaluation. Our results demonstrate that AutoFEA significantly reduces AI hallucinations by grounding LLM outputs in physically accurate simulation data, thereby improving the success rate and accuracy of FEA simulations and paving the way for future advancements in AI-assisted engineering tasks.
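The retrieval step of such a workflow can be sketched as follows. This is a stand-in illustration only: it scores planned steps against example snippets with plain cosine similarity, where the abstract's actual system uses a trained GCN-Transformer link predictor; all names (`retrieve_examples`, the embedding arrays) are hypothetical.

```python
import numpy as np

def retrieve_examples(step_embeddings, example_embeddings, k=2):
    """Return the indices of the k most similar example snippets per planned step.
    Cosine similarity stands in here for a learned link-prediction score."""
    a = step_embeddings / np.linalg.norm(step_embeddings, axis=1, keepdims=True)
    b = example_embeddings / np.linalg.norm(example_embeddings, axis=1, keepdims=True)
    sims = a @ b.T                           # (n_steps, n_examples) similarity matrix
    return np.argsort(-sims, axis=1)[:, :k]  # top-k example indices for each step
```

The retrieved snippets would then be placed in the LLM prompt for the code-generation step, grounding the generated input file in known-good examples.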
  3. We present the first results of a comprehensive supernova (SN) radiative-transfer (RT) code-comparison initiative (StaNdaRT), where the emission from the same set of standardised test models is simulated by currently used RT codes. We ran a total of ten codes on a set of four benchmark ejecta models of Type Ia SNe. We consider two sub-Chandrasekhar-mass (Mtot = 1.0 M⊙) toy models with analytic density and composition profiles and two Chandrasekhar-mass delayed-detonation models that are outcomes of hydrodynamical simulations. We adopt spherical symmetry for all four models. The results of the different codes, including the light curves, spectra, and the evolution of several physical properties as a function of radius and time are provided in electronic form in a standard format via a public repository. We also include the detailed test model profiles and several Python scripts for accessing and presenting the input and output files. We also provide the code used to generate the toy models studied here. In this paper, we describe the test models, radiative-transfer codes, and output formats in detail, and provide access to the repository. We present example results of several key diagnostic features.
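A typical comparison over such a repository is to put every code's light curve on a common time grid and look at the code-to-code scatter. The sketch below assumes nothing about the repository's actual file format or column names; `lightcurve_spread` and its arguments are hypothetical.

```python
import numpy as np

def lightcurve_spread(times_list, mags_list, grid):
    """Interpolate each code's light curve onto a common time grid and return
    the mean curve plus the code-to-code standard deviation at each epoch."""
    curves = np.array([np.interp(grid, t, m) for t, m in zip(times_list, mags_list)])
    return curves.mean(axis=0), curves.std(axis=0)
```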
  4. The two-way exchange of water, of properties such as heat and salinity, and of other suspended material between estuaries and the coastal ocean is important for regulating these marine habitats. This exchange can be challenging to measure. The Total Exchange Flow (TEF) method provides a way to organize the complexity of this exchange into distinct layers based on a given water property. This method has primarily been applied in numerical models that provide high-resolution output in space and time. The goal here is to identify the minimum horizontal and vertical sampling resolutions needed to measure TEF depending on estuary type. Results from three realistic hydrodynamic models were investigated. These models included three estuary types: bay (San Diego Bay: data/SDB_*.mat files), salt-wedge (Columbia River: data/CR_*.mat files), and fjord (Salish Sea: data/SJF_*.mat files). The models were sampled using three different mooring strategies, varying the number of mooring locations and sample depths with each method. This repository includes the Matlab code for repeating these sampling methods and TEF calculations using the data from the three estuary models listed above.
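The core TEF idea on one cross-section can be sketched as below. The repository's own calculations are in Matlab; this Python fragment only illustrates the simple "sign" approximation (bin transport by salinity class, then sum inflowing and outflowing classes), and the function name and bin choices are assumptions, not the repository's API.

```python
import numpy as np

def tef_sign_method(u, salt, area, salinity_bins):
    """Approximate Total Exchange Flow on one cross-section: bin the volume
    transport (u * area) by salinity class, then sum the positive and
    negative class transports to get Q_in and Q_out."""
    transport = np.asarray(u) * np.asarray(area)   # m^3/s contributed by each cell
    q, _ = np.histogram(salt, bins=salinity_bins, weights=transport)
    return q[q > 0].sum(), q[q < 0].sum()          # (Q_in, Q_out)
```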
  5. Atomic force microscopy (AFM) image raw data, force spectroscopy raw data, data analysis/data plotting, and force modeling.
File Formats: The raw files of the AFM imaging scans of the colloidal probe surface are provided in NT-MDT's proprietary .mdt file format, which can be opened using the Gwyddion software package. Gwyddion has been released under the GNU public software license GPLv3 and can be downloaded free of charge at http://gwyddion.net/. The processed image files are included in Gwyddion's .gwy file format. Force spectroscopy raw files are also provided in .mdt format, which can be opened using NT-MDT's NOVA Px software (we used 3.2.5 rev. 10881). All force data were converted to ASCII files (*.txt) using the NOVA Px software to also provide them in human-readable form with this data set. The MATLAB codes used for force curve processing and data analysis are given as *.m files and can be opened by MATLAB (https://www.mathworks.com/products/matlab) or by a text editor. The raw and processed force curve data and other values used for data processing are stored in binary form in *.mat MATLAB data files, which can be opened by MATLAB. Organized by figure, all raw and processed force curve data are given in Excel worksheets (*.xlsx), one per probe/substrate combination.
Data (Folder Structure): The data in the dataverse is best viewed in Tree mode.
Codes for Force Curve Processing: The three MATLAB codes used for force curve processing are contained in this folder. The text file Read me.txt provides all the instructions to process raw force data using these three MATLAB codes.
Figure 3B, 3C – AFM Images: The raw (.mdt) and processed (.gwy) AFM images of the colloidal probe before and after coating with graphene oxide (GO) are contained in this folder.
Figure 4 – Force Curve GO: The raw data of the force curve shown in Figure 4 and the substrate force curve data (used to find the inverse optical lever sensitivity) are given as .mdt files and were exported as ASCII files in the same folder. The raw and processed force curve data are also given in the variables_GO_Tip 18.mat and GO_Tip 18.xlsx files. The force curve processing codes and instructions can be found in the Codes for Force Curve Processing folder, as mentioned above.
Figure 5A – Force–Displacement Curves GO, rGO1, rGO10: All raw data of the force curves (GO, rGO1, rGO10) shown in Figure 5A and the corresponding substrate force curve data (used to find the inverse optical lever sensitivity) are given as .mdt files and were exported as ASCII files in the same folder. The raw and processed force curve data are also given in *.mat and *.xlsx files.
Figure 5B, 5C – Averages of Force and Displacement for Snap-On and Pull-Off Events: All raw data of the force curves (GO, rGO1, rGO10) for all probes and the corresponding substrate force curve data are given as .mdt files and were exported as ASCII files in this folder. The raw and processed force curve data are also provided in *.mat and *.xlsx files. The snap-on force, snap-on displacement, and pull-off displacement values were obtained from each force curve and averaged as in Code_Figure5B_5C.m. The same code was used for plotting the average values.
Figure 6A – Force–Distance Curves GO, rGO1, rGO10: The raw data provided in the Figure 5A – Force Displacement Curves GO, rGO1, rGO10 folder were processed into force-vs-distance curves. The raw and processed force curve data are also given in *.mat and *.xlsx files.
Figure 6B – Average Snap-On and Pull-Off Distances: The same raw data provided in the Figure 5B, 5C – Average Snap on Force, Displacement, Pull off Displacement folder were processed into force-vs-distance curves. The raw and processed force curve data of GO, rGO1, rGO10 for all probes are also given in *.mat and *.xlsx files. The snap-on distance and pull-off distance values were obtained from each force curve and averaged as in Code_Figure6B.m. The code used for plotting is also given in the same file.
Figure 6C – Contact Angles: Advancing and receding contact angles were calculated from each processed force-vs-distance curve and averaged according to the reduction time. The obtained values and the plotting code are given in Code_Figure6C.m.
Figure 9A – Force Curve Repetition: The raw data of all five force curves and the substrate force curve data are given as .mdt files and were exported as ASCII files in the same folder. The raw and processed force curve data are also given in *.mat and *.xlsx files.
Figure 9B – Repulsive Force Comparison: The data of the zoomed-in region of Figure 9A were plotted as the Experimental curve. Initial baseline correction was done using the MATLAB code bc.m; the procedure is given in the Read Me.txt file. All raw and processed data are given in the rGO10_Tip19_Trial1.xlsx and variables_rGO10_Tip 19.mat files. The MATLAB code used to model the other forces and plot all the curves in Figure 9B is given in Exp_vdW_EDL.m.
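The conversion from raw force–displacement data to the force-vs-distance curves mentioned above follows a standard AFM recipe, sketched here in Python. This is not the dataset's actual MATLAB code: the function name, argument names, and units are assumptions; only the use of an inverse optical lever sensitivity (InvOLS) from a hard-substrate reference curve comes from the description.

```python
import numpy as np

def displacement_to_force_distance(z_nm, defl_V, invols_nm_per_V, k_N_per_m):
    """Convert a deflection-vs-displacement curve to force vs tip-sample distance.
    InvOLS (nm/V) comes from a reference curve on a rigid substrate; k is the
    cantilever spring constant."""
    defl_nm = defl_V * invols_nm_per_V    # photodiode signal -> deflection (nm)
    force_nN = k_N_per_m * defl_nm        # Hooke's law; N/m * nm gives nN
    distance_nm = z_nm - defl_nm          # separation = piezo travel - deflection
    return distance_nm, force_nN
```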