Abstract Proton transfers are fundamental steps in polar reaction mechanisms. We generated a large dataset of over 51 million kinetically plausible proton transfer steps between heteroatoms from about 8,000 acids and conjugate bases with experimental aqueous pKas, spanning pKa values from −15 to +37. Rate factors were estimated at 25 °C using a simplified Eigen equation with pKas but without statistical factors. Steps with estimated rate constants ≥ 10³ M⁻¹ s⁻¹ were included in the final dataset. Additionally, 5,043 proton transfer steps from carbon acids to heteroatom bases were estimated using the Eigen-Bernasconi equation based on reported intrinsic rate constants and Brønsted β values. Carbon proton transfers with rate constants ≥ 10³ M⁻¹ s⁻¹ were added to the final dataset. Each entry was encoded in SMIRKS format with electron-flow specification for machine learning compatibility. Diversity of structure was prioritized over diversity of conditions; calculated rate constants are expected to be accurate in aqueous environments. This approach and dataset should prove valuable for training models to predict stepwise mechanistic pathways.
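The rate estimate described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the simplified Eigen equation, so the sketch assumes a commonly used form, k = k_d / (1 + 10^(pKa(HA) − pKa(HB⁺))), with an assumed diffusion-limited rate constant k_d = 10¹⁰ M⁻¹ s⁻¹. The function names and the value of k_d are illustrative assumptions, not taken from the dataset; only the 10³ M⁻¹ s⁻¹ cutoff comes from the abstract.

```python
import math

# ASSUMPTION: diffusion-limited rate constant (M^-1 s^-1); not a value from the dataset.
K_DIFFUSION = 1.0e10
# Inclusion threshold stated in the abstract: k >= 10^3 M^-1 s^-1.
LOG_K_THRESHOLD = 3.0


def eigen_log_k(pka_acid: float, pka_conj_acid_of_base: float,
                k_diffusion: float = K_DIFFUSION) -> float:
    """Estimate log10(k) for HA + B -> A- + HB+ (hypothetical helper).

    Assumed form: k = k_diffusion / (1 + 10**(pKa(HA) - pKa(HB+))),
    with no statistical factors.
    """
    delta = pka_acid - pka_conj_acid_of_base  # > 0 means a thermodynamically uphill transfer
    # log10(1 + 10**delta), arranged so the power of ten never overflows
    if delta > 0:
        log_denominator = delta + math.log10(1.0 + 10.0 ** (-delta))
    else:
        log_denominator = math.log10(1.0 + 10.0 ** delta)
    return math.log10(k_diffusion) - log_denominator


def is_kinetically_plausible(log_k: float,
                             threshold: float = LOG_K_THRESHOLD) -> bool:
    """Apply the abstract's cutoff of 10^3 M^-1 s^-1 on the log scale."""
    return log_k >= threshold
```

Under these assumptions, a strongly downhill transfer approaches the diffusion limit, while a transfer that is seven or more pKa units uphill falls just below the cutoff. A different choice of k_d would shift that boundary accordingly.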
Plausible Proton Transfer Data Files
A zipped file containing:
- 51M_Heteroatom.csv - 51M proton transfer steps from heteroatom acids to heteroatom bases, with SMIRKS, calculated log k1, and pKas
- 5KCarbonPT.csv - 5K proton transfer steps from carbon acids to heteroatom bases, with SMIRKS, calculated log k1, pKas, Brønsted β values, statistical factors (qB, pB, qC, pC), and intrinsic rate constants (ko)
- 49ExperimentalCarbonPT.csv - 49 proton transfers from heteroatom acids to carbon bases in SMIRKS format, with experimentally measured log k1 and literature references
- 51M_heteroatom_raw - a subfolder with two files (Acid.csv, ConBase.csv) containing lists of 7.6K heteroatomic acids and bases in SMILES format with the acidic and basic atoms labeled, with pKas and literature references
- 100_Heteroatom.csv - a representative sample set of 100 out of the 51M proton transfer steps
- 100K_Heteroatom.csv - a representative sample set of 100,000 out of the 51M proton transfer steps
- carbon_acid_raw - a subfolder containing a list of intrinsic rate constants for carbon acids in SMILES format, with statistical factors (Carbon_Acids.csv), and a subfolder named Bases containing seven lists of heteroatom base classes (ArO-.csv, R2NH.csv, R3N.csv, "RCO2- and ArCO2-.csv", RNH2.csv, RO-.csv and RS-.csv). Lists of heteroatom bases are in SMILES format, sectioned by class and with statistical factors, selected from the Heteroatom set
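Because the main heteroatom file holds about 51 million rows, it may be more practical to stream it than to load it whole. The following is a minimal sketch using only the Python standard library. The listing above does not give the CSV column headers, so the column name is passed in as a parameter rather than assumed; the function name `iter_steps_above` is hypothetical.

```python
import csv
from typing import IO, Iterator


def iter_steps_above(handle: IO[str], log_k_column: str,
                     min_log_k: float = 3.0) -> Iterator[dict]:
    """Yield CSV rows whose log k value is at least min_log_k, one at a time.

    log_k_column is supplied by the caller: the actual header names in the
    dataset files are not stated in the listing and must be checked first.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or log_k_column not in reader.fieldnames:
        raise KeyError(f"column {log_k_column!r} not found in {reader.fieldnames}")
    for row in reader:
        try:
            value = float(row[log_k_column])
        except (TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric value
        if value >= min_log_k:
            yield row
```

A reasonable first step is to run this on 100_Heteroatom.csv, opened with `open(path, newline="")`, after inspecting its header row, and only then move to the full file.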
- Award ID(s):
- 1955811
- PAR ID:
- 10670686
- Publisher / Repository:
- figshare
- Date Published:
- Subject(s) / Keyword(s):
- Organic chemistry not elsewhere classified; Physical organic chemistry
- Format(s):
- Medium: X Size: 943147581 Bytes
- Size(s):
- 943147581 Bytes
- Sponsoring Org:
- National Science Foundation
More Like this
-
Gagliardi, Laura (Ed.) The formic acid-ammonia dimer is an important example of a hydrogen-bonded complex in which a double proton transfer can occur. Its microwave spectrum has recently been reported and rotational constants and quadrupole coupling constants were determined. Calculated estimates of the double-well barrier and the internal barriers to rotation were also reported. Here we report a full-dimensional potential energy surface (PES) for this complex, using two closely related Δ-machine learning methods to bring it to the CCSD(T) level of accuracy. The PES dissociates smoothly and accurately. Using a 2d quantum model, the ground vibrational-state tunneling splitting is estimated to be less than 10⁻⁴ cm⁻¹. The dipole moment along the intrinsic reaction coordinate is calculated along with a Mulliken charge analysis and supports mildly ionic character of the minimum and strongly ionic character at the double-well barrier.
-
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. 
In addition, the method of the statistical H0 hypothesis testing of the linear regression between real and observed activities based on the F(2, n−2) criterion was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
-
Nanopore sequencing enables direct, single-molecule interrogation of biopolymers and shows promise for analyzing not only DNA and RNA but also chemically modified bases, proteins, and other polymers. Expanded DNA alphabets, such as those found in xenonucleic acids (XNAs), open new possibilities for diagnostics, therapeutics, data storage, and engineered biology. However, robust sequencing strategies for these modified molecules remain lacking. While nanopore-based tools exist for some noncanonical bases, they often require extensive experimental calibration by measuring each base across many sequence contexts, which limits scalability and increases cost. In this work, we investigate computational methods for predicting the ionic current signals produced during nanopore sequencing of DNA containing noncanonical XNA bases, aiming to reduce the need for experimental calibration. We compare a sequence-based predictive model with two structure-aware approaches: one using graph-based molecular representations and another adapting a generative language model to molecular SMILES. Our findings show that while sequence context captures much of the signal variability, incorporating structural and chemical information improves predictive accuracy in specific cases. These results highlight the value of structural data representations and model design in scaling XNA sequencing, and suggest this framework could extend to modeling ionic currents from other complex biomolecules, such as proteins.
-
Abstract Uracil DNA-glycosylase (UNG) is a DNA repair enzyme that removes the highly mutagenic uracil lesion from DNA using a base flipping mechanism. Although this enzyme has evolved to remove uracil from diverse sequence contexts, UNG excision efficiency depends on DNA sequence. To provide the molecular basis for rationalizing UNG substrate preferences, we used time-resolved fluorescence spectroscopy, NMR imino proton exchange measurements, and molecular dynamics simulations to measure UNG specificity constants (kcat/KM) and DNA flexibilities for DNA substrates containing central AUT, TUA, AUA, and TUT motifs. Our study shows that UNG efficiency is dictated by the intrinsic deformability around the lesion, establishes a direct relationship between substrate flexibility modes and UNG efficiency, and shows that bases immediately adjacent to the uracil are allosterically coupled and have the greatest impact on substrate flexibility and UNG activity. The finding that substrate flexibility controls UNG efficiency is likely significant for other repair enzymes and has major implications for the understanding of mutation hotspot genesis, molecular evolution, and base editing.
