

Title: Dimensionally reduced machine learning model for predicting single component octanol–water partition coefficients
Abstract

MF-LOGP, a new method for determining single-component octanol–water partition coefficients ($LogP$), is presented; it uses molecular formula as the only input. Octanol–water partition coefficients are useful in many applications, ranging from environmental fate to drug delivery. Currently, partition coefficients are either experimentally measured or predicted as a function of structural fragments, topological descriptors, or thermodynamic properties known or calculated from precise molecular structures. The MF-LOGP method presented here differs from classical methods in that it does not require any structural information and uses molecular formula as the sole model input. MF-LOGP is therefore useful in situations where the structure is unknown or where low-dimensional, easily automatable, and computationally inexpensive calculations are required. MF-LOGP is a random forest algorithm that is trained and tested on 15,377 data points, using 10 features derived from the molecular formula to make $LogP$ predictions. Using an independent validation set of 2,713 data points, MF-LOGP was found to have an average $RMSE$ = 0.77 ± 0.007, $MAE$ = 0.52 ± 0.003, and $R^{2}$ = 0.83 ± 0.003. This performance fell within the range of performances reported in the published literature for conventional higher-dimensional models ($RMSE$ = 0.42–1.54, $MAE$ = 0.09–1.07, and $R^{2}$ = 0.32–0.95). Compared with existing models, MF-LOGP requires a maximum of ten features and no structural information, thereby providing a practical and yet predictive tool. The development of MF-LOGP provides the groundwork for the development of more physical prediction models leveraging big data analytical methods or complex multicomponent mixtures.
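
To make the idea of "features derived from the molecular formula" concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not list the ten features, so the element list `ELEMENTS` and the helper names `formula_to_counts`, `formula_to_features`, and `regression_metrics` are assumptions introduced here for illustration. The sketch turns a flat molecular formula into a ten-element count vector and computes the three reported metrics ($RMSE$, $MAE$, $R^{2}$) from their standard definitions.

```python
import math
import re
from collections import Counter

# Hypothetical feature set: the abstract does not name the ten features,
# so per-element atom counts for these ten elements are an assumption.
ELEMENTS = ["C", "H", "N", "O", "S", "P", "F", "Cl", "Br", "I"]


def formula_to_counts(formula):
    """Count atoms in a flat molecular formula such as 'C6H5Cl'.

    Parentheses, charges, and isotopes are not supported in this sketch.
    """
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = Counter()
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(digits) if digits else 1
    return counts


def formula_to_features(formula):
    """Return a fixed-length vector of atom counts, ordered as in ELEMENTS."""
    counts = formula_to_counts(formula)
    return [counts[element] for element in ELEMENTS]


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R^2) using the standard definitions."""
    n = len(y_true)
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(r * r for r in residuals)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(r) for r in residuals) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


if __name__ == "__main__":
    print(formula_to_features("C6H5Cl"))
    print(regression_metrics([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```

A vector like this could then be passed to any random forest regressor (for example, scikit-learn's `RandomForestRegressor`); that pairing is also an assumption, since the abstract does not name the software used.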

Award ID(s): 2021871
NSF-PAR ID: 10391990
Author(s) / Creator(s): ; ; ;
Publisher / Repository: Springer Science + Business Media
Date Published:
Journal Name: Journal of Cheminformatics
Volume: 15
Issue: 1
ISSN: 1758-2946
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    We perform path-integral molecular dynamics (PIMD), ring-polymer MD (RPMD), and classical MD simulations of H$_2$O and D$_2$O using the q-TIP4P/F water model over a wide range of temperatures and pressures. The density $\rho(T)$, isothermal compressibility $\kappa_T(T)$, and self-diffusion coefficient $D(T)$ of H$_2$O and D$_2$O are in excellent agreement with available experimental data; the isobaric heat capacity $C_P(T)$ obtained from PIMD and MD simulations agrees qualitatively well with the experiments. Some of these thermodynamic properties exhibit anomalous maxima upon isobaric cooling, consistent with recent experiments and with the possibility that H$_2$O and D$_2$O exhibit a liquid-liquid critical point (LLCP) at low temperatures and positive pressures. The data from PIMD/MD for H$_2$O and D$_2$O can be fitted remarkably well using the Two-State-Equation-of-State (TSEOS). Using the TSEOS, we estimate that the LLCP for q-TIP4P/F H$_2$O, from PIMD simulations, is located at $P_c = 167 \pm 9$ MPa, $T_c = 159 \pm 6$ K, and $\rho_c = 1.02 \pm 0.01$ g/cm$^3$. Isotope substitution effects are important; the LLCP location in q-TIP4P/F D$_2$O is estimated to be $P_c = 176 \pm 4$ MPa, $T_c = 177 \pm 2$ K, and $\rho_c = 1.13 \pm 0.01$ g/cm$^3$. Interestingly, for the water model studied, differences in the LLCP location from PIMD and MD simulations suggest that nuclear quantum effects (i.e., atomic delocalization) play an important role in the thermodynamics of water around the LLCP (from the MD simulations of q-TIP4P/F water, $P_c = 203 \pm 4$ MPa, $T_c = 175 \pm 2$ K, and $\rho_c = 1.03 \pm 0.01$ g/cm$^3$). Overall, our results strongly support the liquid-liquid phase transition (LLPT) scenario as an explanation of water's anomalous behavior, independently of the fundamental differences between classical MD and PIMD techniques. The reported values of $T_c$ for D$_2$O and, particularly, H$_2$O suggest that improved water models are needed for the study of supercooled water.

     
  2. Abstract

    Consider two half-spaces $H_1^+$ and $H_2^+$ in $\mathbb{R}^{d+1}$ whose bounding hyperplanes $H_1$ and $H_2$ are orthogonal and pass through the origin. The intersection $\mathbb{S}_{2,+}^d := \mathbb{S}^d \cap H_1^+ \cap H_2^+$ is a spherical convex subset of the $d$-dimensional unit sphere $\mathbb{S}^d$, which contains a great subsphere of dimension $d-2$ and is called a spherical wedge. Choose $n$ independent random points uniformly at random on $\mathbb{S}_{2,+}^d$ and consider the expected facet number of the spherical convex hull of these points. It is shown that, up to terms of lower order, this expectation grows like a constant multiple of $\log n$. A similar behaviour is obtained for the expected facet number of a homogeneous Poisson point process on $\mathbb{S}_{2,+}^d$. The result is compared to the corresponding behaviour of classical Euclidean random polytopes and of spherical random polytopes on a half-sphere.

     
  3. Abstract

    A steady-state, semi-analytical model of energetic particle acceleration in radio-jet shear flows due to cosmic-ray viscosity obtained by Webb et al. is generalized to take into account more general cosmic-ray boundary spectra. This involves solving a mixed Dirichlet–Von Neumann boundary value problem at the edge of the jet. The energetic particle distribution function $f_0(r, p)$ at cylindrical radius $r$ from the jet axis (assumed to lie along the $z$-axis) is given by convolving the particle momentum spectrum $f_0(\infty, p')$ with the Green's function $G(r, p; p')$, which describes the monoenergetic spectrum solution in which $f_0 \to \delta(p - p')$ as $r \to \infty$. Previous work by Webb et al. studied only the Green's function solution for $G(r, p; p')$. In this paper, we explore, for the first time, solutions for more general and realistic forms of $f_0(\infty, p')$. The flow velocity $\mathbf{u} = u(r)\,\mathbf{e}_z$ is along the axis of the jet (the $z$-axis); $u$ is independent of $z$, and $u(r)$ is a monotonically decreasing function of $r$. The scattering time is $\tau(r, p) = \tau_0 (p/p_0)^{\alpha}$ in the shear flow region $0 < r < r_2$, and $\tau(r, p) = \tau_0 (p/p_0)^{\alpha} (r/r_2)^{s}$, where $s > 0$, in the region $r > r_2$ outside the jet. Other original aspects of the analysis are (i) the use of cosmic-ray flow lines in $(r, p)$ space to clarify the particle spatial transport and momentum changes and (ii) the determination of the probability distribution $\psi_p(r, p; p')$ that particles observed at $(r, p)$ originated from $r \to \infty$ with momentum $p'$. The acceleration of ultrahigh-energy cosmic rays in active galactic nuclei jet sources is discussed. Leaky box models for electron acceleration are described.

     
  4. Abstract

    Let $\mathbf{p}$ be a configuration of $n$ points in $\mathbb{R}^d$ for some $n$ and some $d \ge 2$. Each pair of points defines an edge, which has a Euclidean length in the configuration. A path is an ordered sequence of the points, and a loop is a path that begins and ends at the same point. A path or loop, as a sequence of edges, also has a Euclidean length, which is simply the sum of its Euclidean edge lengths. We are interested in reconstructing $\mathbf{p}$ given a set of edge, path and loop lengths. In particular, we consider the unlabeled setting where the lengths are given simply as a set of real numbers, and are not labeled with the combinatorial data describing which paths or loops gave rise to these lengths. In this paper, we study the question of when $\mathbf{p}$ will be uniquely determined (up to an unknowable Euclidean transform) from some given set of path or loop lengths through an exhaustive trilateration process. Such a process has already been used for the simpler problem of reconstruction using unlabeled edge lengths. This paper also provides a complete proof that this process must work in that edge-setting when given a sufficiently rich set of edge measurements and assuming that $\mathbf{p}$ is generic.

     
  5. Abstract

    We study the family of irreducible modules for quantum affine $\mathfrak{sl}_{n+1}$ whose Drinfeld polynomials are supported on just one node of the Dynkin diagram. We identify all the prime modules in this family and prove a unique factorization theorem. The Drinfeld polynomials of the prime modules encode information coming from the points of reducibility of tensor products of the fundamental modules associated to $A_{m}$ with $m \leq n$. These prime modules are a special class of the snake modules studied by Mukhin and Young. We relate our modules to the work of Hernandez and Leclerc and define generalizations of the category $\mathscr{C}^{-}$. This leads naturally to the notion of an inflation of the corresponding Grothendieck ring. In the last section we show that the tensor product of a (higher order) Kirillov–Reshetikhin module with its dual always contains an imaginary module in its Jordan–Hölder series and give an explicit formula for its Drinfeld polynomial. Together with the results of [D. Hernandez and B. Leclerc, A cluster algebra approach to $q$-characters of Kirillov–Reshetikhin modules, J. Eur. Math. Soc. (JEMS) 18 (2016), no. 5, 1113–1159] this gives examples of a product of cluster variables which are not in the span of cluster monomials. We also discuss the connection of our work with the examples arising from the work of [E. Lapid and A. Mínguez, Geometric conditions for $\square$-irreducibility of certain representations of the general linear group over a non-archimedean local field, Adv. Math. 339 (2018), 113–190]. Finally, we use our methods to give a family of imaginary modules in type $D_{4}$ which do not arise from an embedding of $A_{r}$ with $r \leq 3$ in $D_{4}$.

     