<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>L+M-24: Building a Dataset for Language+Molecules @ ACL 2024</title></titleStmt>
			<publicationStmt>
				<publisher>Association for Computational Linguistics</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10633808</idno>
					<idno type="doi">10.18653/v1/2024.langmol-1.1</idno>
					
					<author>Carl Edwards</author><author>Qingyun Wang</author><author>Lawrence Zhao</author><author>Heng Ji</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Language-molecule models have emerged as an exciting direction for molecular discovery and understanding. However, training these models is challenging due to the scarcity of molecule-language pair datasets. At this point, datasets have been released which are 1) small and scraped from existing databases, 2) large but noisy and constructed by performing entity linking on the scientific literature, and 3) built by converting property prediction datasets to natural language using templates. In this document, we detail the L+M-24 dataset, which has been created for the Language + Molecules Workshop shared task at ACL 2024. In particular, L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The world faces an enormous number of problems in the coming decades on scales of complexity never-before-seen, in areas such as climate change, healthcare, and pandemics. To address these issues, we need to discover inventive scientific solutions which are scalable, flexible, and inexpensive. Broadly speaking, many of these problems will require molecular solutions from the chemistry domain, such as developing new drugs (e.g. kinase inhibitors <ref type="bibr">(Ferguson and Gray, 2018)</ref>), materials (e.g. organic photovoltaics <ref type="bibr">(Kippelen and Br&#233;das, 2009)</ref>), and chemical processes <ref type="bibr">(Zhong et al., 2023)</ref>. These solutions exist in extremely large search spaces, which makes AI tools a necessity.</p><p>Language-molecule models have emerged as an exciting direction for molecular discovery and understanding <ref type="bibr">(Edwards et al., 2021;</ref><ref type="bibr">Zeng et al., 2022;</ref><ref type="bibr">Edwards et al., 2022;</ref><ref type="bibr">Su et al., 2022;</ref><ref type="bibr">Liu et al., 2022;</ref><ref type="bibr">Xu et al., 2023;</ref><ref type="bibr">Christofidellis et al., 2023;</ref><ref type="bibr">Liu et al., 2023b;</ref><ref type="bibr">Luo et al., 2023;</ref><ref type="bibr">Zhao et al., 2023c;</ref><ref type="bibr">Seidl et al., 2023)</ref>. However, training these models is challenging due to the scarcity of molecule-language pair datasets. 
At this point, datasets have been released which are 1) small and scraped from existing databases <ref type="bibr">(Edwards et al., 2021;</ref><ref type="bibr">Zeng et al., 2023;</ref><ref type="bibr">Liu et al., 2023a,c;</ref><ref type="bibr">Pei et al., 2023)</ref>, 2) large but noisy and constructed by performing entity linking on the scientific literature <ref type="bibr">(Zeng et al., 2022;</ref><ref type="bibr">Su et al., 2022)</ref>, and 3) built by converting property prediction datasets to natural language using templates <ref type="bibr">(Zhao et al., 2023a;</ref><ref type="bibr">Fang et al., 2023)</ref>. Approaches utilizing pseudo-data have also been attempted <ref type="bibr">(Chen et al., 2023a)</ref>. These approaches have helped remedy data scarcity in this domain; however, they frequently ignore key benefits of natural language: 1) compositionality, 2) abstraction, and 3) functionality <ref type="bibr">(Zhang et al., 2023)</ref>. To this end, for the Language + Molecules Workshop at ACL 2024, we release L+M-24, which we construct to test these three goals, particularly compositionality, using recently released data sources <ref type="bibr">(Zhao et al., 2023b;</ref><ref type="bibr">Kosonocky et al., 2023;</ref><ref type="bibr">Wishart et al., 2023)</ref>. L+M-24 is divided into four categories with important applications in the small-molecule domain: 1) Biomedical, 2) Light and Electricity, 3) Human Interaction and Organoleptics, and 4) Agriculture and Industry. Improving understanding of these applications can have important implications for problems such as drug discovery, climate issues, more efficient and green industrial processes, and improved food production.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Task Formulation</head><p>The dataset is primarily intended for language&#8596;molecule translation, which consists of two tasks: generating 1) a caption given a molecule and 2) a molecule given a description.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Designing for Compositionality, Abstraction, and Function</head><p>Overall, we focused on four primary categories of importance: 1) Biomedical, 2) Light and Electricity, 3) Human Interaction and Organoleptics, and 4) Agriculture and Industry. These categories and three properties from each are displayed in Table <ref type="table">1</ref>. The biomedical category is focused on drug properties, functions, and interactions with proteins.</p><p>Light and electricity is focused on the ability of a molecule to produce or absorb light or electricity.</p><p>Human interaction and organoleptics focuses on the effects and experiences molecules cause in humans.</p><p>Agriculture and industry focuses on molecules used in industrial processes and food production.</p><p>Based on our data sources (below), the properties we have selected already encode a large degree of functionality, enhanced by our manual curation. Further, since these properties are generally short phrases indicating functionality, they are also abstract and apply to many molecules (e.g., "insecticide"). For compositionality, we explicitly select certain pairs of properties which we hold out of the dataset. For example, a molecule may have two properties which are desirable together (e.g., low toxicity and fungicidal). L+M-24 will help to evaluate whether models can generalize to unseen compositions of properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Data Sources</head><p>We constructed our dataset using three different databases. We will first describe the process we used to extract information from each, followed by our overall strategy for adding hierarchy into the dataset. We want to deeply thank the authors of these resources for making them publicly available for the community.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">PubChem</head><p>We used properties extracted from PubChem <ref type="bibr">(Kim et al., 2016, 2019)</ref> as described in <ref type="bibr">(Zhao et al., 2023c)</ref>. Properties from this approach include odor, taste, and decomposition. We note these properties consist of molecule-specific descriptions, which the other data sources do not provide.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Chemical Function (CheF)</head><p>Here, we used functional properties extracted from patent literature by <ref type="bibr">Kosonocky et al. (2023)</ref>. This allowed us to capture molecules from the patent literature in addition to the scientific literature.</p><p>Table 1: Example properties in the dataset. Biomedical: anti neoplastic, glaucoma treatment, capillarigenic. Light and Electricity: photoelectric conversion, photopolymerization, dielectric. Human Interaction: pungent, bitter, nephrotoxic. Agriculture and Industry: herbicide, emulsifier, carcinogen. Antineoplastic drugs are used to treat cancer. Glaucoma is a group of eye diseases. Capillarigenic means producing or causing capillaries. Pungent means having a strong taste or smell. Nephrotoxic means toxic to the kidneys. Photoelectric conversion is the conversion of light into electricity. Photopolymerization is the process through which monomers are linked together through a photochemical reaction. A dielectric is a poor conductor of electricity but can be polarized. A herbicide is toxic to plants. An emulsifier stabilizes an emulsion. A carcinogen is an agent capable of causing cancer. The full property list and number of occurrences is available in the online data repository.</p><p>Here, we started with CheF prefinal_v3. We created a set of properties from both CheF's property summarizations and from the ChatGPT summarization source. For the summarization source, we also applied the WordNet lemmatizer <ref type="bibr">(Bird et al., 2009)</ref> for deduplication. After obtaining a list of properties, we removed properties pertaining to fewer than 100 molecules. We then kept properties falling into the categories of "X-icide", "anti-X", "X treatment", "X modulators", "X inhibitors", "X agonists", "X antagonists", "light", and "electricity." We manually removed uninformative labels which were too broad or did not describe enough function. 
Further, we manually corrected errors in label naming and duplication.</p></div>
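The frequency and category filtering described above can be sketched as follows. This is an illustrative reimplementation: the function name, regular expressions, and threshold handling are ours rather than the exact released pipeline (which also applies WordNet lemmatization for deduplication).

```python
import re

# Heuristic patterns for the kept categories from Section 3.2
# ("X-icide", "anti-X", "X treatment", "X modulators", "X inhibitors",
# "X agonists", "X antagonists", "light", "electricity").
# These regexes are our approximation, not the authors' exact rules.
KEEP_PATTERNS = [
    r"icide$", r"^anti[- ]", r"treatment$", r"modulators?$",
    r"inhibitors?$", r"agonists?$", r"antagonists?$",
    r"\blight\b", r"\belectricity\b",
]

def filter_properties(property_to_molecules, min_molecules=100):
    """Keep properties with >= min_molecules distinct molecules that
    match at least one category pattern."""
    kept = {}
    for prop, mols in property_to_molecules.items():
        if len(set(mols)) < min_molecules:
            continue  # drop rare properties
        if any(re.search(pat, prop) for pat in KEEP_PATTERNS):
            kept[prop] = sorted(set(mols))
    return kept
```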
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">ChemFOnt: the chemical functional ontology resource</head><p>In addition to CheF, we also take advantage of another new chemical function data resource: ChemFOnt <ref type="bibr">(Wishart et al., 2023)</ref>. From this data source, we collect three categories: health effect relations, organoleptic effect relations, and role relations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 1</head><p>Figure 1: Example descriptions created for molecules from the training set. The pipeline runs from data sources, through GPT-4-written templates, to compositional captions for molecules with multiple properties. For example, the properties "alzheimer's treatment" and "bace1 inhibitor" become "The molecule is both a alzheimer's treatment and a bace1 inhibitor."; the properties "anti viral agent", "mitogen", "anti carcinogenic", "lipoxygenase inhibitor", "fungicide", and "anti oxidant" become "The molecule is a mitogen and lipoxygenase inhibitor, belonging to the anti oxidant class, and is characterized as anti viral agent, anti carcinogenic, and fungicide."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Template Generation</head><p>We utilize GPT-4 <ref type="bibr">(OpenAI, 2023)</ref> to generate specific templates for each combination of molecular properties. Specifically, we manually write six templates: "The molecule is a &lt;0&gt;. </p></div>
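The slot format can be made concrete with a minimal filler. The helper below is ours; the actual six seed templates and their GPT-4 expansions are in the released data repository.

```python
def fill_template(template: str, properties: list) -> str:
    """Replace slots <0>, <1>, ... with property names, in order."""
    for i, prop in enumerate(properties):
        template = template.replace(f"<{i}>", prop)
    return template

# Two illustrative templates in the paper's slot notation.
ONE_SLOT = "The molecule is a <0>."
TWO_SLOT = "The molecule is both a <0> and a <1>."
```

Applied to the Figure 1 example, `fill_template(TWO_SLOT, ["alzheimer's treatment", "bace1 inhibitor"])` reproduces the caption "The molecule is both a alzheimer's treatment and a bace1 inhibitor."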
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Converting Templates to Descriptions</head><p>We first assigned all properties in L+M-24 to possible templates based on their category or by individual consideration. Certain properties (e.g., polymerization, decomposition) were already expressed in sentence format, so we did not use templates for them. Given a molecule with n properties, we first looked for a template that had the correct slots (e.g., &lt;0&gt;, &lt;1&gt;, and &lt;2&gt;) for its properties.</p><p>When we found possible templates, we picked one at random and used it to generate a sentence for the molecule's properties. If there were no matching templates, we split the properties into two separate equal-sized groups and tried again with each group. We return the concatenation of the two sentence templates as the molecule description. Note that this process can repeat multiple times.</p><p>We note that we are also releasing a version of the dataset with 5 captions for each molecule. In this case, we split group sizes at random. Further, we split sentences apart 50% of the time (even when there were matching templates) to increase caption diversity.</p></div>
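The matching-and-splitting procedure above can be sketched recursively. All names here are ours and the template set is a toy one; the released pipeline additionally randomizes group sizes and sentence splits for the 5-caption variant.

```python
import random

def fill_template(template, properties):
    """Replace slots <0>, <1>, ... with property names, in order."""
    for i, prop in enumerate(properties):
        template = template.replace(f"<{i}>", prop)
    return template

def describe(properties, templates_by_arity, rng=random):
    """Pick a random template whose slot count matches the number of
    properties; otherwise split the properties into two equal-sized
    groups, caption each, and concatenate (Section 4.2). The splitting
    can repeat multiple times for molecules with many properties."""
    candidates = templates_by_arity.get(len(properties), [])
    if candidates:
        return fill_template(rng.choice(candidates), list(properties))
    if len(properties) == 1:
        # Fallback so the recursion always terminates (our addition).
        return f"The molecule is a {properties[0]}."
    mid = len(properties) // 2
    return (describe(properties[:mid], templates_by_arity, rng) + " "
            + describe(properties[mid:], templates_by_arity, rng))
```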
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Splitting</head><p>Duplicate molecules are merged using RDKit (Landrum, 2021), and molecules which cannot be processed are removed. We split the data by first examining property combinations. 20% of combinations are withheld for the evaluation set. Of the molecules in the remaining 80%, we keep 80% for training and put 20% in evaluation. The evaluation set is split into two tasks: molecule captioning and molecule generation. For each task, only one modality will be released prior to the shared task results.</p><p>The training set consists of 160,492 molecule-description pairs. For the evaluation set, both molecule generation and captioning contain 21,839 pairs. Further, special splits are released for the training set which allow for validation using the training data. They are constructed using the same procedure as the official evaluation dataset.</p></div>
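The combination-level split can be sketched as follows. Function and argument names, the defaults, and the seed are ours; the official splits were produced by the authors' own code, and duplicate molecules would first be merged via RDKit canonicalization.

```python
import random

def split_dataset(molecule_to_props, combo_holdout=0.2, train_frac=0.8, seed=0):
    """Withhold a fraction of property *combinations* entirely for
    evaluation, then split the remaining molecules 80/20 into
    train/evaluation (Section 4.3)."""
    rng = random.Random(seed)
    # Deterministically ordered set of distinct property combinations.
    combos = sorted({frozenset(p) for p in molecule_to_props.values()}, key=sorted)
    rng.shuffle(combos)
    held_out = set(combos[: int(len(combos) * combo_holdout)])

    train, evaluation = [], []
    for mol, props in molecule_to_props.items():
        if frozenset(props) in held_out:
            evaluation.append(mol)  # unseen combination: evaluation only
        elif rng.random() < train_frac:
            train.append(mol)
        else:
            evaluation.append(mol)
    return train, evaluation
```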
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation Metrics</head><p>Overall, we adopt the evaluation metrics proposed by <ref type="bibr">Edwards et al. (2022)</ref>. However, we include invalid molecules in the calculations of FTS metrics (setting the score to zero for invalid molecules). We also add a uniqueness metric to the generated molecules for held-out combinations of properties <ref type="bibr">(Polykovskiy et al., 2020)</ref>. Further, we also look at property-specific precision, recall, and F-1 scores. These scores are calculated by matching tokenized names in the generated captions. These scores are further aggregated across specific properties (e.g., inhibitors, X-icides, etc.) and the four broad categories. Aggregations are performed by averaging scores (i.e., macro-F1). We further compute these scores specifically for held-out combinations of properties.</p><p>Table 3: Property-specific molecule captioning results on the validation split of L+M-24. Table 6: Molecule generation results on the subset of held-out combinations from the validation split of L+M-24 (2,107 data points).</p></div>
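The property-matching scores can be sketched as below. This is a simplified reimplementation (case-insensitive substring matching); the official evaluation code in the shared-task repository tokenizes captions and may match differently.

```python
def property_scores(references, predictions, properties):
    """Per-property precision/recall/F-1 from matching property names in
    captions, macro-averaged as described in Section 5."""
    def contains(caption, prop):
        return prop.lower() in caption.lower()

    scores = {}
    for prop in properties:
        tp = fp = fn = 0
        for ref, pred in zip(references, predictions):
            in_ref, in_pred = contains(ref, prop), contains(pred, prop)
            tp += in_ref and in_pred
            fp += (not in_ref) and in_pred
            fn += in_ref and (not in_pred)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[prop] = {"precision": prec, "recall": rec, "f1": f1}
    macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
    return scores, macro_f1
```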
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Benchmarks</head><p>MolT5 models <ref type="bibr">(Edwards et al., 2022)</ref> were finetuned for 20 epochs on the "split_train" data split and evaluated on "split_valid", both of which are available online. Hugging Face's transformers <ref type="bibr">(Wolf et al., 2019)</ref> was used for finetuning with a learning rate of 2e-5 and weight decay of 0.01. A batch size of 128 was used for small and base models, and a batch size of 48 for large models. Further, Meditron-7B <ref type="bibr">(Chen et al., 2023b)</ref> was finetuned for 5 epochs with a context length of 930, a 2e-6 learning rate, and a batch size of 8/16 (molecule/caption generation). Models are released online. Results for captioning are reported in Tables 2, 3, and 4. Tables <ref type="table">5</ref> and <ref type="table">6</ref> show results for molecule generation.</p><p>Overall, the dataset proves to be fairly challenging for these naively finetuned models. On captioning, Meditron-7B achieves a maximum overall F-1 score of 16.81 for property identification (Table <ref type="table">3</ref>). However, it has a much higher precision than recall overall, indicating the model only labels a molecule with a certain property when it has higher confidence. Certain classes of molecules, such as X-icides, are never identified (Table <ref type="table">4</ref>). Other classes, such as toxins or electricity, show emergent behavior as model size scales. Interestingly, the models appear to be fairly capable at linking molecules to certain diseases or cancers. We find that, likely due to poor performance on individual properties, only the largest model succeeds at predicting held-out combinations, and with poor results. Additionally, we find that the Text2Mol metric, as trained on ChEBI-20, shows poor domain transfer to L+M-24.</p><p>The models are able to capture a number of useful properties, such as electroluminescence, diabetes treatment, non-alcoholic fatty liver disease, and emulsifiers. In some cases, the model captures important characteristics about the molecule but uses differing language. This poses a challenge for our evaluation metrics. For example, a molecule identified in the ground truth as an anti tumor agent is identified as a cancer treatment by the model. In particular, the models appear to struggle with rarer properties, which are common in our dataset formulation and in the chemical domain as a whole. They also struggle with identifying molecule-protein interactions (e.g., "monoamine reuptake inhibitor"), although Meditron shows a large performance jump.</p><p>For the molecule generation task, we also find the dataset to be challenging. We show results generated by different models on never-before-seen property combinations in Figures <ref type="figure">2</ref> and <ref type="figure">3</ref>. We believe the difficulty has two causes. First, common property combinations may have structurally very different molecules which exhibit those properties, making evaluation difficult. Second, the models may not grasp rare properties well. Overall, this results in the naively finetuned models producing similar outputs for many different prompts. Further, as expected, performance falls on unseen property combinations, and larger models prove more effective (Table <ref type="table">6</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figures 2 and 3</head><p>Figures 2 and 3: Example outputs on held-out property combinations, comparing Meditron, MolT5-large, MolT5-base, and MolT5-small against the ground truth. Inputs include descriptions such as "The molecule is a luminescent member of the organic light-emitting class." and "The molecule is both a platelet aggregation inhibitor and a cell adhesion inhibitor."; many of the generated molecules are invalid, and the models often repeat near-identical SMILES strings across different prompts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Future Directions</head><p>Overall, this dataset proves to be quite challenging. We find that certain specific properties are particularly challenging for the models. This may be because a model understands these properties but is unwilling to use them in its descriptions due to the training procedure. This limitation may be addressed with more sophisticated decoding algorithms or by better finetuning methods. Future work will also likely benefit from incorporating other modalities, such as proteins, to provide better understanding to the model for some property types. Notably, certain properties display what may be emergent behavior; scaling training data or model size may yield non-linear improvements.</p><p>In this dataset, we focus on composition, abstraction, and function. Future work may also wish to integrate other recent trends: instruction-following and dialogue <ref type="bibr">(Fang et al., 2023;</ref><ref type="bibr">Cao et al., 2023;</ref><ref type="bibr">Zeng et al., 2023;</ref><ref type="bibr">Zhao et al., 2024;</ref><ref type="bibr">Zhang et al., 2024;</ref><ref type="bibr">Yu et al., 2024)</ref>, tool use <ref type="bibr">(Boiko et al., 2023;</ref><ref type="bibr">Bran et al., 2023)</ref>, additional molecule representations (e.g., 3D <ref type="bibr">(Tang et al., 2023)</ref>), additional modalities <ref type="bibr">(Xu et al., 2023)</ref>, or molecule editing <ref type="bibr">(Su et al., 2022)</ref>. Further, we note the need for improved evaluation metrics, especially in the case of molecule generation for function, where there may be many possible outputs. Specific methods for improving compositionality may be another fruitful avenue for research <ref type="bibr">(Yellinek et al., 2023)</ref>. 
It may also be interesting to use molecule-language instruction-following models within larger search frameworks, such as ChemReasoner <ref type="bibr">(Sprueill et al., 2023, 2024)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>In this manuscript, we describe the process for creating the L+M-24 dataset. L+M-24 is designed to focus on three key benefits of natural language in molecule design: compositionality, functionality, and abstraction. It is the featured shared task at the First Language + Molecules Workshop at ACL 2024.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Prompts and examples for GPT4</head><p>&#8226; Prompts: You are an expert in the chemical domain whose task is to create templates to describe the properties of molecules. You will be challenged with a list of different cases. Each case will have a list of **templates**, and a **question**. Each template will describe certain properties. Your goal is to generate a new template in a sentence based on all the previous templates.</p><p>&#8226; </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Additional Dataset Statistics</head><p>Here, we give a brief description of properties in the dataset. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The dataset, finetuned baseline, and evaluation code are released publicly at github.com/language-plus-molecules/LPM-24-Dataset through HuggingFace.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>To convert these properties to natural language, we follow a template-based procedure using GPT-4 (OpenAI, 2023) generated compositional templates.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_3"><p>Obtained via personal communication.</p></note>
		</body>
		</text>
</TEI>
