skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.

This content will become publicly available on July 10, 2024

Title: ReactIE: Enhancing Chemical Reaction Extraction with Weak Supervision
Structured chemical reaction information plays a vital role for chemists engaged in laboratory work and advanced endeavors such as computer-aided drug design. Despite the importance of extracting structured reactions from scientific literature, data annotation for this purpose is cost-prohibitive due to the significant labor required from domain experts. Consequently, the scarcity of sufficient training data poses an obstacle to the progress of related models in this domain. In this paper, we propose REACTIE, which combines two weakly supervised approaches for pre-training. Our method utilizes frequent patterns within the text as linguistic cues to identify specific characteristics of chemical reactions. Additionally, we adopt synthetic data from patent records as distant supervision to incorporate domain knowledge into the model. Experiments demonstrate that REACTIE achieves substantial improvements and outperforms all existing baselines.  more » « less
Award ID(s):
1956151 1741317 1704532
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
12120 to 12130
Subject(s) / Keyword(s):
["ReactIE, information extraction, weakly supervised approach, machine learning, text mining, scientific literature"]
Medium: X
Toronto, Canada
Sponsoring Org:
National Science Foundation
More Like this
  1. Scientific literature analysis needs fine-grained named entity recognition (NER) to provide a wide range of information for scientific discovery. For example, chemistry research needs to study dozens to hundreds of distinct, fine-grained entity types, making consistent and accurate annotation difficult even for crowds of domain experts. On the other hand, domain-specific ontologies and knowledge bases (KBs) can be easily accessed, constructed, or integrated, which makes distant supervision realistic for fine-grained chemistry NER. In distant supervision, training labels are generated by matching mentions in a document with the concepts in the knowledge bases (KBs). However, this kind of KB-matching suffers from two major challenges: incomplete annotation and noisy annotation. We propose ChemNER, an ontology-guided, distantly-supervised method for fine-grained chemistry NER to tackle these challenges. It leverages the chemistry type ontology structure to generate distant labels with novel methods of flexible KB-matching and ontology-guided multi-type disambiguation. It significantly improves the distant label generation for the subsequent sequence labeling model training. We also provide an expert-labeled, chemistry NER dataset with 62 fine-grained chemistry types (e.g., chemical compounds and chemical reactions). Experimental results show that ChemNER is highly effective, outperforming substantially the state-of-the-art NER methods (with .25 absolute F1 score improvement). 
    more » « less
  2. Machine learning on graph structured data has attracted much research interest due to its ubiquity in real world data. However, how to efficiently represent graph data in a general way is still an open problem. Traditional methods use handcraft graph features in a tabular form but suffer from the defects of domain expertise requirement and information loss. Graph representation learning overcomes these defects by automatically learning the continuous representations from graph structures, but they require abundant training labels, which are often hard to fulfill for graph-level prediction problems. In this work, we demonstrate that, if available, the domain expertise used for designing handcraft graph features can improve the graph-level representation learning when training labels are scarce. Specifically, we proposed a multi-task knowledge distillation method. By incorporating network-theory-based graph metrics as auxiliary tasks, we show on both synthetic and real datasets that the proposed multi-task learning method can improve the prediction performance of the original learning task, especially when the training data size is small. 
    more » « less
  3. Specialized domain knowledge is often necessary to ac- curately annotate training sets for in-depth analysis, but can be burdensome and time-consuming to acquire from do- main experts. This issue arises prominently in automated behavior analysis, in which agent movements or actions of interest are detected from video tracking data. To reduce annotation effort, we present TREBA: a method to learn annotation-sample efficient trajectory embedding for be- havior analysis, based on multi-task self-supervised learn- ing. The tasks in our method can be efficiently engineered by domain experts through a process we call “task program- ming”, which uses programs to explicitly encode structured knowledge from domain experts. Total domain expert effort can be reduced by exchanging data annotation time for the construction of a small number of programmed tasks. We evaluate this trade-off using data from behavioral neuro- science, in which specialized domain knowledge is used to identify behaviors. We present experimental results in three datasets across two domains: mice and fruit flies. Using embeddings from TREBA, we reduce annotation burden by up to a factor of 10 without compromising accuracy com- pared to state-of-the-art features. Our results thus suggest that task programming and self-supervision can be an ef- fective way to reduce annotation effort for domain experts. 
    more » « less
  4. Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TABMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TABMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TABMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. The unstable issue is more severe when handling complex problems like TABMWP. To mitigate this, we further propose a novel approach, PROMPTPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in selecting in-context examples. 
    more » « less
  5. null (Ed.)
    Photochemical reactions are widely used by academic and industrial researchers to construct complex molecular architectures via mechanisms that often require harsh reaction conditions. Photodynamics simulations provide time-resolved snapshots of molecular excited-state structures required to understand and predict reactivities and chemoselectivities. Molecular excited-states are often nearly degenerate and require computationally intensive multiconfigurational quantum mechanical methods, especially at conical intersections. Non-adiabatic molecular dynamics require thousands of these computations per trajectory, which limits simulations to ∼1 picosecond for most organic photochemical reactions. Westermayr et al. recently introduced a neural-network-based method to accelerate the predictions of electronic properties and pushed the simulation limit to 1 ns for the model system, methylenimmonium cation (CH 2 NH 2 + ). We have adapted this methodology to develop the Python-based, Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics (PyRAI 2 MD) software for the cis – trans isomerization of trans -hexafluoro-2-butene and the 4π-electrocyclic ring-closing of a norbornyl hexacyclodiene. We performed a 10 ns simulation for trans -hexafluoro-2-butene in just 2 days. The same simulation would take approximately 58 years with traditional multiconfigurational photodynamics simulations. We generated training data by combining Wigner sampling, geometrical interpolations, and short-time quantum chemical trajectories to adaptively sample sparse data regions along reaction coordinates. The final data set of the cis – trans isomerization and the 4π-electrocyclic ring-closing model has 6207 and 6267 data points, respectively. The training errors in energy using feedforward neural networks achieved chemical accuracy (0.023–0.032 eV). The neural network photodynamics simulations of trans -hexafluoro-2-butene agree with the quantum chemical calculations showing the formation of the cis -product and reactive carbene intermediate. The neural network trajectories of the norbornyl cyclohexadiene corroborate the low-yielding syn -product, which was absent in the quantum chemical trajectories, and revealed subsequent thermal reactions in 1 ns. 
    more » « less