<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Fine-Grained Named Entity Recognition with Distant Supervision in COVID-19 Literature</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>12/16/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10279808</idno>
					<idno type="doi">10.1109/BIBM49941.2020.9313126</idno>
					<title level='j'>BIBM'20, IEEE Int. Conf. on Bioinformatics and Biomedicine, Dec 2020</title>
<idno></idno>
<biblScope unit="volume">2020</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Xuan Wang</author><author>Xiangchen Song</author><author>Bangzheng Li</author><author>Kang Zhou</author><author>Qi Li</author><author>Jiawei Han</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Biomedical named entity recognition (BioNER) is a fundamental step for mining COVID-19 literature. Existing BioNER datasets cover a few common coarse-grained entity types (e.g., genes, chemicals, and diseases), which cannot be used to recognize highly domain-specific entity types (e.g., animal models of diseases) or emerging ones (e.g., coronaviruses) for COVID-19 studies. We present CORD-NER, a fine-grained named entity recognized dataset of COVID-19 literature (up until May 19, 2020). CORD-NER contains over 12 million sentences annotated via distant supervision. Also included in CORD-NER are 2,000 manually-curated sentences as a test set for performance evaluation. CORD-NER covers 75 fine-grained entity types. In addition to the common biomedical entity types, it covers new entity types specifically related to COVID-19 studies, such as coronaviruses, viral proteins, evolution, and immune responses. The dictionaries of these fine-grained entity types are collected from existing knowledge bases and human-input seed sets. We further present DISTNER, a distantly supervised NER model that relies on a massive unlabeled corpus and a collection of dictionaries to annotate the COVID-19 corpus. DISTNER provides a benchmark performance on the CORD-NER test set for future research.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>COVID-19 is an infectious disease that was first identified in December 2019 and has since spread globally, resulting in the 2019-2020 coronavirus pandemic. Scholarly literature about COVID-19, SARS-CoV-2, and the coronavirus group has been pouring into the COVID-19 Open Research Dataset (CORD-19) <ref type="bibr">[9]</ref> in just the past few months. It is critical to automatically extract the most relevant and accurate information from this large-scale and fast-growing COVID-19 literature corpus to facilitate COVID-19 studies.</p><p>Biomedical named entity recognition (BioNER) is a fundamental step for mining COVID-19 literature. Existing BioNER datasets (e.g., BC5CDR <ref type="bibr">[10]</ref>, JNLPBA <ref type="bibr">[2]</ref>, and BIONLP13CG <ref type="bibr">[6]</ref>) cover a few common coarse-grained entity types (e.g., genes, chemicals, and diseases), which cannot be used to recognize highly domain-specific entity types (e.g., animal models of diseases) or emerging ones (e.g., coronaviruses) for COVID-19 studies.</p><p>We present CORD-NER, a fine-grained named entity recognized dataset of COVID-19 literature (up until May 19, 2020). CORD-NER contains over 12 million sentences annotated via distant supervision. Also included in CORD-NER are 2,000 manually-curated sentences as a test set for performance evaluation. CORD-NER covers 75 fine-grained entity types. In addition to the common biomedical entity types, it covers new entity types specifically related to COVID-19 studies, such as coronaviruses, viral proteins, evolution, and immune responses. These fine-grained entity types are highly related to research on COVID-19-related viruses, spreading mechanisms, and potential vaccines. 
The dictionaries of these fine-grained entity types are collected from existing knowledge bases and human-input seed sets.</p><p>We further present DISTNER, a distantly supervised NER model that relies on the massive unlabeled corpus and dictionaries to annotate the COVID-19 corpus. DISTNER achieves high performance with dictionaries of different scales (from dozens to thousands of entities). It leverages a dictionary-guided representation learning model to expand the small dictionaries and further incorporates the newly learned word embeddings into the training of the neural NER model. DISTNER automatically annotates the COVID-19 corpus with high quality and provides a benchmark performance on the CORD-NER test set for future research. Because it is built on the DISTNER model, CORD-NER can be extended with new documents and, by supplying dozens of seed entities as input examples, with new entity types when needed. CORD-NER can support downstream NLP applications on COVID-19 literature, such as relation extraction, knowledge graph construction, and information retrieval.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. CORD-NER DATASET</head><p>In this section, we first introduce how we collected the input corpus and the fine-grained entity type dictionaries for CORD-NER. Then we introduce DISTNER, the distantly supervised NER model used to annotate the input corpus.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Corpus</head><p>The input corpus is generated from the CORD-19 dataset (up until May 19, 2020). We first combined the title and abstract of each paper in the meta-data file with their corresponding full-text from all the data sources (CZI, PMC, bioRxiv, and medRxiv) in CORD-19. This input corpus contains 12,698,615 sentences from 128,492 documents. Then we conducted automatic phrase mining and tokenization on the input corpus using AutoPhrase <ref type="bibr">[7]</ref>. This tokenized corpus is used for further NER annotations. We observed that incorporating the AutoPhrase tokenization results can improve the distantly supervised NER performance as it provides additional information for entity boundary detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Fine-Grained Entity Type Dictionaries</head><p>For each entity type to be annotated, we collect a dictionary containing a list of entities belonging to that type.</p><p>Existing Knowledge Bases. We use the UMLS<ref type="foot">foot_1</ref> knowledge base to collect the large-scale dictionaries. We collect the latest version of UMLS (the year 2020), which contains 127 fine-grained entity types. We further merged some fine-grained types into their more coarse-grained parent types according to the corpus counts and suggestions from domain experts. This results in 48 fine-grained UMLS types used for our entity annotation. Each UMLS type includes thousands of entities as the input dictionary.</p><p>Human-Input Seed Sets. In addition to the types in UMLS, biomedical scientists and medical doctors are interested in some additional entity types specifically related to COVID-19 studies. These types are either too new or too specific to have been incorporated into the UMLS knowledge base. We included nine new types (coronaviruses, viral proteins, livestock, wildlife, evolution, physical science, substrates, materials, and immune responses) defined by the scientists and doctors. For each new type, the scientists and doctors provide 20 seed entities as the input dictionary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Distantly Supervised NER Model</head><p>Based on the entity type dictionaries we collected at different scales (from dozens to thousands of entities), we propose DISTNER, a distantly supervised NER model that can automatically annotate the CORD-19 corpus.</p><p>Dictionary-Guided Representation Learning. The first step of DISTNER is dictionary-guided embedding learning. It takes the input dictionaries (Section II-B) as weak supervision and jointly embeds the entities, types, and words into a shared space. The entities and types come from the input dictionaries (Section II-B), and the words come from the input corpus (Section II-A). Note that the words here also include the phrases that we previously discovered during corpus tokenization.</p><p>To make the words form discrete clusters around the types, we learn the joint embedding of entities, types, and words by satisfying two criteria: Coherence and Discriminativeness. Coherence means that the entities should have embeddings that are close to their corresponding types' embeddings. Discriminativeness means that the embeddings of different types should be far apart from each other. Inspired by CatE <ref type="bibr">[3]</ref>, a category-guided embedding learning method, we first formulate a joint type and text generative process under the guidance of the input dictionary. Then we cast the learning of the generative process as a dictionary-guided embedding learning model.</p><p>The input to our dictionary-guided embedding learning model consists of two parts: (1) a set of dictionaries $\{D_t\}$, where each dictionary $D_t = \{e_1, e_2, \ldots, e_{|D_t|}\}$ contains entities $e$ for the type $t \in T$, and (2) a text corpus containing sentences $s = [w_1, w_2, \ldots, w_{|s|}]$, where each sentence consists of words and of dictionary entities matched in that sentence. 
For ease of notation, we use $w$ to denote both words and entities in the sentence.</p><p>We assume a joint type and text generative process in two steps: (1) each type $t$ is generated conditioned on the semantics of the entities $e$ in the dictionary $D_t$; and (2) the surrounding words and entities $C(w_i, h)$ of a word/entity $w_i$ in a sentence $s$ are generated conditioned on the semantics of the center word/entity $w_i$, where $C(w_i, h) = \{w_j : i - h \le j \le i + h,\ j \ne i\}$ and $h$ is the context window size. Putting the two steps together, the likelihood of the joint type and text generative process is: <formula notation="TeX" n="1">J = \prod_{t \in T} \prod_{e \in D_t} p(t \mid e) \cdot \prod_{s} \prod_{w_i \in s} p(C(w_i, h) \mid w_i)</formula></p><p>The first part $\prod_{t \in T} \prod_{e \in D_t} p(t \mid e)$ of the likelihood $J$ is the probability of observing all the types (e.g., "Coronavirus") given the entities (e.g., "SARS" and "MERS") in our input dictionaries. The second part $\prod_{s} \prod_{w_i \in s} p(C(w_i, h) \mid w_i)$ of the likelihood $J$ is the probability of observing the input corpus.</p><p>Then we formulate the optimization of the objective in Eq. (<ref type="formula">1</ref>) as an embedding learning problem. Similar to <ref type="bibr">[4]</ref>, we define the two conditional probabilities in Eq. (<ref type="formula">1</ref>) via log-linear models in the embedding space: <formula notation="TeX" n="2">p(t \mid e) = \frac{\exp(\mathbf{t}^\top \mathbf{e})}{\sum_{t' \in T} \exp(\mathbf{t}'^\top \mathbf{e})}</formula> <formula notation="TeX" n="3">p(C(w_i, h) \mid w_i) = \prod_{w_j \in C(w_i, h)} p(w_j \mid w_i), \qquad p(w_j \mid w_i) = \frac{\exp(\mathbf{w}_j^\top \mathbf{w}_i)}{\sum_{w'} \exp(\mathbf{w}'^\top \mathbf{w}_i)}</formula></p><p>where $\mathbf{t}$ is the embedding vector of the type $t$; $\mathbf{e}$ is the embedding vector of the entity $e$; $\mathbf{w}$ is the embedding vector of the word or entity $w$; and the sum over $w'$ runs over the vocabulary.</p><p>Eqs. (<ref type="formula">2</ref>) and (<ref type="formula">3</ref>) can be directly plugged into Eq. (<ref type="formula">1</ref>) to train the joint type and text embeddings. This enforces the first criterion, Coherence, through Eq. (<ref type="formula">2</ref>). We next show how to satisfy the second criterion, Discriminativeness.</p><p>Let $p_e = [p(t_1 \mid e), \ldots, p(t_{|T|} \mid e)]$ be the probability distribution of $e$ over all types. To satisfy the Discriminativeness criterion, if an entity $e$ is known to belong to type $t$, $p_e$ computed from Eq. 
(<ref type="formula">2</ref>) should become a one-hot vector $l_e$ (i.e., the type of $e$) with $p(t \mid e) = 1$. To achieve this property, we minimize the KL divergence from each seed entity's distribution $p(\cdot \mid e)$ to its corresponding discrete delta distribution $l_e$. Formally, given a dictionary of seed entities $D_t$ for type $t$, the first term in Eq. (<ref type="formula">1</ref>) is implemented as: <formula notation="TeX" n="4">\mathcal{L}_{\text{dict}} = \sum_{t \in T} \sum_{e \in D_t} \mathrm{KL}(l_e \,\|\, p_e) = -\sum_{t \in T} \sum_{e \in D_t} \log p(t \mid e)</formula></p><p>From the embedding learning perspective, Eq. (<ref type="formula">4</ref>) is equivalent to a cross-entropy regularization loss, encouraging the type embeddings to become discriminative in the embedding space and far apart from each other.</p><p>Finally, based on the newly learned representations of the types and words, we expand each type's dictionary with the words that have high embedding cosine similarity (&#8805; 0.5) with its type embedding. Note that the words here also include the phrases that we previously discovered with AutoPhrase during corpus tokenization. We further incorporate the newly learned word embeddings into the NER neural model training.</p><p>NER Neural Model. We adopt the AutoNER <ref type="bibr">[8]</ref> neural model as the benchmark distant NER model on the CORD-19 corpus. The neural model learning is divided into two steps: entity span detection and entity typing.</p><p>For entity span detection, a binary classifier is built to determine whether the connection between two adjacent tokens should be labeled as Break or Tie. A BiLSTM layer encodes the character and word embeddings (learned in the Dictionary-Guided Representation Learning step) to predict whether the connection $y_i$ between tokens $w_{i-1}$ and $w_i$ is Break. The BiLSTM outputs are then concatenated into one vector $\mathbf{u}_i$ and fed into a sigmoid layer: <formula notation="TeX" n="5">p(y_i = \text{Break} \mid \mathbf{u}_i) = \sigma(\mathbf{w}^\top \mathbf{u}_i)</formula></p><p>where $y_i$ is the label between the $i$-th token and its previous token, $\sigma$ is the sigmoid function, and $\mathbf{w}$ is the sigmoid layer's parameter. 
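The loss function of entity span detection is: <formula notation="TeX" n="6">\mathcal{L}_{\text{span}} = \sum_{i} l(y_i, p(y_i = \text{Break} \mid \mathbf{u}_i))</formula></p><p>where $l(\cdot, \cdot)$ is the logistic loss.</p><p>After the entity boundaries are determined, each candidate entity span (the tokens between two adjacent Breaks) is represented with a new vector $\mathbf{v}_j$ and fed into a softmax layer to determine its entity type: <formula notation="TeX" n="7">p(t_j \mid \mathbf{v}_j) = \frac{\exp(\mathbf{t}_j^\top \mathbf{v}_j)}{\sum_{t_k \in \bar{T}} \exp(\mathbf{t}_k^\top \mathbf{v}_j)}</formula></p><p>where $t_j$ is the label of candidate entity span $j$ and $\bar{T} = T \cup \{\text{None}\}$. The loss function of entity type prediction is: <formula notation="TeX" n="8">\mathcal{L}_{\text{type}} = \sum_{j} H(\hat{p}(\cdot \mid \mathbf{v}_j, \bar{T}), p(\cdot \mid \mathbf{v}_j))</formula></p><p>where $H(\cdot, \cdot)$ is the cross-entropy function and $\hat{p}(\cdot \mid \mathbf{v}_j, \bar{T})$ is the supervision distribution.</p></div>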
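<div xmlns="http://www.tei-c.org/ns/1.0"><p>Two mechanical steps of Section II-C, dictionary expansion by cosine similarity and the grouping of tokens into candidate spans from Break/Tie decisions, can be illustrated with a minimal Python sketch. It is not the DISTNER implementation: the function names (cosine, expand_dictionaries, spans_from_breaks), the two-dimensional toy embeddings, and the type labels are hypothetical and chosen only for illustration. Only the 0.5 similarity threshold and the Break/Tie scheme are taken from the text.</p>
<p><code lang="python"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_dictionaries(type_emb, word_emb, dictionaries, threshold=0.5):
    """Add to each type's dictionary every word (or phrase) whose embedding has
    cosine similarity of at least `threshold` with that type's embedding."""
    expanded = {t: set(entries) for t, entries in dictionaries.items()}
    for t, t_vec in type_emb.items():
        for word, w_vec in word_emb.items():
            if cosine(t_vec, w_vec) >= threshold:
                expanded.setdefault(t, set()).add(word)
    return expanded


def spans_from_breaks(tokens, breaks):
    """Group tokens into candidate spans.

    breaks[i] is True when the connection between tokens[i] and tokens[i + 1]
    is predicted as Break, and False when it is predicted as Tie.
    """
    if not tokens:
        return []
    spans, current = [], [tokens[0]]
    for token, is_break in zip(tokens[1:], breaks):
        if is_break:
            spans.append(current)
            current = [token]
        else:
            current.append(token)
    spans.append(current)
    return spans


# Hypothetical toy data: 2-d embeddings and tiny seed dictionaries.
type_emb = {"Coronavirus": [1.0, 0.0], "ImmuneResponse": [0.0, 1.0]}
word_emb = {"sars-cov-2": [0.9, 0.1], "antibody": [0.1, 0.9], "the": [-0.5, -0.5]}
dictionaries = {"Coronavirus": {"sars", "mers"}, "ImmuneResponse": {"immune response"}}

if __name__ == "__main__":
    expanded = expand_dictionaries(type_emb, word_emb, dictionaries)
    for entity_type in sorted(expanded):
        print(entity_type, sorted(expanded[entity_type]))
    print(spans_from_breaks(["the", "sars", "coronavirus", "spreads"], [True, False, True]))
```
]]></code></p>
<p>In this toy setting, "sars-cov-2" joins the Coronavirus dictionary and "antibody" joins the ImmuneResponse dictionary, while "the" is rejected by both types. Each candidate span returned by spans_from_breaks would then be passed to the entity typing step, which may still assign it the None type.</p></div>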
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. EVALUATION</head><p>Experimental Setup. Given the input corpus and the expanded dictionaries, we first conduct exact string matching <ref type="bibr">[8]</ref> on a subset of 3,000,000 sentences to generate a distantly labeled training corpus. Conflicting matches are resolved by maximizing the total number of matched tokens in each sentence. We split the distantly labeled training corpus 9:1 into training and development sets. We randomly selected another 2,000 sentences from our input corpus and asked domain experts to annotate them manually. We use this manually annotated test set to compare the performance of different BioNER models on the CORD-19 corpus. We compare DISTNER with AutoNER <ref type="bibr">[8]</ref>, the benchmark method for distantly supervised BioNER. We also compare DISTNER with pretrained supervised BioNER models, such as SciSpacy <ref type="bibr">[5]</ref>, a commercial supervised BioNER tool, and SciBERT <ref type="bibr">[1]</ref>, a benchmark method for supervised BioNER. We report the precision, recall, and F1 scores<ref type="foot">foot_2</ref> of each method on our human-annotated test set.</p><p>Test Set Annotation. Three domain experts annotated each sentence. Because of the large number of fine-grained entity types, we annotated only 7 of the 75 types in this test set, resulting in 2,000 annotated sentences. The seven types are genes, chemicals, diseases, signs or symptoms, coronaviruses, evolution, and immune responses. Each pair of annotators reaches substantial agreement, with a Fleiss's &#954; of 0.72.</p><p>Parameters. We used PyTorch for model implementations. For the baseline model AutoNER, we use 200-dimensional word embeddings<ref type="foot">foot_3</ref> trained on the entire PubMed database of abstracts and full-text articles together with the Wikipedia corpus. 
For DISTNER, we use the dictionary-guided word embeddings learned with our dictionary-guided representation learning model. The DISTNER neural model parameter settings are the same as for AutoNER. The character embedding dimension is 30, and the hidden state size for both the character-level BiLSTM and the word-level BiLSTM is 300. The optimization method is gradient descent with momentum. The batch size and the momentum are set to 10 and 0.9, respectively. The learning rate is set to 0.05. The dropout ratio is set to 0.5. For better stability, a gradient clipping of 5.0 is used.</p><p>Results. Table <ref type="table">I</ref> shows the performance comparison of DISTNER and AutoNER, the benchmark distantly supervised BioNER model. We use the original implementation of AutoNER <ref type="foot">4</ref> and train the model on our distantly labeled training corpus. Then we evaluate the performance of DISTNER and AutoNER on our test set. DISTNER outperforms AutoNER by a large margin on the F1 scores. The performance gain is larger when the input dictionary is small (e.g., the dictionaries of 20 seed entities used for types such as coronavirus, evolution, and immune response).</p><p>[...] the original input dictionary. We see that both the dictionary expansion and the dictionary-guided word embeddings help improve the DISTNER performance compared to AutoNER. The dictionary-guided word embeddings (DISTNER w/o Exp) bring a larger performance improvement than dictionary expansion (DISTNER w/o Emb). The dictionary-guided word embeddings (DISTNER w/o Exp) improve both the precision and the recall significantly, while the expanded dictionary (DISTNER w/o Emb) increases recall but decreases precision compared to AutoNER.</p><p>Table <ref type="table">II</ref> shows the performance comparison between DISTNER and the fully supervised BioNER models, SciSpacy and SciBERT. 
For SciSpacy, we use its published pre-trained models<ref type="foot">foot_5</ref> on both BIONLP13CG <ref type="bibr">[6]</ref> and BC5CDR <ref type="bibr">[10]</ref>. For SciBERT, since it does not release its pre-trained models, we use its SciBERT embeddings <ref type="foot">6</ref> and retrain the model on BC5CDR. Then we conduct prediction and evaluation on our test set. DISTNER shows better performance on chemical and disease prediction than both SciSpacy and SciBERT because of its higher recall. DISTNER also shows better performance on gene prediction than SciSpacy trained on BIONLP13CG. We observe that SciSpacy tends to predict most coronaviruses as genes, leading to a very low precision.</p></div>
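<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distant labeling step of the experimental setup (exact string matching against the dictionaries, with conflicting matches resolved by maximizing the number of matched tokens in a sentence) can be sketched as a small dynamic program. This is an illustrative sketch under stated assumptions rather than the procedure actually used to build the training corpus: the function name distant_label, the max_len cap on entity length, the lowercasing of tokens, and the type labels in the example are all hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
def distant_label(tokens, dictionary, max_len=5):
    """Label a tokenized sentence by exact dictionary matching.

    `dictionary` maps a tuple of lowercased tokens to an entity type.
    Overlapping matches are resolved by selecting the set of non-overlapping
    matches that covers the largest total number of tokens.
    Returns a list of (start, end, type) spans with `end` exclusive.
    """
    n = len(tokens)
    lowered = [t.lower() for t in tokens]
    best = [0] * (n + 1)       # best[i]: max matched tokens within tokens[:i]
    choice = [None] * (n + 1)  # match ending at i that achieves best[i], if any
    for end in range(1, n + 1):
        best[end] = best[end - 1]  # default: token end-1 stays unmatched
        for length in range(1, min(max_len, end) + 1):
            start = end - length
            etype = dictionary.get(tuple(lowered[start:end]))
            if etype is not None and best[start] + length > best[end]:
                best[end] = best[start] + length
                choice[end] = (start, end, etype)
    # Backtrack from the end of the sentence to recover the selected spans.
    spans, i = [], n
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            spans.append(choice[i])
            i = choice[i][0]
    spans.reverse()
    return spans


# Hypothetical toy dictionary and sentence.
toy_dictionary = {
    ("sars",): "CORONAVIRUS",
    ("sars", "coronavirus"): "CORONAVIRUS",
    ("coronavirus", "infection"): "DISEASE",
    ("fever",): "SIGN_OR_SYMPTOM",
}
toy_tokens = ["SARS", "coronavirus", "infection", "causes", "fever"]

if __name__ == "__main__":
    print(distant_label(toy_tokens, toy_dictionary))
```
]]></code></p>
<p>In the toy sentence, the match "SARS coronavirus" overlaps with "coronavirus infection". Selecting "SARS" and "coronavirus infection" covers three tokens instead of two, so that pair is kept, which is the behavior the token-maximizing rule is meant to produce.</p></div>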
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONCLUSION</head><p>We present CORD-NER, a fine-grained named entity recognized dataset of COVID-19 literature (up until May 19, 2020). CORD-NER contains over 12 million sentences annotated via distant supervision. Also included in CORD-NER are 2,000 manually-curated sentences as a test set for performance evaluation. We further present DISTNER, a distantly supervised NER model that is used to annotate the COVID-19 corpus. DISTNER provides a benchmark performance on the CORD-NER test set for future research. CORD-NER can help other downstream NLP tasks for COVID-19 studies, such as relation extraction, knowledge graph construction, and information retrieval.</p></div>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>https://www.nlm.nih.gov/research/umls/META3_current_semantic_types.html</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_2"><p>https://github.com/chakki-works/seqeval</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>http://bio.nlplab.org/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_4"><p>https://github.com/shangjingbo1226/AutoNER</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5"><p>https://allenai.github.io/scispacy/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6"><p>https://github.com/allenai/scibert</p></note>
		</body>
		</text>
</TEI>
