<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Clinically relevant pretraining is all you need</title></titleStmt>
			<publicationStmt>
				<publisher>Oxford Academic</publisher>
				<date>06/21/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10513353</idno>
					<idno type="doi">10.1093/jamia/ocab086</idno>
					<title level='j'>Journal of the American Medical Informatics Association</title>
<idno type="issn">1527-974X</idno>
<biblScope unit="volume">28</biblScope>
<biblScope unit="issue">9</biblScope>					

					<author>Oliver J Bear Don’t Walk IV</author><author>Tony Sun</author><author>Adler Perotte</author><author>Noémie Elhadad</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>Clinical notes present a wealth of information for applications in the clinical domain, but heterogeneity across clinical institutions and settings presents challenges for their processing. The clinical natural language processing field has made strides in overcoming domain heterogeneity, while pretrained deep learning models present opportunities to transfer knowledge from one task to another. Pretrained models have performed well when transferred to new tasks; however, it is not well understood if these models generalize across differences in institutions and settings within the clinical domain. We explore if institution or setting specific pretraining is necessary for pretrained models to perform well when transferred to new tasks. We find no significant performance difference between models pretrained across institutions and settings, indicating that clinically pretrained models transfer well across such boundaries. Given a clinically pretrained model, clinical natural language processing researchers may forgo the time-consuming pretraining step without a significant performance drop.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>The electronic health record (EHR) contains a wealth of rich, unstructured patient health data, such as clinical text. Natural language processing (NLP) techniques allow for clinical text to be leveraged in a multitude of scenarios, such as information extraction, <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref> understanding clinical workflow, <ref type="bibr">3,</ref><ref type="bibr">4</ref> decision support, <ref type="bibr">5</ref> and question answering. <ref type="bibr">6</ref> NLP models often suffer reduced performance when applied across institutions <ref type="bibr">7,</ref><ref type="bibr">8</ref> or specialties. <ref type="bibr">9</ref> Drops in performance are due in part to differences in vocabulary, content, and style that manifest along axes such as syntax, <ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref> semantics, <ref type="bibr">13,</ref><ref type="bibr">14</ref> and workflow procedures. <ref type="bibr">7</ref> Historically, transferred NLP models overcome clinical institution differences by retraining models from scratch <ref type="bibr">7</ref> or using domain adaptation techniques <ref type="bibr">15</ref> for the downstream task of interest.</p><p>Recently, pretraining has led to robust methods for creating generalizable models that can be transferred to downstream tasks across genres and domains. <ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref> In contrast to traditional approaches, in which a new supervised task is learned from scratch on a training set, a pretrained model can leverage parameters that have already been trained on a different (simpler and often self-supervised) task. The intuition for this approach is that some of these parameters can generalize to the new task. 
In this way, previous experiences can be built upon. Furthermore, pretraining is inspired by the idea that certain features are learned across multiple tasks. For example, a model might learn how certain words relate to one another regardless of the task. <ref type="bibr">21</ref> The fact that pretraining learns parameters that would otherwise need to be relearned by a new task makes pretraining especially useful when labeling data is time-consuming and expensive.</p><p>In traditional training approaches, training occurs in a single phase in which a model is initialized and trained on a task. This approach does not leverage shared information between tasks. In contrast, pretraining and transferring a model incurs a one-time cost of pretraining a model to extract generalizable features and transferring this same model to multiple tasks. Instead of relearning all parameters, pretrained parameters are updated for the specific task at hand. A visual depiction of the difference between approaches can be seen in Figure <ref type="figure">1</ref>.</p><p>While pretraining has been commonly used to learn low-level parameters like word embeddings, <ref type="bibr">[22]</ref><ref type="bibr">[23]</ref><ref type="bibr">[24]</ref> recent advances have shown that pretraining is a powerful approach to learning higher-level, generalizable linguistic representations. <ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref><ref type="bibr">[20]</ref> Many kinds of approaches to pretraining have been tried, but language modeling tasks have proven to be generalizable to many other NLP tasks. <ref type="bibr">25</ref> Specifically, masked language modeling, in which a model learns to identify masked words given the surrounding context, has been a viable pretraining task. 
<ref type="bibr">16</ref> Leveraging language modeling tasks, like masked language modeling, during pretraining is especially useful, as there are large amounts of text data to be leveraged that do not require any labeling. A more detailed explanation of language modeling can be found in Supplementary Appendix A. Through language modeling-based pretraining, unlabeled data can be used to improve performance on a new task instead of the potentially costly step of labeling more task-specific data.</p><p>For a variety of pretrained models, pretraining was initially performed using Wikipedia, <ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">19,</ref><ref type="bibr">20</ref> the Book Corpus, <ref type="bibr">16,</ref><ref type="bibr">18,</ref><ref type="bibr">20</ref> the One Billion Word Benchmark, <ref type="bibr">19</ref> news articles, <ref type="bibr">20</ref> and Web snippets. <ref type="bibr">19,</ref><ref type="bibr">20</ref> These corpora represent a more general domain and contain a wide variety of topics without specializing in any single topic. Pretraining on such corpora has worked well for many tasks in the general domain; however, the clinical domain contains specializations in language, abbreviations, grammar, and semantics that are not encountered in the general domain, leaving room for domain-specific pretraining. This has led to clinical domain variants of pretrained models, <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref> which have outperformed their general domain counterparts on a variety of clinical NLP tasks such as readmission prediction, <ref type="bibr">27</ref> named entity recognition, <ref type="bibr">26,</ref><ref type="bibr">28</ref> reason for visit extraction, <ref type="bibr">29</ref> natural language entailment, <ref type="bibr">26,</ref><ref type="bibr">28</ref> and medication extraction. 
<ref type="bibr">30</ref> Beyond differences between domains, heterogeneity within the clinical domain such as geography, clinical setting, patient population, and de-identification status manifests along multiple axes such as syntax, <ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref> semantics, <ref type="bibr">13,</ref><ref type="bibr">14</ref> and workflow procedures. <ref type="bibr">7</ref> It is generally accepted that NLP model performance may degrade when a model is evaluated on data with a different distribution than the data it was trained on, and this degradation is nontrivial to address. <ref type="bibr">7,</ref><ref type="bibr">15</ref> Many pretrained models in the clinical domain available for download are pretrained using the Medical Information Mart for Intensive Care-III (MIMIC-III) dataset. <ref type="bibr">31</ref> Given the differences between clinical institutions and settings, we ask the following questions. Is a single round of clinically relevant pretraining sufficient to generalize across multiple clinical institutions and settings? Furthermore, can institution- or setting-specific pretraining improve downstream task performance over pretraining at a different institution or setting? These questions are relevant for clinical NLP researchers looking to apply clinically relevant pretrained models to their own data, for whom pretraining their own model might be prohibitively expensive. Using a meticulous experimental design, we explore whether institutional differences impact performance on downstream tasks when pretraining at the same or a different institution as the downstream task. Our results indicate that institution- or setting-specific pretraining does not meaningfully improve performance and that clinically relevant pretraining is all you need.</p><p>Figure <ref type="figure">1</ref>. An overview of the pretraining and transfer phases vs a traditional training approach. The traditional training approach initializes a new model for each task without sharing knowledge between tasks. In contrast, during the pretraining phase a model learns parameters that can generalize to other natural language processing tasks by learning a pretraining task. Pretraining datasets can be large, allowing tasks with smaller datasets to take advantage of the "warm start" provided through pretraining. Pretraining is a one-time cost, allowing a pretrained model to be transferred to multiple new tasks. During the transfer phase, the pretrained model is updated to perform a new task and can achieve better performance with less data than if the model were randomly initialized.</p></div>
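The masked language modeling task discussed above can be illustrated with a short sketch: unlabeled text is turned into supervised training pairs by hiding tokens and recording the originals as labels. This is an illustrative simplification, not the authors' code; the full BERT procedure also replaces some selected tokens with random ones or leaves them unchanged.

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=0):
    """Turn an unlabeled token sequence into a masked-language-modeling
    example: inputs with some tokens hidden, plus labels recording the
    original token at each masked position (None elsewhere)."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")   # the model must recover this token
            labels.append(tok)
        else:
            inputs.append(tok)        # visible context
            labels.append(None)
    return inputs, labels

# Toy sentence (hypothetical); every EHR note can be converted this way
# without any manual labeling.
inputs, labels = make_mlm_example(
    "the patient was admitted to the icu".split(), mask_prob=0.3, seed=1)
```

Because such examples are generated automatically from raw text, the amount of pretraining data is limited only by the size of the corpus, which is what makes language-modeling pretraining cheap to scale.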
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MATERIALS AND METHODS</head><p>Using EHR data from 2 institutions, we assess the impact of pretraining on data from different institutions and settings on our downstream document classification tasks. We collect 1 general and 2 intensive care unit (ICU) corpora from the 2 institutions and 2 downstream task datasets from each institution. Using the 3 pretraining corpora, we create 3 pretrained models and evaluate the performance of each model trained on each downstream task. Here, training refers to updating a model's weights from those learned during pretraining to a new task. In this work we focus on using the Bidirectional Encoder Representations from Transformers (BERT) <ref type="bibr">16</ref> model, as it has been shown to be a strong baseline for state-of-the-art pretrained models. Models are pretrained using the pretraining tasks outlined in the original BERT article. <ref type="bibr">16</ref> A more detailed explanation of the pretraining methods used in this work can be found in Supplementary Appendix A. The proposed experimental design, outlined in Figure <ref type="figure">2A</ref> and 2B, allows us to measure the impact of whether the pretraining and downstream task data come from the same institution or setting. We measure impact using downstream task performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Datasets for pretraining</head><p>We leverage 3 corpora from 2 different institutions. The first institutional dataset is a collection of ICU clinical data from Beth Israel Deaconess Medical Center <ref type="bibr">31</ref> between 2001 and 2012 (MIMIC). The second institutional dataset is from Columbia University Irving Medical Center (CUIMC) between 2005 and 2015.</p><p>One general clinical corpus and 2 ICU-specific corpora were generated. GEN-C is a random selection of notes from CUIMC without any specification for setting. ICU-M is a random selection of notes from the ICU-specific dataset MIMIC, while ICU-C is a random selection of ICU notes from CUIMC. In this work, we control for the number of tokens and training examples in each corpus. The number of tokens in each corpus is used as a statistic to represent how much BERT has been pretrained, following the original authors of BERT. <ref type="bibr">16</ref> In order to avoid data leakage, any data from the test sets of the downstream tasks were removed from the pretraining corpora. Pretraining demographic (race and ethnicity data have been merged between MIMIC and CUIMC, which collected this information differently; specifically, the Hispanic or Latinx category for GEN-C and ICU-C is not mutually exclusive from the ethnic categories, while ICU-M kept this category mutually exclusive) and text information can be found in Tables <ref type="table">1</ref> and <ref type="table">2</ref>. More information about the pretraining data can be found in Supplementary Appendix B.</p><p>The 3 corpora, GEN-C, ICU-M, and ICU-C, allow for model performance comparisons on downstream tasks while either varying or holding constant the pretraining setting or institution. Models pretrained using GEN-C and ICU-C data are examples of pretrained models at the same institution but different settings, while models pretrained using ICU-C and ICU-M are examples of models pretrained in the same setting but at different institutions. Finally, models pretrained using GEN-C and ICU-M are examples of different institutions and different settings. All 3 of these scenarios provide insight into the importance of pretraining models at the same institution or setting.</p></div>
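Controlling the size of each corpus, as described above, can be sketched as sampling notes up to a fixed token budget. This sketch is an assumption about one way to implement the control, not the authors' pipeline; the whitespace token count and the example notes are hypothetical.

```python
import random

def sample_corpus(notes, token_budget, seed=0):
    """Randomly sample notes until a fixed token budget is reached, so
    corpora drawn from different institutions or settings are
    comparable in size. Stops at the first note that would exceed
    the budget."""
    rng = random.Random(seed)
    pool = list(notes)
    rng.shuffle(pool)
    corpus, total = [], 0
    for note in pool:
        n_tokens = len(note.split())   # crude whitespace token count
        if total + n_tokens > token_budget:
            break
        corpus.append(note)
        total += n_tokens
    return corpus, total

# Hypothetical miniature notes; a real budget would be in the billions.
corpus, total = sample_corpus(
    ["pt admitted with chest pain", "icu day 2 stable", "discharged home"],
    token_budget=8, seed=0)
```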
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Datasets for downstream tasks</head><p>Two multilabel document classification tasks were chosen to train and evaluate on. International Classification of Diseases-Ninth Revision (ICD-9) code classification was chosen, as ICD-9 codes are prevalent across many institutions and represent tasks with large but biased datasets. Social determinants of health (SDH) classification was chosen as a representative task for scenarios with smaller datasets. More importantly, these 2 downstream tasks were chosen as they can vary greatly between institutions or settings. SDH have been shown to have high lexical variation even when discussing the same concepts and can vary greatly by population and geography. <ref type="bibr">32,</ref><ref type="bibr">33</ref> For example, homelessness can be indicated by naming a local homeless shelter in which a patient resides, which would not be a readily identifiable indicator for homelessness at other institutions. Furthermore, an SDH task is a practical example for pretraining because labeling SDH is time-consuming and requires expert knowledge. ICD-9 code distribution can also vary by setting, as was shown in our datasets in which the top 50 ICD-9 codes for MIMIC (ICU setting) and CUIMC (setting agnostic) only shared 24 codes. Beyond these reasons, access to these datasets at both institutions made the experimental design possible.</p><p>Each institution, MIMIC and CUIMC, has training, validation, and test sets for both downstream tasks. ICD codes are extracted from the EHR at each institution and matched to clinical notes. We limit ourselves to classifying the top 50 ICD-9 codes at each institution but do not remove notes without any of these codes. The SDH classification corpora are annotated at a document level with 5 SDH categories: smoking status, illicit drug use status, housing status, sexuality documented, and sexual history documented. Further details are provided in previous work. 
<ref type="bibr">32</ref> All training, validation, and test splits are made at the patient level to avoid data leakage, and each model uses the same training, validation, and test sets. Table <ref type="table">3</ref> summarizes dataset sizes, while Table <ref type="table">4</ref> summarizes dataset demographics (race and ethnicity data have been merged between MIMIC and CUIMC, which collected this information differently; specifically, the Hispanic or Latinx category for CUIMC data is not mutually exclusive from the ethnic categories, while MIMIC kept this category mutually exclusive). More information about the distribution of ICD and Social and Behavioral Determinants of Health (SBDH) codes can be found in Supplementary Appendix C.</p></div>
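Patient-level splitting, as described above, can be sketched as follows. The note representation (dicts with a patient_id field) and the split fractions are illustrative assumptions, not the authors' implementation.

```python
import random

def patient_level_split(notes, frac_train=0.8, frac_val=0.1, seed=0):
    """Split notes into train/val/test by patient, so that no patient's
    notes appear in more than one split (avoiding data leakage)."""
    patients = sorted({n["patient_id"] for n in notes})
    rng = random.Random(seed)
    rng.shuffle(patients)
    n_train = int(frac_train * len(patients))
    n_val = int(frac_val * len(patients))
    train_p = set(patients[:n_train])
    val_p = set(patients[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for note in notes:
        pid = note["patient_id"]
        split = "train" if pid in train_p else "val" if pid in val_p else "test"
        splits[split].append(note)
    return splits

# Synthetic example: 30 notes over 10 patients.
notes = [{"patient_id": i % 10, "text": f"note {i}"} for i in range(30)]
splits = patient_level_split(notes)
```

Splitting by note rather than by patient would let near-duplicate notes from the same patient straddle the train/test boundary, inflating test performance.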
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental design for pretraining</head><p>The BERT pretrained model, <ref type="bibr">16</ref> which is not specialized on clinical or biomedical data, and the PubMedBERT pretrained model, <ref type="bibr">28</ref> which consists of BERT further pretrained on biomedical articles, are the 2 baseline pretrained models for our experiments. Both models were pretrained using the same tasks as the current work but relied on different pretraining data. Practically, further pretraining consists of another round of pretraining, following the BERT procedures.</p><p>Starting from PubMedBERT, we further pretrain 3 different pretrained models: BERT-IM leveraging ICU-M, BERT-GC leveraging GEN-C, and BERT-IC leveraging ICU-C. BERT models further pretrained with biomedical data have been shown to outperform BERT on clinical datasets, <ref type="bibr">26,</ref><ref type="bibr">28</ref> and PubMed presents a much larger dataset than any single clinical dataset, thus making PubMedBERT an ideal initialization for clinically relevant pretraining.</p><p>We used a learning rate of 1 × 10<hi rend="superscript">-4</hi>, a linear warm-up schedule of 10% of the total number of steps, and a batch size of 500. Finally, each observation had a maximum length of 128 tokens. Following the original pretraining data generation in Devlin et al, <ref type="bibr">16</ref> we concatenated nonoverlapping sentences up to 128 tokens in length.</p><p>Masking was carried out following the masking procedure of the original BERT article <ref type="bibr">16</ref> by masking, replacing, or leaving a token unchanged. These observations consisted of 2 segments, in which 50% of the time the second segment followed the first segment in the document and the other 50% of the time the second segment was randomly selected from the corpus. The 2 segments were used for the additional pretraining task of next sentence prediction. 
Learning was performed on 1 NVIDIA GeForce RTX 2080 Ti GPU using PyTorch with mixed precision. Each model was pretrained for 10 epochs over approximately 2.5 days.</p></div>
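The observation construction described above (two segments, a 50% chance that segment B truly follows segment A, capped at 128 tokens) can be sketched as below. This is an illustrative reconstruction of the Devlin et al procedure, not the authors' pipeline; sentences are represented as token lists, and the example data are hypothetical.

```python
import random

MAX_LEN = 128  # maximum observation length used in this work

def make_pretraining_pair(doc_sentences, corpus_sentences, rng):
    """Build one BERT-style pretraining observation: segment A plus a
    segment B that, 50% of the time, truly follows A in the document and
    otherwise is a random sentence from the corpus. The Boolean label is
    the target for next sentence prediction. (A full implementation
    would truncate segments before adding special tokens.)"""
    i = rng.randrange(len(doc_sentences) - 1)
    seg_a = doc_sentences[i]
    if rng.random() < 0.5:
        seg_b, is_next = doc_sentences[i + 1], True
    else:
        seg_b, is_next = rng.choice(corpus_sentences), False
    tokens = ["[CLS]"] + seg_a + ["[SEP]"] + seg_b + ["[SEP]"]
    return tokens[:MAX_LEN], is_next

rng = random.Random(0)
sentences = [["pt", "admitted"], ["started", "abx"],
             ["improving"], ["discharged", "home"]]
tokens, is_next = make_pretraining_pair(sentences, sentences, rng)
```

Masking (replacing a selected token with [MASK], a random token, or leaving it unchanged, per the BERT article) is then applied to each observation before the model sees it.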
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experimental design for downstream tasks</head><p>We train and test 2 downstream tasks, ICD and SDH document-level classification. For each task, we want to assess whether aligning the pretraining data and the data used for training the task itself (either by institution or setting) benefits its performance. Given a task and institution, for instance ICD classification and CUIMC, we control for training, validation, and testing sets and compare performance on the task when using different pretrained models. As such, for each corpus and task, there are 5 models that are trained, validated, and tested. We use the validation set to tune the maximum number of epochs (3, 4, 10) used during downstream task training.</p><p>Because both tasks operate at the document level, and because clinical notes are particularly long documents, each clinical note is broken down into up to 10 nonoverlapping 128-token chunks (n). Following the approach of Huang et al, <ref type="bibr">27</ref> the probability of a document's classification into category k is based on the n classified chunks. Rather than computing an average probability over the n chunks for category k, it also takes into consideration the maximum probability over all chunks using a combination of average and max pooling. Letting P n mean and P n max be the mean and max probability over all n chunks, respectively, the final probability is P = (P n max + (n/c) × P n mean) / (1 + n/c), where c is a scaling factor. <ref type="bibr">27</ref></p><p>All results are measured in macro-averaged average precision. Average precision summarizes the precision-recall curve by summing the precision at different thresholds weighted by the change in recall from the previous threshold. <ref type="bibr">34</ref> Bootstrapped performances and 95% confidence intervals are calculated by evaluating all models on 1000 bootstrapped test sets. For a given downstream task, all models are trained on the same training set and evaluated on the same 1000 bootstrapped test sets. 
All results are presented using the bootstrapped average performance and 95% confidence intervals.</p></div>
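The chunk pooling and bootstrap procedure can be sketched numerically. The pooling formula is the one from the cited Huang et al approach; the scaling factor c = 2 and the toy metric in the example are assumptions, as neither is stated in the article.

```python
import numpy as np

def document_probability(chunk_probs, c=2.0):
    """Pool per-chunk probabilities for one category into a document-level
    probability, combining max and mean pooling as in Huang et al:
        P = (P_max + P_mean * n / c) / (1 + n / c)
    where n is the number of chunks and c a scaling factor (c = 2 is an
    assumed value, not stated in the article)."""
    p = np.asarray(chunk_probs, dtype=float)
    w = len(p) / c
    return float((p.max() + p.mean() * w) / (1.0 + w))

def bootstrap_metric(y_true, y_score, metric, n_boot=1000, seed=0):
    """Evaluate a metric on n_boot resampled test sets and return the
    bootstrapped mean plus an empirical 95% confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample the test set with replacement
        scores.append(metric(y_true[idx], y_score[idx]))
    scores = np.sort(np.array(scores))
    return scores.mean(), (scores[int(0.025 * n_boot)],
                           scores[int(0.975 * n_boot)])

# Three hypothetical chunk probabilities for one document and category.
p_doc = document_probability([0.2, 0.4, 0.9])
```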
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RESULTS</head><p>All clinical models outperform the baseline models on the 2 downstream tasks of ICD and SDH classification and across institutions. Similarly, PubMedBERT outperforms the original BERT on both downstream tasks and across institutions. We note, however, that on the SDH downstream task, the confidence intervals of PubMedBERT and BERT overlap.</p><p>For the ICD classification downstream task, we first note that a model's performance on MIMIC ICD is better than its performance on CUIMC ICD across all models. This is not surprising: while the MIMIC dataset contains only ICU admissions, the CUIMC dataset is more heterogeneous with different settings, leading to higher-perplexity tasks.</p><p>We also note that, as expected, the range of the confidence intervals for the different models across tasks and institutions is directly related to the size of training and testing data. That is, the SDH tasks, especially MIMIC SDH, have larger and possibly overlapping confidence intervals due to how small these datasets are compared with the easier and cheaper-to-label ICD datasets.</p><p>Of interest to our original research question, we see that there are slight differences in performance from one setting to the next and from one institution to the next. On balance, taking their confidence intervals into account, all clinical pretrained models yield similar performance to each other across all tasks. These results are summarized in Figures <ref type="figure">3</ref> and <ref type="figure">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DISCUSSION</head><p>The pretrain and transfer paradigm in NLP has led to an explosion of domain-specific models that have achieved state-of-the-art performance across many tasks. In this work, we explored how well pretrained BERT models transfer across institutional and setting boundaries. We confirm previous results that as BERT is pretrained on data closer to the clinical domain, model performance improves. The clinically adapted BERT variants outperform nonclinical BERT models in 3 of 4 experiments; in the fourth experiment, the performances are tied. Overall, clinical BERT models perform similarly across institutional and setting boundaries regardless of the pretraining setting or institution. To answer the question of whether institution-specific pretraining is helpful, we conclude that there is no statistical difference between clinical BERT variants. There is evidence of small differences, specifically between BERT-IC and BERT-GC on MIMIC-ICD, in which BERT-IC outperforms BERT-GC, while this result is reversed on CUIMC-ICD. This could be evidence of the importance of matching the setting when transferring models to new institutions. However, these differences are small enough not to be considered meaningful.</p><p>While testing all available clinical BERT models might provide some performance improvement, there is no guarantee of a statistically significant performance increase even if the downstream and pretraining data match across institution or setting. These results raise the question of whether the investment in setting- or institution-specific pretraining is warranted. We note that the results presented here are not at odds with the practice of adapting specific NLP task models to new institutions or settings. 
While it may not be necessary to adapt pretrained models to new institutions or settings at the level of pretraining, it is likely still necessary to adapt such models when they have been specialized to a specific NLP task. It should be noted that this work only explores 2 document classification tasks. There might also be downstream tasks on specialized corpora in which further pretraining does confer a meaningful improvement. In the future, we plan to explore entity-level classification tasks and performance on the BLUE dataset, <ref type="bibr">28</ref> though we cannot perform a bidirectional comparison in this case without parallel datasets.</p></div></body>
		</text>
</TEI>
