Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations (BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity linking, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we provide our induced 68K biomedical type system, the corresponding 37 million triples of derived data used to train BIER models, and our best performing model.
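The core idea of a BIER-style representation can be shown in a minimal sketch: each dimension is a named entity type, and its value is an independent predicted probability. The type names, logits, and scoring model below are hypothetical illustrations, not the paper's actual 68K-type system or trained model.

```python
import math

# Hypothetical fine-grained biomedical types (the real BIER system has ~68K).
TYPES = ["disease", "protein", "enzyme", "anatomical_structure", "drug"]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def interpretable_representation(type_logits):
    """Map per-type logits (from some upstream encoder) to a vector of
    independent type probabilities. Each dimension is directly readable:
    it is the predicted probability that the entity has that type."""
    return {t: sigmoid(z) for t, z in zip(TYPES, type_logits)}

# Hypothetical logits for a mention such as "trypsin".
rep = interpretable_representation([-3.0, 4.0, 5.0, -2.0, -4.0])
top = sorted(rep.items(), key=lambda kv: -kv[1])[:2]
# The highest-probability dimensions explain the representation:
# here "enzyme" and "protein" dominate, which a user can inspect directly.
```

Because every dimension has a human-readable name, debugging reduces to ranking dimensions, which is what makes the final sparse layer useful for error analysis.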
DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation
Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.
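The intuition behind the centrality signal can be sketched on a toy value-attribute graph. DomainNet's actual scoring is more involved; this sketch uses betweenness centrality (one family of network-centrality measures) via Brandes' algorithm, and the table values and attribute names are invented for illustration.

```python
from collections import defaultdict, deque

def betweenness(graph):
    """Brandes' algorithm: betweenness centrality on an unweighted graph.
    graph maps each node to its neighbor list (edges appear in both lists)."""
    bc = dict.fromkeys(graph, 0.0)
    for s in graph:
        stack, pred = [], defaultdict(list)
        sigma = dict.fromkeys(graph, 0.0)  # shortest-path counts from s
        sigma[s] = 1.0
        dist = dict.fromkeys(graph, -1)
        dist[s] = 0
        queue = deque([s])
        while queue:  # BFS from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    pred[w].append(v)
        delta = dict.fromkeys(graph, 0.0)  # dependency accumulation
        while stack:
            w = stack.pop()
            for v in pred[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

# Toy data lake: "Jaguar" occurs under a car attribute and an animal
# attribute, so it bridges two otherwise separate regions of the
# value-attribute graph -- the signature of a homograph.
edges = [
    ("Ford", "attr:car_make"), ("Toyota", "attr:car_make"),
    ("Jaguar", "attr:car_make"),
    ("Jaguar", "attr:species"), ("Lion", "attr:species"),
    ("Tiger", "attr:species"),
]
graph = defaultdict(list)
for v, a in edges:
    graph[v].append(a)
    graph[a].append(v)

bc = betweenness(dict(graph))
values = ["Ford", "Toyota", "Jaguar", "Lion", "Tiger"]
flagged = max(values, key=lambda v: bc[v])  # the homograph candidate
```

Unambiguous values like "Ford" sit inside a single attribute neighborhood and have zero betweenness, while the homograph lies on every shortest path between the two meanings' neighborhoods, so it stands out without any supervision.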
- Award ID(s): 1956096
- PAR ID: 10557929
- Publisher / Repository: ACM
- Date Published:
- Journal Name: ACM Transactions on Database Systems
- Volume: 48
- Issue: 3
- ISSN: 0362-5915
- Page Range / eLocation ID: 1 to 40
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find other names that are synonymous, ignoring the fact that often a name string can refer to multiple entities (e.g., "apple" could refer to both Apple Inc and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually-curated synonyms for each entity stored in a knowledge base not only form a set of name strings to disambiguate the meaning for each other, but also can serve as "distant" supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually complementing signals for synonym discovery, i.e., distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals are utilized to discover synonyms for the given entities. Experimental results prove the effectiveness of the proposed framework.
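The two signal types DPE combines can be illustrated with a minimal sketch: a corpus-level distributional score (here, cosine similarity over bag-of-words contexts) and a local pattern score (here, a single regex for "A, also known as B"). The actual DPE framework learns both jointly with distant supervision; the scoring functions, combination weight, and example data below are simplified assumptions.

```python
import math
import re
from collections import Counter

def distributional_score(contexts_a, contexts_b):
    """Corpus-level signal: cosine similarity of bag-of-words context vectors."""
    va, vb = Counter(contexts_a), Counter(contexts_b)
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def pattern_score(corpus, a, b):
    """Local-context signal: count 'A, also known as B'-style patterns."""
    pat = re.compile(
        rf"{re.escape(a)},?\s+(?:also known as|aka)\s+{re.escape(b)}",
        re.IGNORECASE)
    return len(pat.findall(corpus))

def synonym_score(contexts_a, contexts_b, corpus, a, b, alpha=0.5):
    """Combine the two mutually complementing signals with a fixed weight
    (DPE instead optimizes them jointly during training)."""
    return (alpha * distributional_score(contexts_a, contexts_b)
            + (1 - alpha) * min(1, pattern_score(corpus, a, b)))

corpus = "Aspirin, also known as acetylsalicylic acid, is used to reduce fever."
score = synonym_score(["fever", "pain", "tablet"], ["fever", "pain", "drug"],
                      corpus, "aspirin", "acetylsalicylic acid")
```

The complementarity is visible even here: distributional features fire for pairs that merely share contexts, while the pattern signal fires only on explicit local evidence, so each corrects the other's false positives.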
-
Abstract Depth regulates many attributes of aquatic ecosystems, but relatively few lakes are measured, and existing datasets are biased toward large lakes. To address this, we used a large dataset of maximum (Zmax; n = 16,831) and mean (Zmean; n = 5,881) depth observations to create new depth models, focusing on lakes < 1,000 ha. We then used the models to characterize patterns in lake basin shape and volume. We included terrain metrics, water temperature and reflectance, polygon attributes, and other predictors in a random forest model. Our final models generally outperformed existing models (Zmax: root mean square error [RMSE] = 8.0 m and Zmean: RMSE = 3.0 m). Our models show that lake depth followed a Pareto distribution, with 2.8 orders of magnitude fewer lakes for an order of magnitude increase in depth. In addition, despite orders of magnitude variation in surface area, most size classes had a modal maximum depth of ~ 5 m. Concave (bowl‐shaped) lake basins represented 79% of all lakes, but lakes were more convex (funnel‐shaped) as surface area increased. Across the conterminous United States, 9.8% of all lake water was within the top meter of the water column, and 48% in the top 10 m. Excluding the Laurentian Great Lakes, we estimate the total volume in the conterminous United States is 1,057–1,294 km3, depending on whether Zmax or Zmean was modeled. Lake volume also exhibited substantial geographic variation, with high volumes in the upper Midwest, Northeast, and Florida and low volumes in the southwestern United States.
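The reported scaling ("2.8 orders of magnitude fewer lakes for an order of magnitude increase in depth") corresponds to a Pareto tail with exponent 2.8. A short sketch checks the arithmetic; the minimum-depth normalization of 1 m is an assumption for illustration.

```python
def decade_fraction(alpha, lo, z_min=1.0):
    """P(lo <= Z < 10*lo) for a depth Z following a Pareto law with
    complementary CDF P(Z > z) = (z / z_min) ** -alpha."""
    return (lo / z_min) ** -alpha - (10 * lo / z_min) ** -alpha

alpha = 2.8  # tail exponent implied by the reported scaling
# Ratio of lake counts in the 1-10 m decade vs. the 10-100 m decade.
ratio = decade_fraction(alpha, 1.0) / decade_fraction(alpha, 10.0)
# For a Pareto law this ratio is exactly 10**alpha, i.e. about 631x
# fewer lakes per tenfold increase in depth.
```

This is why depth datasets biased toward large (and typically deeper) lakes misrepresent the population: the overwhelming majority of lakes sit in the shallowest decade.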
-
Abstract Local and regional‐scaled studies point to the important role of lake type (natural lakes vs. reservoirs), surface water connectivity, and ecological context (multi‐scaled natural settings and human factors) in mediating lake responses to disturbances like drought. However, we lack an understanding at the macroscale that incorporates multiple scales (lake, watershed, region) and a variety of ecological contexts. Therefore, we used data from the LAGOS‐US research platform and applied a local water year timeframe to 62,927 US natural lakes and reservoirs across 17 ecoregions to examine how chlorophyll a responds to drought across various ecological contexts. We evaluated chlorophyll a changes relative to each lake's baseline and drought year. Drought led to lower and higher chlorophyll a in 18% and 20%, respectively, of lakes (both natural lakes and reservoirs included). Natural lakes had higher magnitudes of change and probabilities of increasing chlorophyll a during droughts than reservoirs, and these differences were particularly pronounced in isolated and highly‐connected lakes. Drought responses were also related to long‐term average lake chlorophyll a in complex ways, with a positive correlation in less productive lakes and a negative correlation in more productive lakes, and more pronounced drought responses in higher‐productivity lakes than lower‐productivity lakes. Thus, lake chlorophyll responses to drought are related to interactions between lake type and surface connectivity, long‐term average chlorophyll a, and many other multi‐scaled ecological factors (e.g., soil erodibility, minimum air temperature). These results reinforce the importance of integrating multi‐scaled ecological context to determine and predict the impacts of global changes on lakes.
-
This data set is derived from fish catch data. Data are collected annually to enable us to track the fish assemblages of eleven primary lakes (Allequash, Big Muskellunge, Crystal, Sparkling, Trout, bog lakes 27-02 [Crystal Bog] and 12-15 [Trout Bog], Mendota, Monona, Wingra and Fish). Sampling on Lakes Monona, Wingra, and Fish started in 1995; sampling on other lakes started in 1981. Sampling is done at six littoral zone sites per lake with seine, minnow or crayfish traps, and fyke nets; a boat-mounted electrofishing system samples three littoral transects. Vertically hung gill nets are used to obtain two pelagic samples per lake from the deepest point. A trammel net samples across the thermocline at two sites per lake. In the bog lakes only fyke nets and minnow traps are deployed. Parameters measured include species-level identification and lengths for all fish caught, and weight and scale samples from a subset. Derived data sets include species richness, catch per unit effort, and size distribution by species, lake, and year. Species richness for a lake is the number of fish species caught in that lake during the annual fish sampling. Hybrids captured are only included in the richness value if neither of the two hybridized species is caught in the lake that year. Fish identified only to genus or higher taxonomic level are not included if any fish identified to species within that genus or higher taxonomic level are caught. E.g., Unidentified Chub would only be included in the richness value if no other chub is caught in that lake that year. Sampling frequency: annually. Number of sites: 11. Notes: Beach seining was discontinued after 2019. 2020 data do not exist due to insufficient sampling. In 2021, sampling in Fish Lake was suspended due to significant lake level changes. Data are missing for the two bogs in 2022.
Please consult NTL's website for information on experimental lake manipulations and the DNR's website for management activities.
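The hybrid rule in the richness definition is a small algorithm in its own right. A minimal sketch, assuming hybrids are named with a " x " separator between the two parent species (a naming convention invented here for illustration; the genus-level exclusion rule is omitted for brevity):

```python
def species_richness(catches):
    """Richness for one lake-year: count species caught, including a
    hybrid only if neither of its two parent species was caught that year."""
    species = {c for c in catches if " x " not in c}
    hybrids = {c for c in catches if " x " in c}  # e.g. "bluegill x pumpkinseed"
    richness = len(species)
    for h in hybrids:
        parents = h.split(" x ")
        # The hybrid adds to richness only when both parents are absent.
        if not any(p in species for p in parents):
            richness += 1
    return richness

# A parent species is present, so the hybrid is excluded:
r1 = species_richness({"bluegill", "largemouth bass", "bluegill x pumpkinseed"})
# Neither parent is present, so the hybrid counts as one species:
r2 = species_richness({"largemouth bass", "bluegill x pumpkinseed"})
```

Encoding the rule explicitly makes the derived richness values reproducible from the raw catch records.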