Missing hierarchical is-a relations and missing concepts are common quality issues in biomedical ontologies. Non-lattice subgraphs have been extensively studied for automatically identifying missing is-a relations in biomedical ontologies like SNOMED CT. However, little is known about non-lattice subgraphs’ capability to uncover new or missing concepts in biomedical ontologies. In this work, we investigate a lexical-based intersection approach based on non-lattice subgraphs to identify potential missing concepts in SNOMED CT. We first construct lexical features of concepts using their fully specified names. Then we generate hierarchically unrelated concept pairs in non-lattice subgraphs as the candidates to derive new concepts. For each candidate pair of concepts, we conduct an order-preserving intersection based on the two concepts’ lexical features, with the intersection result serving as the potential new concept name suggested. We further perform automatic validation through terminologies in the Unified Medical Language System (UMLS) and literature in PubMed. Applying this approach to the March 2021 release of SNOMED CT US Edition, we obtained 7,702 potential missing concepts, among which 1,288 were validated through UMLS and 1,309 were validated through PubMed. The results showed that non-lattice subgraphs have the potential to facilitate suggestion of new concepts for SNOMED CT.
more »
« less
AutoName: A Corpus-Based Set Naming Framework
Inferring the set name of semantically grouped entities is useful in many tasks related to natural language processing and information retrieval. Previous studies mainly draw names from knowledge bases to ensure high quality, but that limits the candidate scope. We propose an unsupervised framework, AutoName, that exploits large-scale text corpora to name a set of query entities. Specifically, it first extracts hypernym phrases as candidate names from query-related documents via probing a pre-trained language model. A hierarchical density-based clustering is then applied to form potential concepts for these candidate names. Finally, AutoName ranks candidates and picks the top one as the set name based on constituents of the phrase and the semantic similarity of their concepts. We also contribute a new benchmark dataset for this task, consisting of 130 entity sets with name labels. Experimental results show that AutoName generates coherent and meaningful set names and significantly outperforms all compared methods. Further analyses show that AutoName is able to offer explanations for extracted names using the sentences most relevant to the corresponding concept.
more »
« less
- Award ID(s):
- 1813662
- PAR ID:
- 10276024
- Date Published:
- Journal Name:
- Proceedings of The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 21)
- Page Range / eLocation ID:
- 2101 to 2105
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Typosquatting—the practice of registering a domain name similar to another, usually well-known, domain name—is typically intended to drive traffic to a website for malicious or profit- driven purposes. In this paper we assess the current state of typosquatting, both broadly (across a wide variety of techniques) and deeply (using an extensive and novel dataset). Our breadth derives from the application of eight different candidate-generation techniques to a selection of the most popular domain names. Our depth derives from probing the resulting name set via a unique corpus comprising over 3.3B Domain Name System (DNS) records. We find that over 2.3M potential typosquatting names have been registered that resolve to an IP address. We then assess those names using a framework focused on identifying the intent of the domain from the perspectives of DNS and webpage clustering. Using the DNS information, HTTP responses, and Google SafeBrowsing, we classify the candidate typosquatting names as resolved to private IP, malicious, defensive, parked, legitimate, or unknown intents. Our findings provide the largest-scale and most-comprehensive perspective to date on typosquatting, exposing potential risks to users. Further, our methodology provides a blueprint for tracking and classifying typosquatting on an ongoing basis.more » « less
-
Typosquatting—the practice of registering a domain name similar to another, usually well-known, domain name—is typically intended to drive traffic to a website for malicious or profitdriven purposes. In this paper we assess the current state of typosquatting, both broadly (across a wide variety of techniques) and deeply (using an extensive and novel dataset). Our breadth derives from the application of eight different candidate-generation techniques to a selection of the most popular domain names. Our depth derives from probing the resulting name set via a unique corpus comprising over 3.3B Domain Name System (DNS) records. We find that over 2.3M potential typosquatting names have been registered that resolve to an IP address. We then assess those names using a framework focused on identifying the intent of the domain from the perspectives of DNS and webpage clustering. Using the DNS information, HTTP responses, and Google SafeBrowsing, we classify the candidate typosquatting names as resolved to private IP, malicious, defensive, parked, legitimate, or unknown intents. Our findings provide the largest-scale and most-comprehensive perspective to date on typosquatting, exposing potential risks to users. Further, our methodology provides a blueprint for tracking and classifying typosquatting on an ongoing basis.more » « less
-
Objective To meaningfully organize scientific computing based on evidence gathered through user feedback, build a statistical package based on the findings and provide a replication packet to run similar studies on people with different backgrounds. Method A randomized controlled trial using a weighted, ranked choice survey (n = 118) with between-subjects design having two independent variables: Language Group (Matlab, Python and R) and Method Name options. Our dependent variable was a normalized preference rating. Findings There was a very small interaction between Language Group and Method Name. Language Group did not have a statistically significant effect, but Method Name did (F(4, 27037) = 2211.23, p < .001)(𝜂2 𝑝 = .247). Finally, many names in Matlab, Python and R were ranked so poorly that they were not statistically significantly different from a random word in 63.0%, 62.2% and 30.4% of concepts respectively. Implications We found organized and structured names were ranked by a large margin, suggesting statistical programming today likely needs considerable improvement. Finally, we outline a statistical package built using these principles, provide comparison scripts and describe some of the challenges from going from simple surveys to in-practice libraries.more » « less
-
Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture "relatedness" (whether two variables are linked at all), rather than "similarity" (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs, while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT, to variable name representation and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state-of-the-art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names.more » « less
An official website of the United States government

