This content will become publicly available on November 10, 2025

Title: A randomized controlled trial on the nomenclature of scientific computing
Objective: To meaningfully organize scientific computing based on evidence gathered through user feedback, build a statistical package based on the findings, and provide a replication packet for running similar studies on people with different backgrounds.
Method: A randomized controlled trial using a weighted, ranked-choice survey (n = 118) with a between-subjects design and two independent variables: Language Group (Matlab, Python and R) and Method Name options. The dependent variable was a normalized preference rating.
Findings: There was a very small interaction between Language Group and Method Name. Language Group did not have a statistically significant effect, but Method Name did (F(4, 27037) = 2211.23, p < .001, partial η² = .247). In addition, many names in Matlab, Python and R were ranked so poorly that they were not statistically significantly different from a random word in 63.0%, 62.2% and 30.4% of concepts, respectively.
Implications: Organized and structured names were ranked higher by a large margin, suggesting that statistical programming today likely needs considerable improvement. Finally, we outline a statistical package built using these principles, provide comparison scripts, and describe some of the challenges of moving from simple surveys to in-practice libraries.
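As a reader's check on the effect size reported for Method Name, the sketch below recovers partial eta squared from the F statistic and its degrees of freedom using the standard identity η²p = (F · df_effect) / (F · df_effect + df_error). The function name `partial_eta_squared` is hypothetical and is not taken from the study's replication packet; only the three numeric inputs come from the abstract.

```python
def partial_eta_squared(f_stat: float, df_effect: int, df_error: int) -> float:
    """Recover partial eta squared from an F statistic and its degrees of freedom.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error),
    which follows from F = (SS_effect / df_effect) / (SS_error / df_error).
    """
    numerator = f_stat * df_effect
    return numerator / (numerator + df_error)


# Values reported in the abstract for the Method Name effect:
# F(4, 27037) = 2211.23
eta_p2 = partial_eta_squared(2211.23, 4, 27037)
print(round(eta_p2, 3))  # agrees with the reported value of .247
```

The result matches the abstract to three decimal places, which suggests the reported F statistic, degrees of freedom, and effect size are internally consistent.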
Award ID(s):
2048356 2106392
PAR ID:
10581695
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Taylor and Francis
Date Published:
Journal Name:
Computer Science Education
ISSN:
0899-3408
Page Range / eLocation ID:
1 to 29
Subject(s) / Keyword(s):
Programming language usability scientific computing data science statistics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We ask the question: Are there widespread disparities in machine translations of names across race/ethnicity, and gender? We hypothesize that the translation quality of names and surrounding context will be lower for names associated with US racial and ethnic minorities due to these systems' tendencies to standardize language to predominant language patterns. We develop a dataset of names that are strongly demographically aligned and propose a translation evaluation procedure based on round-trip translation. We analyze the effect of name demographics on translation quality using generalized linear mixed effects models and find that the ability of translation systems to correctly translate female-associated names is significantly lower than male-associated names. This effect is particularly pronounced for female-associated names that are also associated with racial (Black) and ethnic (Hispanic) minorities. This disparity in translation quality between social groups for something as personal as someone's name has significant implications for people's professional, personal, and cultural identities, self-worth and ease of communication. Our findings suggest that more MT research is needed to improve the translation of names and to provide high-quality service for users regardless of gender, race, and ethnicity. 
  2. Inferring the set name of semantically grouped entities is useful in many tasks related to natural language processing and information retrieval. Previous studies mainly draw names from knowledge bases to ensure high quality, but that limits the candidate scope. We propose an unsupervised framework, AutoName, that exploits large-scale text corpora to name a set of query entities. Specifically, it first extracts hypernym phrases as candidate names from query-related documents via probing a pre-trained language model. A hierarchical density-based clustering is then applied to form potential concepts for these candidate names. Finally, AutoName ranks candidates and picks the top one as the set name based on constituents of the phrase and the semantic similarity of their concepts. We also contribute a new benchmark dataset for this task, consisting of 130 entity sets with name labels. Experimental results show that AutoName generates coherent and meaningful set names and significantly outperforms all compared methods. Further analyses show that AutoName is able to offer explanations for extracted names using the sentences most relevant to the corresponding concept.
  3. The 3i World Auchenorrhyncha database (http://dmitriev.speciesfile.org) is being migrated into TaxonWorks (http://taxonworks.org) and comprises nomenclatural data for all known Auchenorrhyncha taxa (leafhoppers, planthoppers, treehoppers, cicadas, spittle bugs). Of all those scientific names, 8,700 are unique genus-group names (which include valid genera and subgenera as well as their synonyms). According to the Rules of Zoological Nomenclature, a properly formed species-group name, when combined with a genus-group name, must agree with the latter in gender if the species-group name is or ends with a Latin or Latinized adjective or participle. This poses a double challenge for researchers describing new taxa or citing existing ones. For each species, knowledge of the part of speech is essential (nouns do not change their form when associated with different generic names). For the genus, knowledge of the gender is essential. Every time a species is transferred from one genus to another, its ending may need to be changed to form a proper new scientific name (a binominal name). In modern practice, it is important, when establishing a new name, to provide information about the etymology of the name and how it should be used in future publications: the grammatical gender for a genus, and the part of speech for a species. Older names often do not provide enough information about their etymology to allow proper construction of scientific names. That is why the literature contains numerous cases where a scientific name is not formed in conformity with the Rules of Nomenclature. An attempt was made to resolve the etymology of the generic names in Auchenorrhyncha to unify and clarify nomenclatural issues in this group of insects. In TaxonWorks, the rules of nomenclature are defined using the NOMEN ontology (https://github.com/SpeciesFileGroup/nomen).
  4. In several author name disambiguation studies, some ethnic name groups such as East Asian names are reported to be more difficult to disambiguate than others. This implies that disambiguation approaches might be improved if ethnic name groups are distinguished before disambiguation. We explore the potential of ethnic name partitioning by comparing the performance of four machine learning algorithms trained and tested on the entire data or specifically on individual name groups. Results show that ethnicity‐based name partitioning can substantially improve disambiguation performance because the individual models are better suited to their respective name groups. The improvements occur across all ethnic name groups, with different magnitudes. Performance gains in predicting matched name pairs outweigh losses in predicting nonmatched pairs. Feature (e.g., coauthor name) similarities of name pairs vary across ethnic name groups. Such differences may enable the development of ethnicity‐specific feature weights to improve prediction for specific ethnic name categories. These findings are observed for three labeled datasets with a natural distribution of problem sizes as well as one in which all ethnic name groups are controlled to have the same sizes of ambiguous names. This study is expected to motivate scholars to group author names based on ethnicity prior to disambiguation.
  5. Variable names are critical for conveying intended program behavior. Machine learning-based program analysis methods use variable name representations for a wide range of tasks, such as suggesting new variable names and bug detection. Ideally, such methods could capture semantic relationships between names beyond syntactic similarity, e.g., the fact that the names average and mean are similar. Unfortunately, previous work has found that even the best of previous representation approaches primarily capture "relatedness" (whether two variables are linked at all), rather than "similarity" (whether they actually have the same meaning). We propose VarCLR, a new approach for learning semantic representations of variable names that effectively captures variable similarity in this stricter sense. We observe that this problem is an excellent fit for contrastive learning, which aims to minimize the distance between explicitly similar inputs, while maximizing the distance between dissimilar inputs. This requires labeled training data, and thus we construct a novel, weakly-supervised variable renaming dataset mined from GitHub edits. We show that VarCLR enables the effective application of sophisticated, general-purpose language models like BERT, to variable name representation and thus also to related downstream tasks like variable name similarity search or spelling correction. VarCLR produces models that significantly outperform the state-of-the-art on IdBench, an existing benchmark that explicitly captures variable similarity (as distinct from relatedness). Finally, we contribute a release of all data, code, and pre-trained models, aiming to provide a drop-in replacement for variable representations used in either existing or future program analyses that rely on variable names. 