Title: Cracking the Code of Geo-Identifiers: Harnessing Data-Based Decision-Making for the Public Good
The accessibility of official statistics to non-expert users could be improved by applying natural language processing and deep learning models to dataset lexicons. Specifically, the semantic structure of FIPS codes offers a relatively standardized data dictionary of column names and string-variable structure: two digits identify the state, followed by three digits for the county. The technical, methodological contribution of this paper is a bibliometric analysis of scientific publications; the FIPS-code analysis indicated that between 27,954 and 1,970,000 publications attend to this geo-identifier. Within a single dataset reporting nationally representative, longitudinal survey data, 141 publications utilize FIPS data. The high incidence demonstrates research impact, yet the low proportion, only 2.0 percent of all publications using this dataset, also reveals a gap even among expert users. A data use case drawn from public health data suggests that cracking the code of geo-identifiers could advance access by helping everyday users formulate data inquiries in intuitive language.
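To make the stated two-plus-three structure concrete, here is a minimal Python sketch of splitting a county FIPS code into its state and county parts; the function name, field names, and handling shown are illustrative assumptions rather than anything prescribed by the paper.

def parse_fips(code: str) -> dict:
    """Split a 5-digit county FIPS code into state (2 digits) and county (3 digits)."""
    code = code.zfill(5)  # preserve leading zeros, e.g. "1001" -> "01001"
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"not a valid county FIPS code: {code!r}")
    return {"state_fips": code[:2], "county_fips": code[2:]}

# Example usage (the codes shown are real FIPS values):
print(parse_fips("36061"))  # {'state_fips': '36', 'county_fips': '061'}  New York County, NY
print(parse_fips("1001"))   # {'state_fips': '01', 'county_fips': '001'}  Autauga County, AL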
Award ID(s):
1934942
PAR ID:
10357276
Author(s) / Creator(s):
Editor(s):
Domenech, Josep; Vicente, María Rosalía
Date Published:
Journal Name:
International Conference on Advanced Research Methods and Analytics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Geo-obfuscation serves as a location privacy protection mechanism (LPPM), enabling mobile users to share obfuscated locations with servers rather than their exact locations. This method can protect users' location privacy when data breaches occur on the server side, since the obfuscation process is irreversible. To reduce the utility loss caused by data obfuscation, linear programming (LP) is widely employed, which, however, can suffer from a polynomial explosion of decision variables, rendering it impractical in large-scale geo-obfuscation applications. In this paper, we propose a new LPPM, called Locally Relevant Geo-obfuscation (LR-Geo), to optimize geo-obfuscation using LP in a time-efficient manner. This is achieved by confining the geo-obfuscation calculation for each user exclusively to the locations locally relevant (LR) to the user's actual location. Given the potential risk of LR locations disclosing a user's actual whereabouts, we enable users to compute the LP coefficients locally and upload only these coefficients, rather than the LR locations themselves, to the server. The server then solves the LP problem based on the received coefficients. Furthermore, we refine the LP framework by incorporating an exponential obfuscation mechanism to guarantee the indistinguishability of the obfuscation distribution across multiple users. Based on the constraint structure of the LP formulation, we apply Benders' decomposition to further enhance computational efficiency. Our theoretical analysis confirms that, despite the geo-obfuscation being calculated independently for each user, it still meets geo-indistinguishability constraints across multiple users with high probability. Finally, experimental results based on a real-world dataset demonstrate that LR-Geo outperforms existing geo-obfuscation methods in computational time, data utility, and privacy preservation.
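    As background for the LP formulation mentioned above, the following is a minimal sketch of generic LP-based geo-obfuscation, not the paper's LR-Geo algorithm: reporting probabilities z[i][j] are chosen to minimize expected distance between true and reported locations, subject to per-row probability constraints and exponential geo-indistinguishability constraints. The toy location grid, uniform prior, and epsilon value are assumptions for illustration.

import numpy as np
from scipy.optimize import linprog

locs = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # four toy locations
n = len(locs)
d = np.linalg.norm(locs[:, None, :] - locs[None, :, :], axis=2)  # pairwise distances
prior = np.full(n, 1.0 / n)  # assumed uniform prior over true locations
eps = 1.0                    # privacy budget (assumption)

# Objective: expected distance between true location i and reported location j.
c = (prior[:, None] * d).ravel()  # cost attached to z[i, j], flattened row-major

# Each row of z must be a probability distribution over reported locations.
A_eq = np.zeros((n, n * n))
for i in range(n):
    A_eq[i, i * n:(i + 1) * n] = 1.0
b_eq = np.ones(n)

# Geo-indistinguishability: z[i, j] <= exp(eps * d(i, k)) * z[k, j] for all i != k.
rows, b_ub = [], []
for i in range(n):
    for k in range(n):
        if i == k:
            continue
        for j in range(n):
            row = np.zeros(n * n)
            row[i * n + j] = 1.0
            row[k * n + j] = -np.exp(eps * d[i, k])
            rows.append(row)
            b_ub.append(0.0)

res = linprog(c, A_ub=np.vstack(rows), b_ub=np.array(b_ub),
              A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
z = res.x.reshape(n, n)  # z[i, j] = P(report location j | true location i)
print("expected utility loss:", res.fun)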
  2. McClelland, Robert; Johnson, Barry (Ed.)
    As US tax law evolves to adapt to ever-changing politico-economic realities, tax preparation software plays a significant role in helping taxpayers navigate these complexities. The dynamic nature of tax regulations makes it challenging to maintain tax software artifacts accurately and in a timely manner. The state of the art in maintaining tax prep software is time-consuming and error-prone, as it involves manual code analysis combined with expert interpretation of tax law amendments. We posit that the rigor and formality of tax amendment language, as expressed in IRS publications, makes it amenable to automatic translation into executable specifications (code). Our research efforts focus on identifying, understanding, and tackling the technical challenges in leveraging Large Language Models (LLMs), such as ChatGPT and Llama, to faithfully extract code differentials from IRS publications and automatically integrate them with the prior version of the code to automate tax prep software maintenance.
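    A minimal sketch of the kind of LLM call such a pipeline might make is shown below; it is not the authors' system. The amendment text, prompt wording, model name, and code snippet are illustrative assumptions, and the returned diff would still need expert review and testing before integration.

from openai import OpenAI

OLD_CODE = '''
def standard_deduction(filing_status: str) -> int:
    # Tax year 2022 amounts.
    return {"single": 12950, "married_joint": 25900}[filing_status]
'''

AMENDMENT = ("For taxable year 2023, the standard deduction is $13,850 for "
             "single filers and $27,700 for married filing jointly.")

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
resp = client.chat.completions.create(
    model="gpt-4o",  # model choice is an assumption
    messages=[{
        "role": "user",
        "content": ("Given this Python function:\n" + OLD_CODE +
                    "\nand this tax law amendment:\n" + AMENDMENT +
                    "\nReturn a unified diff that updates the function accordingly."),
    }],
)
print(resp.choices[0].message.content)  # candidate diff, to be reviewed before merging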
  3. Code LLMs have the potential to make it easier for non-experts to understand and write code. However, current Code LLM benchmarks rely on a single expert-written prompt per problem, making it hard to generalize their success to non-expert users. In this paper, we present a new natural-language-to-code benchmark of prompts written by a key population of non-experts: beginning programmers. StudentEval contains 1,749 prompts written by 80 students who have completed only one introductory Python course. StudentEval contains numerous non-expert prompts describing the same problem, enabling exploration of key factors in prompt success. We use StudentEval to evaluate 12 Code LLMs and find that StudentEval is a better discriminator of model performance than existing benchmarks. Our analysis of student prompting strategies reveals that nondeterministic LLM sampling can mislead students about the quality of their descriptions, a finding with key implications for Code LLMs in education.
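    As an illustration of how a single natural-language prompt can be scored against hidden tests, here is a minimal, generic pass/fail check; it is not StudentEval's actual harness, and the candidate completion and tests are made up for the example.

def passes_tests(source: str, func_name: str, tests) -> bool:
    """Execute a generated completion and check it against (args, expected) cases."""
    namespace = {}
    try:
        exec(source, namespace)  # run the model-generated code
        fn = namespace[func_name]
        return all(fn(*args) == expected for args, expected in tests)
    except Exception:
        return False  # any runtime error counts as a failure

# Hypothetical model completion for a student prompt like "add up all the numbers".
candidate = "def total(nums):\n    return sum(nums)\n"
tests = [(([1, 2, 3],), 6), (([],), 0)]
print(passes_tests(candidate, "total", tests))  # True means this sample passed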
  4. Abstract Data containing geospatial semantics, such as geotagged tweets, travel blogs, and crime reports, associates natural language texts with geographical locations. This paper presents a lens‐based visual interaction technique, GTMapLens, to flexibly browse geo‐text data on a map. It allows users to perform dynamic focus+context exploration by using movable lenses to browse geographical regions, find locations of interest, and perform comparative and drill‐down studies. Geo‐text data is visualized in a way that users can easily perceive the underlying geospatial semantics as the lens moves. Based on a requirement analysis with a cohort of multidisciplinary domain experts, a set of lens interaction techniques is developed, including keyword control, path management, context visualization, and snapshot anchors. These allow users to achieve a guided and controllable exploration of geo‐text data. A hierarchical data model enables the interactive lens operations through accelerated data retrieval from a geo‐text database. An evaluation with real‐world datasets is presented to show the usability and effectiveness of GTMapLens.
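    The snippet below is a minimal sketch of the retrieval step a movable lens implies: bucket geo-tagged text records into a coarse grid, then collect keyword counts for records falling inside a circular lens. The grid index, cell size, and record format are illustrative assumptions, not the paper's hierarchical data model.

from collections import Counter, defaultdict
import math

CELL = 0.1  # grid cell size in degrees (assumption)

def build_index(records):
    """records: iterable of (lon, lat, text); bucket them into coarse grid cells."""
    index = defaultdict(list)
    for lon, lat, text in records:
        index[(int(lon // CELL), int(lat // CELL))].append((lon, lat, text))
    return index

def lens_query(index, center, radius):
    """Return keyword counts for records within `radius` degrees of `center`."""
    cx, cy = center
    counts = Counter()
    span = int(math.ceil(radius / CELL))
    for gx in range(int(cx // CELL) - span, int(cx // CELL) + span + 1):
        for gy in range(int(cy // CELL) - span, int(cy // CELL) + span + 1):
            for lon, lat, text in index.get((gx, gy), []):
                if math.hypot(lon - cx, lat - cy) <= radius:
                    counts.update(text.lower().split())
    return counts

records = [(-87.62, 41.88, "traffic delay downtown"),
           (-87.65, 41.90, "concert downtown tonight")]
print(lens_query(build_index(records), center=(-87.63, 41.89), radius=0.05).most_common(3))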
  5. Abstract: How well do code-writing tasks measure students' knowledge of programming patterns and anti-patterns? How can we assess this knowledge more accurately? To explore these questions, we surveyed 328 intermediate CS students and measured their performance on different types of tasks, including writing code, editing someone else's code, and, if applicable, revising their own alternatively-structured code. Our tasks targeted returning a Boolean expression and using unique code within an if and else. We found that code writing sometimes underestimated student knowledge. For tasks targeting returning a Boolean expression, over 55% of students who initially wrote with non-expert structure successfully revised to expert structure when prompted, even though the prompt did not include guidance on how to improve their code. Further, over 25% of students who initially wrote non-expert code could properly edit someone else's non-expert code to expert structure. These results show that non-expert code is not a reliable indicator of deep misconceptions about the structure of expert code. Finally, although code writing is correlated with code editing, the relationship is weak: a model with code writing as the sole predictor of code editing explains less than 15% of the variance. Model accuracy improves when we include additional predictors that reflect other facets of knowledge, namely the identification of expert code and the selection of expert code as more readable than non-expert code. Together, these results indicate that a combination of code writing, revising, editing, and identification tasks can provide a more accurate assessment of student knowledge of programming patterns than code writing alone.
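    For readers unfamiliar with the two targeted patterns, the following illustrative Python pair (not taken from the survey instrument) contrasts non-expert and expert structure for returning a Boolean expression and for keeping only the unique code inside an if and else.

# Returning a Boolean expression.
def is_passing_nonexpert(score):
    if score >= 60:          # non-expert: redundant if/else around a Boolean
        return True
    else:
        return False

def is_passing_expert(score):
    return score >= 60       # expert: return the Boolean expression directly

# Unique code within an if and else.
def report_nonexpert(score):
    if score >= 60:
        label = "pass"
        print("Score:", score, label)   # non-expert: shared code duplicated in both branches
    else:
        label = "fail"
        print("Score:", score, label)

def report_expert(score):
    if score >= 60:
        label = "pass"       # expert: only the unique code stays inside the branches
    else:
        label = "fail"
    print("Score:", score, label)       # shared code appears once, outside the if/else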