Public records requests are a central mechanism for government transparency. In practice, they are slow, complex processes that require analyzing large amounts of messy, unstructured data. In this paper, we introduce RequestAtlas, a system that helps investigative journalists review large quantities of unstructured data that result from submitting many public records requests. RequestAtlas was developed through a year-long participatory design collaboration with the California Reporting Project (CRP), a journalistic collective researching police use of force and police misconduct in California. RequestAtlas helps journalists evaluate the results of public records requests for completeness and negotiate with agencies for additional information. RequestAtlas has had significant real-world impact. It has been deployed for more than a year to identify missing data in responses to public records requests and to facilitate negotiation with public records request officers. Through the process of designing and observing the use of RequestAtlas, we explore the technical challenges associated with the public records request process and the design needs of investigative journalists more generally. We argue that public records requests represent an instance of an adversarial technical relationship, in which two entities engage in a prolonged, iterative, often adversarial exchange of information. Technologists can support information-gathering efforts within these adversarial technical relationships by building flexible local solutions that help both entities account for the state of the ongoing information exchange. Additionally, we offer insights on ways to design applications that can assist investigative journalists in the inevitably significant data cleaning phase of processing large documents while supporting journalistic norms of verification and human review. Finally, we reflect on the ways that this participatory design process, despite its success, lays bare some of the limitations inherent in the public records request process and in the "request and respond" model of transparency more generally.
Dealing with Acronyms, Abbreviations, and Typos in Real-World Entity Matching
String matching is at the core of data cleaning, record matching, and information retrieval. It relies on a similarity measure that evaluates the similarity of two strings, regarding the two as a match if their similarity exceeds a user-defined threshold. In our collaboration with journalists and public defenders, we found that real-world datasets, such as the police rosters that journalists and public defenders work with, often contain acronyms, abbreviations, and typos due to errors during manual entry into, say, a spreadsheet or a form. Unfortunately, traditional similarity measures lead to low accuracy since they do not consider all three aspects together. Some recent work proposes leveraging synonym rules to improve matching, but requires these rules either to be provided upfront or to be generated prior to matching, which leads to low accuracy in our setting and similar ones. To address these limitations, we propose Smash, a simple yet effective measure of the similarity of two strings with acronyms, abbreviations, and typos, all without relying on synonym rules. We design a dynamic programming algorithm to efficiently compute this measure, along with two optimizations that improve accuracy. We show that, compared to the best baselines, including one based on ChatGPT with GPT-4, Smash improves the maximum and mean F-score by 23.5% and 110.8%, respectively. We implement Smash in OpenRefine, a graphical data cleaning tool, to facilitate its use by journalists, public defenders, and other non-programmers for data cleaning.
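The abstract does not give Smash's actual recurrence, so the following is only a minimal sketch, with invented weights and function names, of the general pattern it describes: a similarity that tolerates acronym letters, abbreviation prefixes, and small-edit typos, combined with a user-defined match threshold. Here the typo handling uses a classic dynamic-programming edit distance and tokens are aligned greedily; Smash's single unified dynamic program is more sophisticated than this.

```python
# Toy, Smash-inspired similarity -- NOT the paper's algorithm.
# Weights, thresholds, and names below are illustrative assumptions.

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via the classic dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def token_sim(a: str, b: str) -> float:
    """Score one token pair, tolerating acronyms, abbreviations, typos."""
    a, b = a.strip(".,").lower(), b.strip(".,").lower()
    if not a or not b:
        return 0.0
    if a == b:
        return 1.0
    # Acronym letter: "d" may stand for "department".
    if (len(a) == 1 and b.startswith(a)) or (len(b) == 1 and a.startswith(b)):
        return 0.9
    # Abbreviation: "dept" is a prefix of "department".
    if a.startswith(b) or b.startswith(a):
        return 0.8
    # Typo: normalized edit similarity, kept only when high.
    sim = 1 - edit_distance(a, b) / max(len(a), len(b))
    return sim if sim >= 0.7 else 0.0

def name_sim(x: str, y: str) -> float:
    """Greedy best-pair token alignment, averaged over the longer name."""
    xs, ys = x.split(), y.split()
    if not xs or not ys:
        return 0.0
    total = sum(max(token_sim(a, b) for b in ys) for a in xs)
    return total / max(len(xs), len(ys))

THRESHOLD = 0.85  # the user-defined match threshold from the abstract
print(name_sim("Dept. of Justice", "Department of Justice") >= THRESHOLD)  # True
print(name_sim("Jutsice Dept", "Justice Department"))  # typo + abbreviation
```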
- Award ID(s): 2243822
- PAR ID: 10646609
- Publisher / Repository: ACM
- Date Published:
- Journal Name: Proceedings of the VLDB Endowment
- Volume: 17
- Issue: 12
- ISSN: 2150-8097
- Page Range / eLocation ID: 4104 to 4116
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Motivation: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets.
Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover's Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%.
Availability and implementation: Data and source code for reproducing the experiments are available at https://github.com/Kingsford-Group/gtedemedtest/.
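The full FGTED/GTED machinery operates on genome graphs and is beyond this summary, but the target quantity, an Earth Mover's-style distance between string sets under edit-distance ground cost, can be sketched in a heavily simplified special case. Assuming equal-size sets with unit weights (our simplification, not the paper's setting), the optimal "flow" reduces to a minimum-cost perfect matching over pairwise edit distances:

```python
# Toy illustration of an Earth-Mover-style distance between string sets.
# Assumes equal-size sets and unit weights, so the optimal flow is just
# a minimum-cost perfect matching; the paper's EMED/FGTED are far more
# general and operate on genome graphs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, the ground cost between two strings."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]

def set_distance(xs: list[str], ys: list[str]) -> int:
    """Min-cost one-to-one assignment under edit-distance ground cost."""
    cost = np.array([[edit_distance(x, y) for y in ys] for x in xs])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return int(cost[rows, cols].sum())

# Two small "samples", each a set of sequences:
print(set_distance(["ACGT", "TTGA"], ["ACGA", "TTGG"]))  # 2
```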
-
Recognizing entity synonyms from text has become a crucial task in many entity-leveraging applications. However, discovering entity synonyms from domain-specific text corpora (e.g., news articles, scientific papers) is rather challenging. Current systems take an entity name string as input to find other names that are synonymous, ignoring the fact that a name string can often refer to multiple entities (e.g., "apple" could refer to both Apple Inc and the fruit apple). Moreover, most existing methods require training data manually created by domain experts to construct supervised learning systems. In this paper, we study the problem of automatic synonym discovery with knowledge bases, that is, identifying synonyms for knowledge base entities in a given domain-specific corpus. The manually curated synonyms for each entity stored in a knowledge base not only form a set of name strings that disambiguate the meaning for each other, but can also serve as "distant" supervision to help determine important features for the task. We propose a novel framework, called DPE, to integrate two kinds of mutually complementing signals for synonym discovery: distributional features based on corpus-level statistics and textual patterns based on local contexts. In particular, DPE jointly optimizes the two kinds of signals in conjunction with distant supervision, so that they can mutually enhance each other in the training stage. At the inference stage, both signals are utilized to discover synonyms for the given entities. Experimental results demonstrate the effectiveness of the proposed framework.
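DPE's joint training objective is not reproduced in this abstract, but the distant-supervision step it describes is simple to illustrate: synonym sets curated in a knowledge base label string pairs automatically, with no expert annotation. The sketch below, with an invented KB excerpt, shows how such labels are generated and why they are noisy for ambiguous surface forms like "apple":

```python
# Sketch of distant supervision from a knowledge base, in the spirit of
# DPE's setup. The KB contents are invented examples, not real data.
from itertools import combinations, product

# entity id -> curated synonym strings (hypothetical KB excerpt)
kb = {
    "Q312": {"Apple", "Apple Inc", "Apple Computer"},  # the company
    "Q89":  {"apple", "apples"},                       # the fruit
    "Q95":  {"Google", "Google LLC"},
}

def distant_pairs(kb: dict[str, set[str]]) -> list[tuple[str, str, int]]:
    """Label pairs: same entity's synonyms -> 1, cross-entity -> 0."""
    positives = [(a, b, 1)
                 for names in kb.values()
                 for a, b in combinations(sorted(names), 2)]
    negatives = [(a, b, 0)
                 for e1, e2 in combinations(sorted(kb), 2)
                 for a, b in product(kb[e1], kb[e2])]
    # Note: ("Apple", "apple") gets label 0 here, which is exactly the
    # surface-form ambiguity that makes naive distant labels noisy and
    # motivates combining distributional and pattern signals.
    return positives + negatives

for a, b, label in distant_pairs(kb)[:6]:
    print(label, a, "<->", b)
```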
-
Automatically extracted metadata from scholarly documents in PDF format is usually noisy and heterogeneous, often containing incomplete fields and erroneous values. One common way of cleaning metadata is to use a bibliographic reference dataset. The challenge is to match records between corpora with high precision. The existing solution, based on information retrieval and string similarity over titles, works well only if the titles are clean. We introduce a system designed to match scholarly document entities with noisy metadata against a reference dataset. The blocking function uses the classic BM25 algorithm to find matching candidates from the reference data, which has been indexed by ElasticSearch. The core components use supervised methods that combine features extracted from all available metadata fields. The system also leverages available citation information to match entities. The combination of metadata and citations achieves high accuracy that significantly outperforms the baseline method on the same test dataset. We apply this system to match the database of CiteSeerX against Web of Science, PubMed, and DBLP. The method will be deployed in the CiteSeerX system to clean metadata and link records to other scholarly big datasets.
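The paper's blocking step indexes the reference data in ElasticSearch; as a minimal in-memory stand-in (our substitution, using the rank_bm25 package and placeholder titles), the same BM25 retrieval looks like this. The candidates it returns would then be scored by the supervised matcher:

```python
# BM25 blocking sketch: retrieve match candidates for a noisy title.
# Stands in for an ElasticSearch index; titles and the whitespace
# tokenizer are placeholder choices, not the paper's configuration.
from rank_bm25 import BM25Okapi

reference_titles = [
    "Deep Residual Learning for Image Recognition",
    "Attention Is All You Need",
    "BM25 and Beyond: Probabilistic Relevance Frameworks",
]

def tokenize(s: str) -> list[str]:
    return s.lower().split()

bm25 = BM25Okapi([tokenize(t) for t in reference_titles])

# A noisy, automatically extracted title from a PDF:
query = tokenize("attention is al you need")

# The top-k candidates would go on to the supervised matching stage.
for title in bm25.get_top_n(query, reference_titles, n=2):
    print(title)
```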
-
A native cross-platform mobile app has multiple platform-specific implementations. Typically, an app is developed for one platform and then ported to the remaining ones. Translating an app from one language (e.g., Java) to another (e.g., Swift) by hand is tedious and error-prone, while automated translators either require manually defined translation rules or focus on translating APIs. To automate the translation of native cross-platform apps, we present J2SINFERER, a novel approach that iteratively infers syntactic transformation rules and API mappings from Java to Swift. Given a software corpus in both languages, J2SINFERER first identifies syntactically equivalent code based on braces and string similarity. For each pair of similar code segments, J2SINFERER then creates syntax trees of both languages, leveraging minimalist domain knowledge of language correspondence (e.g., operators and markers) to iteratively align syntax tree nodes and to infer both syntax and API mapping rules. J2SINFERER represents inferred rules as string templates, stored in a database, to translate code from Java to Swift. We evaluated J2SINFERER on four applications, using one part of the data to infer translation rules and the other part to apply the rules. With 76% in-project accuracy and 65% cross-project accuracy, J2SINFERER outperforms j2swift, a state-of-the-art Java-to-Swift conversion tool, in accuracy. As native cross-platform mobile apps grow in popularity, J2SINFERER can shorten their time to market by automating the tedious and error-prone task of source-to-source translation.
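J2SINFERER infers its rules automatically by aligning syntax trees; purely as an invented illustration of the "string template" mechanism the abstract mentions (the actual inferred rules are not shown there), here is a regex-based rewriter with two hand-written Java-to-Swift templates:

```python
# Toy application of string-template translation rules, Java -> Swift.
# J2SINFERER infers its rules from corpora; the two templates below are
# hand-written stand-ins that only demonstrate the mechanism.
import re

# (pattern, replacement) templates; capture groups carry code fragments.
RULES = [
    # System.out.println(expr);  ->  print(expr)
    (re.compile(r'System\.out\.println\((.*)\);'), r'print(\1)'),
    # final String name = expr;  ->  let name: String = expr
    (re.compile(r'final String (\w+) = (.*);'), r'let \1: String = \2'),
]

def translate(java_line: str) -> str:
    """Apply every matching template to one line of Java source."""
    for pattern, template in RULES:
        java_line = pattern.sub(template, java_line)
    return java_line

print(translate('final String greeting = "hi";'))   # let greeting: String = "hi"
print(translate('System.out.println(greeting);'))   # print(greeting)
```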