Search for: All records

Award ID contains: 2339427

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Levy, Morris (Ed.)
    Abstract: Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions—e.g., based on name and geography—and then often discretize the predictions by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach—and a tractable data-driven threshold heuristic—that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
    Free, publicly-accessible full text available February 1, 2026
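A minimal numerical sketch of the discretization effect described in the entry above: even when per-person probabilities are calibrated, labeling everyone with their single most likely class can drive the count of the smaller group to zero. The probabilities, group labels, and sample size below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical calibrated group-membership probabilities for five voters.
# Columns: P(group A), P(group B). Values are illustrative only.
probs = np.array([
    [0.60, 0.40],
    [0.70, 0.30],
    [0.55, 0.45],
    [0.65, 0.35],
    [0.60, 0.40],
])

# Expected group counts implied by the calibrated probabilities.
expected_counts = probs.sum(axis=0)                              # [3.1, 1.9]

# Argmax labeling: every voter is assigned the single most likely class.
argmax_counts = np.bincount(probs.argmax(axis=1), minlength=2)   # [5, 0]

print("expected counts:", expected_counts)
print("argmax counts:  ", argmax_counts)
# Argmax assigns zero people to the smaller group even though the calibrated
# probabilities imply roughly 1.9, which is the kind of under-count the paper
# addresses with its joint optimization and threshold heuristic.
```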
  2. Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring, the latter reflecting theoretical predictions regarding algorithmic monoculture.
    Free, publicly-accessible full text available June 18, 2026
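A small sketch of the error-agreement statistic mentioned in the entry above: among examples that two models both get wrong, how often do they give the same wrong answer? The function is a plain implementation of that definition; the predictions and labels below are made up for illustration.

```python
import numpy as np

def error_agreement(preds_a, preds_b, labels):
    """Fraction of examples, among those both models get wrong,
    on which the two models make the same wrong prediction."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    both_wrong = (preds_a != labels) & (preds_b != labels)
    if not both_wrong.any():
        return float("nan")
    return float((preds_a[both_wrong] == preds_b[both_wrong]).mean())

# Hypothetical multiple-choice answers from two models.
labels  = ["A", "B", "C", "D", "A", "C"]
model_1 = ["A", "C", "C", "A", "B", "D"]
model_2 = ["A", "C", "C", "B", "B", "A"]

print(error_agreement(model_1, model_2, labels))  # 0.5 on this toy data
```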
  3. We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., mentions being surprised or shocked) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines. 
    Free, publicly-accessible full text available June 18, 2026
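A rough sketch of the three-step pipeline described in the entry above. The actual method trains a sparse autoencoder; here scikit-learn's DictionaryLearning stands in as a lightweight sparse-coding substitute, and the LLM interpretation step is left as a commented placeholder. None of this reflects the real HypotheSAEs API; the embeddings and target are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning  # stand-in for a sparse autoencoder
from sklearn.linear_model import LassoCV

# Hypothetical inputs: precomputed text embeddings and a target variable
# (e.g., clicks per headline). Shapes and values are illustrative.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # 200 texts, 64-dim embeddings
target = rng.normal(size=200)             # e.g., log click counts

# Step 1: learn sparse features describing the embedding distribution
# (the paper uses a sparse autoencoder; DictionaryLearning is only a stand-in).
coder = DictionaryLearning(n_components=32, alpha=1.0, random_state=0)
activations = coder.fit_transform(embeddings)   # (200, 32) feature activations

# Step 2: select features that predict the target variable.
selector = LassoCV(cv=5).fit(activations, target)
selected = np.flatnonzero(selector.coef_)
print("predictive feature indices:", selected)

# Step 3 (placeholder): show each selected feature's top-activating texts to an
# LLM and ask for a natural-language interpretation; each interpretation is a
# hypothesis about what predicts the target.
for j in selected:
    top_idx = np.argsort(-activations[:, j])[:10]
    # interpretation = call_llm(texts[top_idx])   # hypothetical LLM call
```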
  4. Online marketplaces use rating systems to promote the discovery of high-quality products. However, these systems also lead to high variance in producers' economic outcomes: a new producer who sells high-quality items may unluckily receive a low rating early, severely impacting their future popularity. We investigate the design of rating systems that balance the goals of identifying high-quality products ("efficiency") and minimizing the variance in outcomes of producers of similar quality (individual "producer fairness"). We show that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce prior-weighted rating systems as an approach to managing this trade-off. Informally, the system we propose sets a system-wide prior for the quality of an incoming product; subsequently, the system updates that prior to a posterior for each product's quality based on user-generated ratings over time. We show theoretically that in markets where products accrue reviews at an equal rate, the strength of the rating system's prior determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (increasing individual fairness), but the slower the platform is in learning about true item quality (so efficiency suffers). We further analyze this trade-off in a responsive market where customers make decisions based on historical ratings. Through calibrated simulations on 19 different real-world datasets sourced from large online platforms, we show that the choice of prior strength mediates the same efficiency-consistency trade-off in this setting. Overall, we demonstrate that by tuning the prior as a design choice in a prior-weighted rating system, platforms can be intentional about the balance between efficiency and producer fairness.
    Free, publicly-accessible full text available June 7, 2026
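A minimal sketch of a prior-weighted quality estimate in the spirit of the entry above, assuming binary (positive/negative) ratings and a Beta prior. The Beta parameterization, the prior mean, and the strength values are illustrative assumptions rather than the paper's exact model.

```python
def prior_weighted_score(n_positive, n_total, prior_mean=0.7, prior_strength=10.0):
    """Posterior-mean quality estimate under a Beta prior on item quality.

    prior_strength acts like a count of pseudo-ratings: the larger it is, the
    more early ratings are discounted (more individual producer fairness), but
    the slower the estimate converges to true quality (less efficiency).
    """
    alpha = prior_mean * prior_strength + n_positive
    beta = (1.0 - prior_mean) * prior_strength + (n_total - n_positive)
    return alpha / (alpha + beta)

# A new product that unluckily receives a single negative first rating:
print(prior_weighted_score(0, 1, prior_strength=1.0))    # ~0.35: heavily penalized
print(prior_weighted_score(0, 1, prior_strength=20.0))   # ~0.67: early rating discounted
```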
  5. As the US population ages, a growing challenge is placing hospital patients who require long-term post-acute care into adult foster care facilities: small long-term nursing facilities that care for those unable to age in place because their care requirements exceed what can be delivered at home. A key challenge in patient placement is the dynamic matching process between hospital discharge coordinators looking to place patients and facilities looking for residents. We designed, built, deployed, and maintain a system to support decision making among a team of six discharge coordinators assisting in the discharge of 127 patients across 1,047 facilities in Hawai'i. Our system collects vacancy and capability data from facilities via conversational SMS and processes it to recommend facilities that discharge coordinators might contact. Findings from a 14-month deployment provide evidence for how timely, accurate information positively impacts matching efficacy. We close with lessons learned for information collection systems and provisioning platforms in similar contexts. 
    Free, publicly-accessible full text available May 2, 2026
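A toy sketch of the kind of matching query described in the entry above: recommend facilities that currently report a vacancy and cover a patient's required care capabilities. The data model and field names are assumptions for illustration, not the deployed system's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Facility:
    name: str
    has_vacancy: bool
    capabilities: set = field(default_factory=set)  # e.g. {"dementia", "wound care"}

def recommend(facilities, required):
    """Facilities reporting a vacancy whose capabilities cover the requirements."""
    return [f for f in facilities if f.has_vacancy and required <= f.capabilities]

facilities = [
    Facility("Facility A", True, {"dementia", "wound care"}),
    Facility("Facility B", False, {"dementia"}),
    Facility("Facility C", True, {"dementia"}),
]
print([f.name for f in recommend(facilities, {"dementia"})])  # ['Facility A', 'Facility C']
```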
  6. The gold standard in human-AI collaboration is complementarity: combined performance that exceeds both the human and the algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a No Free Lunch-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved for free. The result does suggest one model of collaboration with guarantees, in which one agent identifies obvious errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance for human-AI collaboration.
    Free, publicly-accessible full text available April 11, 2026
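A minimal sketch of the "flag obvious errors" style of collaboration suggested in the entry above, assuming both agents output calibrated probabilities for the positive class. The override margin is an illustrative assumption; the paper's formal construction is not reproduced here.

```python
def collaborate(p_primary, p_reviewer, override_margin=0.05):
    """Follow the primary agent's decision unless the reviewing agent is
    near-certain (within `override_margin` of 0 or 1), i.e., the reviewer
    flags what it sees as an obvious error.

    Both inputs are calibrated probabilities of the positive class.
    """
    if p_reviewer >= 1.0 - override_margin:
        return 1
    if p_reviewer <= override_margin:
        return 0
    return int(p_primary >= 0.5)

print(collaborate(0.6, 0.55))   # 1: reviewer is unsure, so defer to the primary agent
print(collaborate(0.6, 0.01))   # 0: reviewer is near-certain the label is negative
```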