Search for: All records

Award ID contains: 2339427

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full-text articles may not yet be available free of charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Levy, Morris (Ed.)
    Abstract: Racial and other demographic imputation is necessary for many applications, especially in auditing disparities and outreach targeting in political campaigns. The canonical approach is to construct continuous predictions—e.g., based on name and geography—and then often discretize the predictions by selecting the most likely class (argmax), potentially with a minimum threshold (thresholding). We study how this practice produces discretization bias. For example, we show that argmax labeling, as used by a prominent commercial voter file vendor to impute race/ethnicity, results in a substantial under-count of Black voters, e.g., by 28.2 percentage points in North Carolina. This bias can have substantial implications in downstream tasks that use such labels. We then introduce a joint optimization approach—and a tractable data-driven threshold heuristic—that can eliminate this bias, with negligible individual-level accuracy loss. Finally, we theoretically analyze discretization bias, show that calibrated continuous models are insufficient to eliminate it, and that an approach such as ours is necessary. Broadly, we warn researchers and practitioners against discretizing continuous demographic predictions without considering downstream consequences.
    Free, publicly-accessible full text available February 1, 2026
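A minimal numerical sketch of the discretization effect described in the entry above: even when per-person probabilities are calibrated, labeling everyone with their single most likely class can drive the count of the smaller group to zero. The probabilities, group labels, and sample size below are illustrative assumptions, not data from the paper.

```python
import numpy as np

# Hypothetical calibrated group-membership probabilities for five voters.
# Columns: P(group A), P(group B). Values are illustrative only.
probs = np.array([
    [0.60, 0.40],
    [0.70, 0.30],
    [0.55, 0.45],
    [0.65, 0.35],
    [0.60, 0.40],
])

# Expected group counts implied by the calibrated probabilities.
expected_counts = probs.sum(axis=0)                              # [3.1, 1.9]

# Argmax labeling: every voter is assigned the single most likely class.
argmax_counts = np.bincount(probs.argmax(axis=1), minlength=2)   # [5, 0]

print("expected counts:", expected_counts)
print("argmax counts:  ", argmax_counts)
# Argmax assigns zero people to the smaller group even though the calibrated
# probabilities imply roughly 1.9, which is the kind of under-count the paper
# addresses with its joint optimization and threshold heuristic.
```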
  2. Diversity in training data, architecture, and providers is assumed to mitigate homogeneity in LLMs. However, we lack empirical evidence on whether different LLMs differ meaningfully. We conduct a large-scale empirical evaluation on over 350 LLMs overall, using two popular leaderboards and a resume-screening task. We find substantial correlation in model errors: on one leaderboard dataset, models agree 60% of the time when both models err. We identify factors driving model correlation, including shared architectures and providers. Crucially, however, larger and more accurate models have highly correlated errors, even with distinct architectures and providers. Finally, we show the effects of correlation in two downstream tasks: LLM-as-judge evaluation and hiring, the latter reflecting theoretical predictions regarding algorithmic monoculture.
    Free, publicly-accessible full text available June 18, 2026
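A small sketch of the error-agreement statistic mentioned in the entry above: among examples that two models both get wrong, how often do they give the same wrong answer? The function is a plain implementation of that definition; the predictions and labels below are made up for illustration.

```python
import numpy as np

def error_agreement(preds_a, preds_b, labels):
    """Fraction of examples, among those both models get wrong,
    on which the two models make the same wrong prediction."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    both_wrong = (preds_a != labels) & (preds_b != labels)
    if not both_wrong.any():
        return float("nan")
    return float((preds_a[both_wrong] == preds_b[both_wrong]).mean())

# Hypothetical multiple-choice answers from two models.
labels  = ["A", "B", "C", "D", "A", "C"]
model_1 = ["A", "C", "C", "A", "B", "D"]
model_2 = ["A", "C", "C", "B", "B", "A"]

print(error_agreement(model_1, model_2, labels))  # 0.5 on this toy data
```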
  3. We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., mentions being surprised or shocked) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines. 
    Free, publicly-accessible full text available June 18, 2026
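A rough sketch of the three-step pipeline described in the entry above. The actual method trains a sparse autoencoder; here scikit-learn's DictionaryLearning stands in as a lightweight sparse-coding substitute, and the LLM interpretation step is left as a commented placeholder. None of this reflects the real HypotheSAEs API; the embeddings and target are random stand-ins.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning  # stand-in for a sparse autoencoder
from sklearn.linear_model import LassoCV

# Hypothetical inputs: precomputed text embeddings and a target variable
# (e.g., clicks per headline). Shapes and values are illustrative.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 64))   # 200 texts, 64-dim embeddings
target = rng.normal(size=200)             # e.g., log click counts

# Step 1: learn sparse features describing the embedding distribution
# (the paper uses a sparse autoencoder; DictionaryLearning is only a stand-in).
coder = DictionaryLearning(n_components=32, alpha=1.0, random_state=0)
activations = coder.fit_transform(embeddings)   # (200, 32) feature activations

# Step 2: select features that predict the target variable.
selector = LassoCV(cv=5).fit(activations, target)
selected = np.flatnonzero(selector.coef_)
print("predictive feature indices:", selected)

# Step 3 (placeholder): show each selected feature's top-activating texts to an
# LLM and ask for a natural-language interpretation; each interpretation is a
# hypothesis about what predicts the target.
for j in selected:
    top_idx = np.argsort(-activations[:, j])[:10]
    # interpretation = call_llm(texts[top_idx])   # hypothetical LLM call
```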
  4. Online marketplaces use rating systems to promote the discovery of high-quality products. However, these systems also lead to high variance in producers' economic outcomes: a new producer who sells high-quality items may unluckily receive a low rating early, severely impacting their future popularity. We investigate the design of rating systems that balance the goals of identifying high-quality products ("efficiency") and minimizing the variance in outcomes of producers of similar quality (individual "producer fairness"). We show that there is a trade-off between these two goals: rating systems that promote efficiency are necessarily less individually fair to producers. We introduce prior-weighted rating systems as an approach to managing this trade-off. Informally, the system we propose sets a system-wide prior for the quality of an incoming product; subsequently, the system updates that prior to a posterior for each product's quality based on user-generated ratings over time. We show theoretically that in markets where products accrue reviews at an equal rate, the strength of the rating system's prior determines the operating point on the identified trade-off: the stronger the prior, the more the marketplace discounts early ratings data (increasing individual fairness), but the slower the platform is in learning about true item quality (so efficiency suffers). We further analyze this trade-off in a responsive market where customers make decisions based on historical ratings. Through calibrated simulations on 19 different real-world datasets sourced from large online platforms, we show that the choice of prior strength mediates the same efficiency-consistency trade-off in this setting. Overall, we demonstrate that by tuning the prior as a design choice in a prior-weighted rating system, platforms can be intentional about the balance between efficiency and producer fairness.
    Free, publicly-accessible full text available June 7, 2026
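A minimal sketch of a prior-weighted quality estimate in the spirit of the entry above, assuming binary (positive/negative) ratings and a Beta prior. The Beta parameterization, the prior mean, and the strength values are illustrative assumptions rather than the paper's exact model.

```python
def prior_weighted_score(n_positive, n_total, prior_mean=0.7, prior_strength=10.0):
    """Posterior-mean quality estimate under a Beta prior on item quality.

    prior_strength acts like a count of pseudo-ratings: the larger it is, the
    more early ratings are discounted (more individual producer fairness), but
    the slower the estimate converges to true quality (less efficiency).
    """
    alpha = prior_mean * prior_strength + n_positive
    beta = (1.0 - prior_mean) * prior_strength + (n_total - n_positive)
    return alpha / (alpha + beta)

# A new product that unluckily receives a single negative first rating:
print(prior_weighted_score(0, 1, prior_strength=1.0))    # ~0.35: heavily penalized
print(prior_weighted_score(0, 1, prior_strength=20.0))   # ~0.67: early rating discounted
```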
  5. As the US population ages, a growing challenge is placing hospital patients who require long-term post-acute care into adult foster care facilities: small long-term nursing facilities that care for those unable to age in place because their care requirements exceed what can be delivered at home. A key challenge in patient placement is the dynamic matching process between hospital discharge coordinators looking to place patients and facilities looking for residents. We designed, built, deployed, and maintain a system to support decision making among a team of six discharge coordinators assisting in the discharge of 127 patients across 1,047 facilities in Hawai'i. Our system collects vacancy and capability data from facilities via conversational SMS and processes it to recommend facilities that discharge coordinators might contact. Findings from a 14-month deployment provide evidence for how timely, accurate information positively impacts matching efficacy. We close with lessons learned for information collection systems and provisioning platforms in similar contexts. 
    Free, publicly-accessible full text available May 2, 2026
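A toy sketch of the kind of matching query described in the entry above: recommend facilities that currently report a vacancy and cover a patient's required care capabilities. The data model and field names are assumptions for illustration, not the deployed system's schema.

```python
from dataclasses import dataclass, field

@dataclass
class Facility:
    name: str
    has_vacancy: bool
    capabilities: set = field(default_factory=set)  # e.g. {"dementia", "wound care"}

def recommend(facilities, required):
    """Facilities reporting a vacancy whose capabilities cover the requirements."""
    return [f for f in facilities if f.has_vacancy and required <= f.capabilities]

facilities = [
    Facility("Facility A", True, {"dementia", "wound care"}),
    Facility("Facility B", False, {"dementia"}),
    Facility("Facility C", True, {"dementia"}),
]
print([f.name for f in recommend(facilities, {"dementia"})])  # ['Facility A', 'Facility C']
```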
  6. The gold standard in human-AI collaboration is complementarity: combined performance that exceeds both the human and the algorithm alone. We investigate this challenge in binary classification settings where the goal is to maximize 0-1 accuracy. Given two or more agents who can make calibrated probabilistic predictions, we show a No Free Lunch-style result. Any deterministic collaboration strategy (a function mapping calibrated probabilities into binary classifications) that does not essentially always defer to the same agent will sometimes perform worse than the least accurate agent. In other words, complementarity cannot be achieved for free. The result does suggest one model of collaboration with guarantees, in which one agent identifies obvious errors of the other agent. We also use the result to understand the necessary conditions enabling the success of other collaboration techniques, providing guidance for human-AI collaboration.
    Free, publicly-accessible full text available April 11, 2026
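A minimal sketch of the "flag obvious errors" style of collaboration suggested in the entry above, assuming both agents output calibrated probabilities for the positive class. The override margin is an illustrative assumption; the paper's formal construction is not reproduced here.

```python
def collaborate(p_primary, p_reviewer, override_margin=0.05):
    """Follow the primary agent's decision unless the reviewing agent is
    near-certain (within `override_margin` of 0 or 1), i.e., the reviewer
    flags what it sees as an obvious error.

    Both inputs are calibrated probabilities of the positive class.
    """
    if p_reviewer >= 1.0 - override_margin:
        return 1
    if p_reviewer <= override_margin:
        return 0
    return int(p_primary >= 0.5)

print(collaborate(0.6, 0.55))   # 1: reviewer is unsure, so defer to the primary agent
print(collaborate(0.6, 0.01))   # 0: reviewer is near-certain the label is negative
```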