Search for: All records

Award ID contains: 1901386


  1. Abstract

    Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest study of scientific code to date and find that notebooks that devote a higher fraction of code to the typically labor-intensive process of wrangling data are, in expectation, associated with lower citation counts for the corresponding papers. We also show significant differences between academic and non-academic notebooks, including that academic notebooks devote substantially more code to wrangling and exploring data, and less to modeling.

     
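
     A minimal sketch of the two input views this abstract describes (the abstract syntax tree and the surrounding natural language comments of a notebook cell), assuming plain Python cell source. This is illustrative only, not the paper's transformer architecture, and the function name cell_to_views is hypothetical.

        import ast
        import tokenize
        from io import StringIO

        def cell_to_views(cell_source):
            # Parse the cell into an AST and list the node types: one input view.
            tree = ast.parse(cell_source)
            node_types = [type(node).__name__ for node in ast.walk(tree)]
            # Collect the natural language comments: the second input view.
            comments = [
                tok.string.lstrip("# ").rstrip()
                for tok in tokenize.generate_tokens(StringIO(cell_source).readline)
                if tok.type == tokenize.COMMENT
            ]
            return node_types, comments

        nodes, text = cell_to_views("# load the raw csv\ndf = pd.read_csv('data.csv')")
        print(nodes, text)  # AST node names plus ['load the raw csv']
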
  2. Multiverse analysis—a paradigm for statistical analysis that considers all combinations of reasonable analysis choices in parallel—promises to improve transparency and reproducibility. Although recent tools help analysts specify multiverse analyses, they remain difficult to use in practice. In this work, we identify debugging as a key barrier due to the latency from running analyses to detecting bugs and the scale of metadata processing needed to diagnose a bug. To address these challenges, we prototype a command-line interface tool, Multiverse Debugger, which helps diagnose bugs in the multiverse and propagate fixes. In a qualitative lab study (n=13), we use Multiverse Debugger as a probe to develop a model of debugging workflows and identify specific challenges, including difficulty in understanding the multiverse’s composition. We conclude with design implications for future multiverse analysis authoring systems. 
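
     To make the paradigm concrete, here is a minimal sketch of how a multiverse enumerates every combination of reasonable analysis choices. The decision names and options are hypothetical; this is not Multiverse Debugger itself.

        from itertools import product

        # Each decision point lists its reasonable options.
        decisions = {
            "outlier_rule":  ["none", "iqr", "z_score"],
            "transform":     ["raw", "log"],
            "covariate_set": ["minimal", "full"],
        }

        # One universe per combination: 3 * 2 * 2 = 12 parallel analyses.
        universes = [dict(zip(decisions, combo))
                     for combo in product(*decisions.values())]
        for i, universe in enumerate(universes):
            print(i, universe)  # each universe would be fit and logged here
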
  3. The in-context learning capabilities of LLMs like GPT-3 allow annotators to customize an LLM to their specific tasks with a small number of examples. However, users tend to include only the most obvious patterns when crafting examples, resulting in underspecified in-context functions that fall short on unseen cases. Further, it is hard to know when “enough” examples have been included, even for known patterns. In this work, we present ScatterShot, an interactive system for building high-quality demonstration sets for in-context learning. ScatterShot iteratively slices unlabeled data into task-specific patterns, samples informative inputs from underexplored or not-yet-saturated slices in an active learning manner, and uses an LLM together with the current example set to help users label more efficiently. In simulation studies on two text perturbation scenarios, ScatterShot sampling improves the resulting few-shot functions by 4-5 percentage points over random sampling, with less variance as more examples are added. In a user study, ScatterShot greatly helps users cover different patterns in the input space and label in-context examples more efficiently, resulting in better in-context learning and less user effort.
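
     The sketch below illustrates the sampling idea described above, preferring unlabeled inputs from slices with little labeled coverage. The slicing function and the inverse-count weighting are stand-ins, not ScatterShot's actual implementation.

        import random
        from collections import Counter

        def sample_next(unlabeled, slice_of, labeled_slices, k=2):
            # Weight each candidate inversely to how often its slice has
            # already been labeled, so underexplored slices surface first.
            counts = Counter(labeled_slices)
            def weight(x):
                return 1.0 / (1 + counts[slice_of(x)])
            return sorted(unlabeled,
                          key=lambda x: (-weight(x), random.random()))[:k]

        # Toy usage: slices are input lengths binned into short vs. long.
        pool = ["ok", "fix typo", "rewrite this long sentence for clarity"]
        by_length = lambda s: "short" if len(s) < 12 else "long"
        print(sample_next(pool, by_length, ["short", "short", "long"]))
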
  4. Despite increased interest in wearables as tools for detecting various health conditions, there are as yet no large public benchmarks for such mobile sensing data. The few datasets that are available contain data from no more than dozens of individuals, lack high-resolution raw data, or lack dataloaders for easy integration into machine learning pipelines. Here, we present Homekit2020: the first large-scale public benchmark for time series classification of wearable sensor data. Our dataset contains over 14 million hours of minute-level multimodal Fitbit data, symptom reports, and ground-truth laboratory PCR influenza test results, along with an evaluation framework that mimics realistic model deployments and efficiently characterizes statistical uncertainty in model selection in the presence of extreme class imbalance. Furthermore, we implement and evaluate nine neural and non-neural time series classification models on our benchmark across 450 total training runs in order to establish state-of-the-art performance.
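
     A hypothetical sketch of the kind of dataloader interface the benchmark describes, wrapping days of minute-level multimodal data with PCR labels for a machine learning pipeline. This is not Homekit2020's actual API; the class name and shapes are assumptions.

        import numpy as np
        import torch
        from torch.utils.data import Dataset, DataLoader

        class MinuteLevelDataset(Dataset):
            # One example = one day of minute-level signals plus a PCR label.
            def __init__(self, signals, labels):
                # signals: (n_days, 1440, n_channels); labels: (n_days,)
                self.signals = torch.from_numpy(signals).float()
                self.labels = torch.from_numpy(labels).long()

            def __len__(self):
                return len(self.labels)

            def __getitem__(self, idx):
                return self.signals[idx], self.labels[idx]

        # Toy data: 8 days, 3 channels (e.g., heart rate, steps, sleep stage).
        ds = MinuteLevelDataset(np.random.rand(8, 1440, 3),
                                np.random.randint(0, 2, size=8))
        for x, y in DataLoader(ds, batch_size=4):
            print(x.shape, y.shape)  # [4, 1440, 3] and [4]
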
  5. Abstract An unhealthy diet is a major risk factor for chronic diseases including cardiovascular disease, type 2 diabetes, and cancer [1–4]. Limited access to healthy food options may contribute to unhealthy diets [5,6]. Studying diets is challenging: studies are typically restricted to small sample sizes and single locations and use non-uniform designs, which has led to mixed results on the impact of the food environment [7–23]. Here we leverage smartphones to track diet health, operationalized through the self-reported consumption of fresh fruits and vegetables, fast food, and soda, as well as body mass index status, in a country-wide observational study of 1,164,926 U.S. participants (MyFitnessPal app users) and 2.3 billion food entries, to study the independent contributions of fast food and grocery store access, income, and education to diet health outcomes. This study constitutes the largest nationwide study to date examining the relationship between the food environment and diet. We find that higher access to grocery stores, lower access to fast food, higher income, and college education are independently associated with higher consumption of fresh fruits and vegetables, lower consumption of fast food and soda, and lower likelihood of being affected by overweight or obesity. However, these associations vary significantly across zip codes with predominantly Black, Hispanic, or white populations. For instance, high grocery store access has a significantly larger association with higher fruit and vegetable consumption in zip codes with predominantly Hispanic populations (7.4% difference) and Black populations (10.2% difference) than in zip codes with predominantly white populations (1.7% difference). Policy targeted at improving food access, income, and education may increase healthy eating, but intervention allocation may need to be optimized for specific subpopulations and locations.
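
     A minimal sketch of the "independent contributions" framing above: regress a diet outcome on food access, income, and education jointly, so each coefficient is adjusted for the others. The data are synthetic and the coefficients arbitrary; this is not the study's code.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        n = 1000
        X = np.column_stack([
            rng.normal(size=n),         # grocery store access (standardized)
            rng.normal(size=n),         # fast food access
            rng.normal(size=n),         # income
            rng.integers(0, 2, size=n)  # college education indicator
        ])
        # e.g., a fruit and vegetable consumption score.
        y = (0.4 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + 0.3 * X[:, 3]
             + rng.normal(size=n))

        fit = sm.OLS(y, sm.add_constant(X)).fit()
        print(fit.params)  # each slope holds the other predictors fixed
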
  6. Abstract The COVID-19 pandemic has stimulated important changes in online information access as digital engagement became necessary to meet the demand for health, economic, and educational resources. Our analysis of 55 billion everyday web search interactions during the pandemic across 25,150 US ZIP codes reveals that the extent to which different communities of internet users enlist digital resources varies based on socioeconomic and environmental factors. For example, we find that ZIP codes with lower income intensified their access to health information to a smaller extent than ZIP codes with higher income. We show that ZIP codes with higher proportions of Black or Hispanic residents intensified their access to unemployment resources to a greater extent, revealing patterns of unemployment site visits unseen in the claims data. Such differences frame important questions on the relationship between differential information search behaviors and the downstream real-world implications for more and less advantaged populations.
  7. Making online social communities ‘better’ is a challenging undertaking, as online communities are extraordinarily varied in their size, topical focus, and governance. As such, what is valued by one community may not be valued by another. However, community values are challenging to measure as they are rarely explicitly stated. In this work, we measure community values through the first large-scale survey of community values, including 2,769 reddit users in 2,151 unique subreddits. Through a combination of survey responses and a quantitative analysis of publicly available reddit data, we characterize how these values vary within and across communities. Amongst other findings, we show that community members disagree about how safe their communities are, that longstanding communities place 30.1% more importance on trustworthiness than newer communities, and that community moderators want their communities to be 56.7% less democratic than non-moderator community members. These findings have important implications, including suggesting that care must be taken to protect vulnerable community members, and that participatory governance strategies may be difficult to implement. Accurate and scalable modeling of community values enables research and governance which is tuned to each community's different values. To this end, we demonstrate that a small number of automatically quantifiable features capture a significant yet limited amount of the variation in values between communities, with a ROC AUC of 0.667 on a binary classification task. However, substantial variation remains, and modeling community values remains an important topic for future work. We make our models and data public to inform community design and governance.
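
     The sketch below mirrors the final modeling task described above: predict a binary community value from a handful of automatically quantifiable features and score with ROC AUC. The features and labels are synthetic stand-ins, not the reddit survey data.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n = 2000
        X = np.column_stack([
            rng.normal(size=n),  # log community size
            rng.normal(size=n),  # community age
            rng.normal(size=n),  # moderator-to-member ratio
        ])
        # e.g., whether a community rates trustworthiness as highly important.
        y = (0.8 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        clf = LogisticRegression().fit(X_tr, y_tr)
        print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
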
  8. Proper statistical modeling incorporates domain theory about how concepts relate and details of how data were measured. However, data analysts currently lack tool support for recording and reasoning about domain assumptions, data collection, and modeling choices in an integrated manner, leading to mistakes that can compromise scientific validity. For instance, generalized linear mixed-effects models (GLMMs) help answer complex research questions, but omitting random effects impairs the generalizability of results. To address this need, we present Tisane, a mixed-initiative system for authoring generalized linear models with and without mixed effects. Tisane introduces a study design specification language for expressing and asking questions about relationships between variables. Tisane contributes an interactive compilation process that represents relationships in a graph, infers candidate statistical models, and asks follow-up questions to disambiguate user queries in order to construct a valid model. In case studies with three researchers, we find that Tisane helps them focus on their goals and assumptions while avoiding past mistakes.
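
     An illustrative sketch (not Tisane's actual specification language or API) of the compilation idea above: express relationships between variables as a graph, then read off candidate fixed and random effects for a generalized linear mixed-effects model. All variable names are hypothetical.

        # Edges record how variables relate in the study design.
        relationships = {
            ("tutoring", "test_score"): "causes",        # effect of interest
            ("motivation", "test_score"): "associates",  # candidate fixed effect
            ("classroom", "student"): "nests",           # grouping -> random intercept
        }

        fixed = [a for (a, b), rel in relationships.items()
                 if b == "test_score" and rel in ("causes", "associates")]
        groups = [a for (a, b), rel in relationships.items() if rel == "nests"]

        # Candidate model: test_score ~ tutoring + motivation + (1 | classroom)
        formula = (f"test_score ~ {' + '.join(fixed)} + "
                   + " + ".join(f"(1 | {g})" for g in groups))
        print(formula)
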
  9. Data analysis requires translating higher-level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation, and discuss implications for future tools.
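
     A toy illustration of the hypothesis formalization process characterized above: decompose a conceptual hypothesis, select proxy variables, and write a statistical model constrained by the data collection design. All names are hypothetical.

        hypothesis = "Screen time harms adolescent wellbeing"

        # Step 1: select measurable proxies for each concept.
        proxies = {
            "screen_time": "self_reported_daily_hours",
            "wellbeing":   "phq9_score",
        }

        # Step 2: formalize as a model; repeated measures per participant
        # in the collection design suggest a random intercept.
        formula = (f"{proxies['wellbeing']} ~ {proxies['screen_time']}"
                   " + (1 | participant)")
        print(formula)
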