Search for: All records

Award ID contains: 1901386


  1. Abstract

    Large-scale analysis of source code, and in particular scientific source code, holds the promise of better understanding the data science process, identifying analytical best practices, and providing insights to the builders of scientific toolkits. However, large corpora have remained unanalyzed in depth, as descriptive labels are absent and require expert domain knowledge to generate. We propose a novel weakly supervised transformer-based architecture for computing joint representations of code from both abstract syntax trees and surrounding natural language comments. We then evaluate the model on a new classification task for labeling computational notebook cells as stages in the data analysis process, from data import to wrangling, exploration, modeling, and evaluation. We show that our model, leveraging only easily available weak supervision, achieves a 38% increase in accuracy over expert-supplied heuristics and outperforms a suite of baselines. Our model enables us to examine a set of 118,000 Jupyter Notebooks to uncover common data analysis patterns. Focusing on notebooks with relationships to academic articles, we conduct the largest study of scientific code to date and find that notebooks that devote a higher fraction of code to the typically labor-intensive process of wrangling data are, in expectation, associated with lower citation counts for the corresponding papers. We also show significant differences between academic and non-academic notebooks, including that academic notebooks devote substantially more code to wrangling and exploring data, and less to modeling.

     
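
     A minimal sketch of the two input views this abstract describes (the abstract syntax tree and the surrounding natural language comments of a notebook cell), assuming plain Python cell source. This is illustrative only, not the paper's transformer architecture, and the function name cell_to_views is hypothetical.

        import ast
        import tokenize
        from io import StringIO

        def cell_to_views(cell_source):
            # Parse the cell into an AST and list the node types: one input view.
            tree = ast.parse(cell_source)
            node_types = [type(node).__name__ for node in ast.walk(tree)]
            # Collect the natural language comments: the second input view.
            comments = [
                tok.string.lstrip("# ").rstrip()
                for tok in tokenize.generate_tokens(StringIO(cell_source).readline)
                if tok.type == tokenize.COMMENT
            ]
            return node_types, comments

        nodes, text = cell_to_views("# load the raw csv\ndf = pd.read_csv('data.csv')")
        print(nodes, text)  # AST node names plus ['load the raw csv']
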
  2. Multiverse analysis—a paradigm for statistical analysis that considers all combinations of reasonable analysis choices in parallel—promises to improve transparency and reproducibility. Although recent tools help analysts specify multiverse analyses, they remain difficult to use in practice. In this work, we identify debugging as a key barrier due to the latency from running analyses to detecting bugs and the scale of metadata processing needed to diagnose a bug. To address these challenges, we prototype a command-line interface tool, Multiverse Debugger, which helps diagnose bugs in the multiverse and propagate fixes. In a qualitative lab study (n=13), we use Multiverse Debugger as a probe to develop a model of debugging workflows and identify specific challenges, including difficulty in understanding the multiverse’s composition. We conclude with design implications for future multiverse analysis authoring systems. 
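
     To make the paradigm concrete, here is a minimal sketch of how a multiverse enumerates every combination of reasonable analysis choices. The decision names and options are hypothetical; this is not Multiverse Debugger itself.

        from itertools import product

        # Each decision point lists its reasonable options.
        decisions = {
            "outlier_rule":  ["none", "iqr", "z_score"],
            "transform":     ["raw", "log"],
            "covariate_set": ["minimal", "full"],
        }

        # One universe per combination: 3 * 2 * 2 = 12 parallel analyses.
        universes = [dict(zip(decisions, combo))
                     for combo in product(*decisions.values())]
        for i, universe in enumerate(universes):
            print(i, universe)  # each universe would be fit and logged here
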
  3. The in-context learning capabilities of LLMs like GPT-3 allow annotators to customize an LLM to their specific tasks with a small number of examples. However, users tend to include only the most obvious patterns when crafting examples, resulting in underspecified in-context functions that fall short on unseen cases. Further, it is hard to know when “enough” examples have been included, even for known patterns. In this work, we present ScatterShot, an interactive system for building high-quality demonstration sets for in-context learning. ScatterShot iteratively slices unlabeled data into task-specific patterns, samples informative inputs from underexplored or not-yet-saturated slices in an active learning manner, and uses an LLM together with the current example set to help users label more efficiently. In simulation studies on two text perturbation scenarios, ScatterShot sampling improves the resulting few-shot functions by 4-5 percentage points over random sampling, with less variance as more examples are added. In a user study, ScatterShot greatly helps users cover different patterns in the input space and label in-context examples more efficiently, resulting in better in-context learning and less user effort.
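
     The sketch below illustrates the sampling idea described above, preferring unlabeled inputs from slices with little labeled coverage. The slicing function and the inverse-count weighting are stand-ins, not ScatterShot's actual implementation.

        import random
        from collections import Counter

        def sample_next(unlabeled, slice_of, labeled_slices, k=2):
            # Weight each candidate inversely to how often its slice has
            # already been labeled, so underexplored slices surface first.
            counts = Counter(labeled_slices)
            def weight(x):
                return 1.0 / (1 + counts[slice_of(x)])
            return sorted(unlabeled,
                          key=lambda x: (-weight(x), random.random()))[:k]

        # Toy usage: slices are input lengths binned into short vs. long.
        pool = ["ok", "fix typo", "rewrite this long sentence for clarity"]
        by_length = lambda s: "short" if len(s) < 12 else "long"
        print(sample_next(pool, by_length, ["short", "short", "long"]))
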
  4. Despite increased interest in wearables as tools for detecting various health conditions, there are as yet no large public benchmarks for such mobile sensing data. The few datasets that are available contain data from no more than dozens of individuals, lack high-resolution raw data, or lack dataloaders for easy integration into machine learning pipelines. Here, we present Homekit2020: the first large-scale public benchmark for time series classification of wearable sensor data. Our dataset contains over 14 million hours of minute-level multimodal Fitbit data, symptom reports, and ground-truth laboratory PCR influenza test results, along with an evaluation framework that mimics realistic model deployments and efficiently characterizes statistical uncertainty in model selection in the presence of extreme class imbalance. Furthermore, we implement and evaluate nine neural and non-neural time series classification models on our benchmark across 450 total training runs in order to establish state-of-the-art performance.
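
     A hypothetical sketch of the kind of dataloader interface the benchmark describes, wrapping days of minute-level multimodal data with PCR labels for a machine learning pipeline. This is not Homekit2020's actual API; the class name and shapes are assumptions.

        import numpy as np
        import torch
        from torch.utils.data import Dataset, DataLoader

        class MinuteLevelDataset(Dataset):
            # One example = one day of minute-level signals plus a PCR label.
            def __init__(self, signals, labels):
                # signals: (n_days, 1440, n_channels); labels: (n_days,)
                self.signals = torch.from_numpy(signals).float()
                self.labels = torch.from_numpy(labels).long()

            def __len__(self):
                return len(self.labels)

            def __getitem__(self, idx):
                return self.signals[idx], self.labels[idx]

        # Toy data: 8 days, 3 channels (e.g., heart rate, steps, sleep stage).
        ds = MinuteLevelDataset(np.random.rand(8, 1440, 3),
                                np.random.randint(0, 2, size=8))
        for x, y in DataLoader(ds, batch_size=4):
            print(x.shape, y.shape)  # [4, 1440, 3] and [4]
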
  5. Abstract An unhealthy diet is a major risk factor for chronic diseases including cardiovascular disease, type 2 diabetes, and cancer [1–4]. Limited access to healthy food options may contribute to unhealthy diets [5,6]. Studying diets is challenging: studies are typically restricted to small sample sizes and single locations and use non-uniform designs, which has led to mixed results on the impact of the food environment [7–23]. Here we leverage smartphones to track diet health, operationalized through the self-reported consumption of fresh fruits and vegetables, fast food, and soda, as well as body mass index status, in a country-wide observational study of 1,164,926 U.S. participants (MyFitnessPal app users) and 2.3 billion food entries, to study the independent contributions of fast food and grocery store access, income, and education to diet health outcomes. This study constitutes the largest nationwide study to date examining the relationship between the food environment and diet. We find that higher access to grocery stores, lower access to fast food, higher income, and college education are independently associated with higher consumption of fresh fruits and vegetables, lower consumption of fast food and soda, and lower likelihood of being affected by overweight or obesity. However, these associations vary significantly across zip codes with predominantly Black, Hispanic, or white populations. For instance, high grocery store access has a significantly larger association with higher fruit and vegetable consumption in zip codes with predominantly Hispanic populations (7.4% difference) and Black populations (10.2% difference) than in zip codes with predominantly white populations (1.7% difference). Policy targeted at improving food access, income, and education may increase healthy eating, but intervention allocation may need to be optimized for specific subpopulations and locations.
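
     A minimal sketch of the "independent contributions" framing above: regress a diet outcome on food access, income, and education jointly, so each coefficient is adjusted for the others. The data are synthetic and the coefficients arbitrary; this is not the study's code.

        import numpy as np
        import statsmodels.api as sm

        rng = np.random.default_rng(0)
        n = 1000
        X = np.column_stack([
            rng.normal(size=n),         # grocery store access (standardized)
            rng.normal(size=n),         # fast food access
            rng.normal(size=n),         # income
            rng.integers(0, 2, size=n)  # college education indicator
        ])
        # e.g., a fruit and vegetable consumption score.
        y = (0.4 * X[:, 0] - 0.3 * X[:, 1] + 0.2 * X[:, 2] + 0.3 * X[:, 3]
             + rng.normal(size=n))

        fit = sm.OLS(y, sm.add_constant(X)).fit()
        print(fit.params)  # each slope holds the other predictors fixed
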
  6. Abstract The COVID-19 pandemic has stimulated important changes in online information access as digital engagement became necessary to meet the demand for health, economic, and educational resources. Our analysis of 55 billion everyday web search interactions during the pandemic across 25,150 US ZIP codes reveals that the extent to which different communities of internet users enlist digital resources varies based on socioeconomic and environmental factors. For example, we find that ZIP codes with lower income intensified their access to health information to a smaller extent than ZIP codes with higher income. We show that ZIP codes with higher proportions of Black or Hispanic residents intensified their access to unemployment resources to a greater extent, revealing patterns of unemployment site visits unseen in the claims data. Such differences frame important questions on the relationship between differential information search behaviors and the downstream real-world implications for more and less advantaged populations.
  7. Making online social communities ‘better’ is a challenging undertaking, as online communities are extraordinarily varied in their size, topical focus, and governance. As such, what is valued by one community may not be valued by another. However, community values are challenging to measure as they are rarely explicitly stated. In this work, we measure community values through the first large-scale survey of community values, including 2,769 reddit users in 2,151 unique subreddits. Through a combination of survey responses and a quantitative analysis of publicly available reddit data, we characterize how these values vary within and across communities. Amongst other findings, we show that community members disagree about how safe their communities are, that longstanding communities place 30.1% more importance on trustworthiness than newer communities, and that community moderators want their communities to be 56.7% less democratic than non-moderator community members. These findings have important implications, including suggesting that care must be taken to protect vulnerable community members, and that participatory governance strategies may be difficult to implement. Accurate and scalable modeling of community values enables research and governance which is tuned to each community's different values. To this end, we demonstrate that a small number of automatically quantifiable features capture a significant yet limited amount of the variation in values between communities, with a ROC AUC of 0.667 on a binary classification task. However, substantial variation remains, and modeling community values remains an important topic for future work. We make our models and data public to inform community design and governance.
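
     The sketch below mirrors the final modeling task described above: predict a binary community value from a handful of automatically quantifiable features and score with ROC AUC. The features and labels are synthetic stand-ins, not the reddit survey data.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import roc_auc_score
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        n = 2000
        X = np.column_stack([
            rng.normal(size=n),  # log community size
            rng.normal(size=n),  # community age
            rng.normal(size=n),  # moderator-to-member ratio
        ])
        # e.g., whether a community rates trustworthiness as highly important.
        y = (0.8 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        clf = LogisticRegression().fit(X_tr, y_tr)
        print(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
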
  8. Proper statistical modeling incorporates domain theory about how concepts relate and details of how data were measured. However, data analysts currently lack tool support for recording and reasoning about domain assumptions, data collection, and modeling choices in an integrated manner, leading to mistakes that can compromise scientific validity. For instance, generalized linear mixed-effects models (GLMMs) help answer complex research questions, but omitting random effects impairs the generalizability of results. To address this need, we present Tisane, a mixed-initiative system for authoring generalized linear models with and without mixed effects. Tisane introduces a study design specification language for expressing and asking questions about relationships between variables. Tisane contributes an interactive compilation process that represents relationships in a graph, infers candidate statistical models, and asks follow-up questions to disambiguate user queries in order to construct a valid model. In case studies with three researchers, we find that Tisane helps them focus on their goals and assumptions while avoiding past mistakes.
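
     An illustrative sketch (not Tisane's actual specification language or API) of the compilation idea above: express relationships between variables as a graph, then read off candidate fixed and random effects for a generalized linear mixed-effects model. All variable names are hypothetical.

        # Edges record how variables relate in the study design.
        relationships = {
            ("tutoring", "test_score"): "causes",        # effect of interest
            ("motivation", "test_score"): "associates",  # candidate fixed effect
            ("classroom", "student"): "nests",           # grouping -> random intercept
        }

        fixed = [a for (a, b), rel in relationships.items()
                 if b == "test_score" and rel in ("causes", "associates")]
        groups = [a for (a, b), rel in relationships.items() if rel == "nests"]

        # Candidate model: test_score ~ tutoring + motivation + (1 | classroom)
        formula = (f"test_score ~ {' + '.join(fixed)} + "
                   + " + ".join(f"(1 | {g})" for g in groups))
        print(formula)
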
  9. Data analysis requires translating higher-level questions and hypotheses into computable statistical models. We present a mixed-methods study aimed at identifying the steps, considerations, and challenges involved in operationalizing hypotheses into statistical models, a process we refer to as hypothesis formalization. In a formative content analysis of 50 research papers, we find that researchers highlight decomposing a hypothesis into sub-hypotheses, selecting proxy variables, and formulating statistical models based on data collection design as key steps. In a lab study, we find that analysts fixated on implementation and shaped their analyses to fit familiar approaches, even if sub-optimal. In an analysis of software tools, we find that tools provide inconsistent, low-level abstractions that may limit the statistical models analysts use to formalize hypotheses. Based on these observations, we characterize hypothesis formalization as a dual-search process balancing conceptual and statistical considerations constrained by data and computation, and discuss implications for future tools.
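
     A toy illustration of the hypothesis formalization process characterized above: decompose a conceptual hypothesis, select proxy variables, and write a statistical model constrained by the data collection design. All names are hypothetical.

        hypothesis = "Screen time harms adolescent wellbeing"

        # Step 1: select measurable proxies for each concept.
        proxies = {
            "screen_time": "self_reported_daily_hours",
            "wellbeing":   "phq9_score",
        }

        # Step 2: formalize as a model; repeated measures per participant
        # in the collection design suggest a random intercept.
        formula = (f"{proxies['wellbeing']} ~ {proxies['screen_time']}"
                   " + (1 | participant)")
        print(formula)
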