NSF PAR Search | NSF Public Access Repository


Which Evaluations Uncover Sense Representations that Actually Make Sense?

Boyd-Graber, Jordan; Guo, Fenfei (May 2020, Proceedings of the 12th Language Resources and Evaluation Conference)
Text representations are critical for modern natural language processing. One form of text representation, sense-specific embeddings, reflects a word’s sense in a sentence better than single-prototype word embeddings tied to each type. However, existing sense representations are not uniformly better: although they work well for computer-centric evaluations, they fail for human-centric tasks like inspecting a language’s sense inventory. To expose this discrepancy, we propose a new coherence evaluation for sense embeddings. We also describe a minimal model (Gumbel Attention for Sense Induction) optimized for discovering interpretable sense representations that are more coherent than existing sense embeddings.
Full Text Available
Digging into user control: perceptions of adherence and instability in transparent models

https://doi.org/10.1145/3377325.3377491

Smith-Renner, Alison; Kumar, Varun; Boyd-Graber, Jordan; Seppi, Kevin; Findlater, Leah (March 2020, IUI '20: Proceedings of the 25th International Conference on Intelligent User Interfaces)
We explore predictability and control in interactive systems where controls are easy to validate. Human-in-the-loop techniques allow users to guide unsupervised algorithms by exposing and supporting interaction with underlying model representations, increasing transparency and promising fine-grained control. However, these models must balance user input and the underlying data, meaning they sometimes update slowly, poorly, or unpredictably, either by not incorporating user input as expected (adherence) or by making other unexpected changes (instability). While prior work exposes model internals and supports user feedback, less attention has been paid to users' reactions when transparent models limit control. Focusing on interactive topic models, we explore user perceptions of control using a study where 100 participants organize documents with one of three distinct topic modeling approaches. These approaches incorporate input differently, resulting in varied adherence, stability, update speeds, and model quality. Participants disliked slow updates most, followed by lack of adherence. Instability was polarizing: some participants liked it when it surfaced interesting information, while others did not. Across modeling approaches, participants differed only in whether they noticed adherence.
Full Text Available
No Explainability without Accountability: An Empirical Study of Explanations and Feedback in Interactive ML

https://doi.org/10.1145/3313831.3376624

Smith-Renner, Alison; Fan, Ron; Birchfield, Melissa; Wu, Tongshuang; Boyd-Graber, Jordan; Weld, Daniel S.; Findlater, Leah (January 2020, CHI '20: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems)
Automatically generated explanations of how machine learning (ML) models reason can help users understand and accept them. However, explanations can have unintended consequences: promoting over-reliance or undermining trust. This paper investigates how explanations shape users' perceptions of ML models with or without the ability to provide feedback to them: (1) does revealing model flaws increase users' desire to "fix" them; (2) does providing explanations cause users to believe, wrongly, that models are introspective and will thus improve over time? Through two controlled experiments that vary model quality, we show how the combination of explanations and user feedback impacted perceptions, such as frustration and expectations of model improvement. Explanations without opportunity for feedback were frustrating with a lower quality model, while interactions between explanation and feedback for the higher quality model suggest that detailed feedback should not be requested without explanation. Users expected model correction, regardless of whether they provided feedback or received explanations.
Full Text Available
Why Didn’t You Listen to Me? Comparing User Control of Human-in-the-Loop Topic Models

https://doi.org/10.18653/v1/P19-1637

Kumar, Varun; Smith-Renner, Alison; Findlater, Leah; Seppi, Kevin; Boyd-Graber, Jordan (January 2019, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics)
To address the lack of comparative evaluation of Human-in-the-Loop Topic Modeling (HLTM) systems, we implement and evaluate three contrasting HLTM modeling approaches using simulation experiments. These approaches extend previously proposed frameworks, including constraints and informed prior-based methods. Users should have a sense of control in HLTM systems, so we propose a control metric to measure whether refinement operations’ results match users’ expectations. Informed prior-based methods provide better control than constraints, but constraints yield higher quality topics.
Full Text Available
Automatic Evaluation of Local Topic Quality

https://doi.org/10.18653/v1/P19-1076

Lund, Jeffrey; Armstrong, Piper; Fearn, Wilson; Cowley, Stephen; Byun, Courtni; Boyd-Graber, Jordan; Seppi, Kevin (January 2019, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics)
Topic models are typically evaluated with respect to the global topic distributions that they generate, using metrics such as coherence, but without regard to local (token-level) topic assignments. Token-level assignments are important for downstream tasks such as classification. Even recent models, which aim to improve the quality of these token-level topic assignments, have been evaluated only with respect to global metrics. We propose a task designed to elicit human judgments of token-level topic assignments. We use a variety of topic model types and parameters and discover that global metrics agree poorly with human assignments. Since human evaluation is expensive, we propose a variety of automated metrics to evaluate topic models at a local level. Finally, we correlate our proposed metrics with human judgments from the task on several datasets. We show that an evaluation based on the percent of topic switches correlates most strongly with human judgment of local topic quality. We suggest that this new metric, which we call consistency, be adopted alongside global metrics such as topic coherence when evaluating new topic models.
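As a rough illustration of what a metric based on the "percent of topic switches" might compute, the sketch below counts how often adjacent tokens in one document receive different topic assignments. The function name `topic_switch_rate` and the integer encoding of topics are assumptions made for this example; the abstract does not give the exact definition or normalization used in the paper, so this should not be read as the authors' implementation.

```python
from typing import Sequence


def topic_switch_rate(assignments: Sequence[int]) -> float:
    """Fraction of adjacent token pairs whose topic assignments differ.

    `assignments` holds one (hypothetical) integer topic id per token,
    in document order. Documents with fewer than two tokens have no
    adjacent pairs, so they are given a rate of 0.0 by convention.
    """
    if len(assignments) < 2:
        return 0.0
    # Compare each token's topic with the topic of the next token.
    switches = sum(
        1 for prev, cur in zip(assignments, assignments[1:]) if prev != cur
    )
    return switches / (len(assignments) - 1)
```

Under this reading, a lower switch rate means neighbouring tokens tend to share a topic, so a consistency score could be taken as one minus the rate; that orientation is likewise an assumption.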
Full Text Available
