Title: Predicting intent behind selections in scatterplot visualizations
Predicting and capturing an analyst’s intent behind a selection in a data visualization is valuable in two scenarios. First, a successful prediction of the pattern an analyst intended to select can be used to auto-complete a partial selection, which, in turn, can improve the correctness of the selection. Second, knowing the intent behind a selection can be used to improve recall and reproducibility. In this paper, we introduce methods to infer an analyst’s intent behind selections in data visualizations, such as scatterplots. We describe intents based on patterns in the data and identify algorithms that can capture these patterns. Upon an interactive selection, we compare the selected items with the results of a large set of computed patterns and use various ranking approaches to identify the best pattern for an analyst’s selection. We store annotations and the metadata needed to reconstruct a selection, such as the type of algorithm and its parameterization, in a provenance graph. We present a prototype system that implements these methods for tabular data and scatterplots. Analysts can select a prediction to auto-complete partial selections and to seamlessly log their intents. We discuss implications of our approach for reproducibility and reuse of analysis workflows. We evaluate our approach in a crowd-sourced study, where we show that auto-completing selections improves accuracy and that we can accurately capture pattern-based intent.
Award ID(s): 1751238
NSF-PAR ID: 10322278
Journal Name: Information Visualization
Volume: 20
Issue: 4
ISSN: 1473-8716
Sponsoring Org: National Science Foundation
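
The intent-inference step described in the abstract above (compare a partial selection against a large set of precomputed patterns, then rank the candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the Pattern class, the Jaccard-based scoring, and the auto_complete helper are hypothetical stand-ins for the paper's pattern algorithms and ranking approaches.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Pattern:
        label: str           # e.g. "cluster (DBSCAN, eps=0.5)" -- hypothetical description
        members: frozenset   # ids of the items the pattern algorithm assigned to this pattern

    def jaccard(selection, members):
        """Overlap between the analyst's selection and a pattern's member set."""
        selection = set(selection)
        if not selection and not members:
            return 0.0
        return len(selection & members) / len(selection | members)

    def rank_patterns(selection, patterns):
        """Score every precomputed pattern against the (partial) selection, best first."""
        return sorted(patterns, key=lambda p: jaccard(selection, p.members), reverse=True)

    def auto_complete(selection, patterns):
        """Replace a partial selection with the member set of the top-ranked pattern."""
        best = rank_patterns(selection, patterns)[0]
        return set(best.members)

For example, if an analyst lassoed half of a cluster, the pattern produced by the corresponding clustering run would typically receive the highest Jaccard score, and auto_complete would return the full cluster.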
More Like this
  1. Mitrovic, A.; Bosch, N. (Eds.)
    Regular expression (regex) coding has advantages for text analysis. Humans are often able to quickly construct intelligible coding rules with high precision. That is, researchers can identify words and word patterns that correctly classify examples of a particular concept, and it is often easy to identify false positives and improve the regex classifier so that positive items are accurately captured. However, ensuring that a regex list is complete is a bigger challenge, because the concepts to be identified in data are often sparsely distributed, which makes it difficult to identify examples of false negatives. For this reason, regex-based classifiers suffer from low recall; that is, they often miss items that should be classified as positive. In this paper, we provide a neural network solution to this problem by identifying a negative reversion set, in which false negative items occur much more frequently than in the data set as a whole. Thus, the regex classifier can be improved more quickly by adding missing regexes based on the false negatives found in the negative reversion set. This study used an existing data set collected from a simulation-based learning environment for which researchers had previously defined six codes and developed classifiers with validated regex lists. We randomly constructed incomplete (partial) regex lists and used neural network models to identify negative reversion sets in which the frequency of false negatives increased from a range of 3%-8% in the full data set to a range of 12%-52% in the negative reversion set. Based on this finding, we propose an interactive coding mechanism in which human-developed regex classifiers provide input for training machine learning algorithms, and the machine learning algorithms "smartly" select highly suspected false-negative items for humans to inspect so that the regex classifiers can be improved more quickly.
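    As a rough illustration of the idea in the abstract above, the sketch below pairs a regex classifier with a learned scorer to surface a negative reversion set. The regex patterns, the model_score callable, and the cutoff k are hypothetical; the study itself uses trained neural network models for the scoring step.

        import re

        # Hypothetical regex list for one code; any pattern match counts as positive.
        REGEX_LIST = [r"\bfrustrat\w*", r"\bgive up\b"]
        COMPILED = [re.compile(p, re.IGNORECASE) for p in REGEX_LIST]

        def regex_positive(text):
            return any(p.search(text) for p in COMPILED)

        def negative_reversion_set(items, model_score, k=50):
            """Among items the regex list labels negative, return the k items that a
            learned model scores as most likely positive. These are the most suspect
            false negatives, so a human coder can inspect them and add the regexes
            the list is missing. `model_score` stands in for the paper's neural
            network and can be any callable returning P(positive | text)."""
            negatives = [t for t in items if not regex_positive(t)]
            return sorted(negatives, key=model_score, reverse=True)[:k]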
  2.
    Users often need to look through multiple search result pages or reformulate queries when they have complex information-seeking needs. Conversational search systems make it possible to improve user satisfaction by asking questions that clarify users’ search intents. Answering a series of questions starting with “what/why/how” can, however, take significant effort. To quickly identify user intent and reduce effort during interactions, we propose an intent clarification task based on yes/no questions, where the system needs to ask the correct question about intents within the fewest conversation turns. In this task, it is essential to use negative feedback about the previous questions in the conversation history. To this end, we propose a Maximum Marginal Relevance (MMR)-based BERT model (MMR-BERT) that leverages negative feedback based on the MMR principle to select the next clarifying question. Experiments on the Qulac dataset show that MMR-BERT significantly outperforms state-of-the-art baselines on the intent identification task, and the selected questions also achieve significantly better performance in the associated document retrieval tasks.
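    The core of the MMR-style question selection described above can be sketched in a few lines. The relevance and similarity callables, the candidate pool, and the lambda weight are placeholders; in MMR-BERT these quantities come from BERT encodings of the questions and the conversation history.

        def select_next_question(candidates, asked_no, relevance, similarity, lam=0.7):
            """Pick the next clarifying question by Maximum Marginal Relevance (sketch).
            relevance(q): estimated match between question q and the user's intent.
            similarity(q, prev): closeness of q to a previously asked question prev
            that received a negative (no) answer. The MMR score rewards relevance
            and penalizes redundancy with questions the user has already rejected."""
            def mmr(q):
                redundancy = max((similarity(q, prev) for prev in asked_no), default=0.0)
                return lam * relevance(q) - (1.0 - lam) * redundancy
            return max(candidates, key=mmr)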
  3. Abstract

    Sudden stratospheric warmings (SSWs) are the most dramatic events in the wintertime stratosphere. Such extreme events are characterized by substantial disruption to the stratospheric polar vortex, which can be categorized into displacement and splitting types depending on the morphology of the disrupted vortex. Moreover, SSWs are usually followed by anomalous tropospheric circulation regimes that are important for subseasonal-to-seasonal prediction. Thus, monitoring the genesis and evolution of SSWs is crucial and deserves further advancement. Although several analysis methods have been used to study the evolution of SSWs, the potential of deep learning methods has not yet been explored, mainly because of the relative scarcity of observed events. To overcome the limited observational sample size, we use data from historical simulations of the Whole Atmosphere Community Climate Model version 6 to identify thousands of simulated SSWs, and use their spatial patterns to train the deep learning model. We utilize a convolutional neural network combined with a variational auto-encoder (VAE), a generative deep learning model, to construct a phase diagram that characterizes the SSW evolution. This approach not only allows us to create a latent space that encapsulates the essential features of the vortex structure during SSWs, but also offers new insights into its spatiotemporal evolution by mapping it onto the phase diagram. The constructed phase diagram depicts a continuous transition of the vortex pattern during SSWs. Notably, it provides a new perspective for discussing the evolutionary paths of SSWs: the VAE gives a better-reconstructed vortex morphology and more clearly organized vortex regimes for both displacement-type and split-type events than those obtained from principal component analysis. Our results provide an innovative phase diagram to portray the evolution of SSWs, in which particularly the splitting SSWs are better characterized. Our findings support the future use of deep learning techniques to study the underlying dynamics of extreme stratospheric vortex phenomena, and to establish a benchmark to evaluate model performance in simulating SSWs.

     
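    To make the modeling step above concrete, here is a toy convolutional VAE with a two-dimensional latent space, which is the kind of latent plane a phase diagram can be read from. The architecture, the 64x64 input size, and the loss weighting are illustrative assumptions, not the configuration used in the study.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class VortexVAE(nn.Module):
            """Toy convolutional VAE with a 2-D latent space (illustrative sizes only).
            Each 64x64 vortex snapshot maps to a point in the 2-D latent plane, so a
            sequence of snapshots traces a path that can be read as a phase diagram."""
            def __init__(self):
                super().__init__()
                self.enc = nn.Sequential(
                    nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                    nn.Flatten())
                self.mu = nn.Linear(32 * 16 * 16, 2)
                self.logvar = nn.Linear(32 * 16 * 16, 2)
                self.dec = nn.Sequential(
                    nn.Linear(2, 32 * 16 * 16), nn.ReLU(),
                    nn.Unflatten(1, (32, 16, 16)),
                    nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
                    nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1))               # 32 -> 64

            def forward(self, x):
                h = self.enc(x)
                mu, logvar = self.mu(h), self.logvar(h)
                z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
                return self.dec(z), mu, logvar

        def vae_loss(recon, x, mu, logvar):
            # Reconstruction error plus KL divergence to a standard normal prior.
            rec = F.mse_loss(recon, x, reduction="sum")
            kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
            return rec + kld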
  4.
    The use of hydro-meteorological forecasts in water resources management holds great promise as a soft pathway to improve system performance. Methods for generating synthetic forecasts of hydro-meteorological variables are crucial for robust validation of forecast use, as numerical weather prediction hindcasts are only available for a relatively short period (10–40 years) that is insufficient for assessing risk related to forecast-informed decision-making during extreme events. We develop a generalized error model for synthetic forecast generation that is applicable to a range of forecasted variables used in water resources management. The approach samples from the distribution of forecast errors over the available hindcast period and adds them to long records of observed data to generate synthetic forecasts. The approach utilizes the Skew Generalized Error Distribution (SGED) to model marginal distributions of forecast errors that can exhibit heteroskedastic, auto-correlated, and non-Gaussian behavior. An empirical copula is used to capture covariance between variables, forecast lead times, and across space. We demonstrate the method for medium-range forecasts across Northern California in two case studies for (1) streamflow and (2) temperature and precipitation, which are based on hindcasts from the NOAA/NWS Hydrologic Ensemble Forecast System (HEFS) and the NCEP GEFS/R V2 climate model, respectively. The case studies highlight the flexibility of the model and its ability to emulate space-time structures in forecasts at scales critical for water resources management. The proposed method is generalizable to other locations and computationally efficient, enabling fast generation of long synthetic forecast ensembles that are appropriate for risk analysis. 
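    A stripped-down sketch of the resampling idea in the abstract above follows. It only bootstraps raw forecast errors and adds them to observations; the full method instead fits a Skew Generalized Error Distribution to the errors and uses an empirical copula to preserve dependence across variables, lead times, and locations. The array shapes and function name are assumptions for illustration.

        import numpy as np

        def synthetic_forecasts(obs, hindcast_fc, hindcast_obs, n_members, seed=None):
            """Generate synthetic forecasts for a single site and lead time (sketch).
            obs:          long observed record, shape (T,)
            hindcast_fc:  forecasts over the hindcast period, shape (H,)
            hindcast_obs: matching observations over the hindcast period, shape (H,)
            Errors are resampled with replacement from the hindcast period and added
            to the observed record, a simplification of the SGED/copula approach."""
            rng = np.random.default_rng(seed)
            errors = hindcast_fc - hindcast_obs
            members = np.empty((n_members, obs.size))
            for i in range(n_members):
                members[i] = obs + rng.choice(errors, size=obs.size, replace=True)
            return members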