skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Exploratory and directed search strategies at a social science data archive
Researchers need to be able to find, access, and use data to participate in open science. To understand how users search for research data, we analyzed textual queries issued at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). We collected unique user queries from 988,475 user search sessions over four years (2012-16). Overall, we found that only 30% of site visitors entered search terms into the ICPSR website. We analyzed search strategies within these sessions by extending existing dataset search taxonomies to classify a subset of the 1,554 most popular queries. We identified five categories of commonly-issued queries: keyword-based (e.g., date, place, topic); name (e.g., study, series); identifier (e.g., study, series); author (e.g., institutional, individual); and type (e.g., file, format). While the dominant search strategy used short keywords to explore topics, directed searches for known items using study and series names were also common. We further distinguished exploratory browsing from directed search queries based on their page views, refinements, search depth, duration, and length. Directed queries were longer (i.e., they had more words), while sessions with exploratory queries had more refinements and associated page views. By comparing search interactions at ICPSR to other natural language interactions in similar web search contexts, we conclude that dataset search at ICPSR is underutilized. We envision how alternative search paradigms, such as those enabled by recommender systems, can enhance dataset search.  more » « less
Award ID(s):
2121789
PAR ID:
10574917
Author(s) / Creator(s):
; ;
Publisher / Repository:
IASSIST
Date Published:
Journal Name:
IASSIST Quarterly
Volume:
48
Issue:
1
ISSN:
2331-4141
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This dataset contains trace data describing user interactions with the Inter-university Consortium for Political and Social Research website (ICPSR). We gathered site usage data from Google Analytics. We focused our analysis on user sessions, which are groups of interactions with resources (e.g., website pages) and events initiated by users. ICPSR tracks a subset of user interactions (i.e., other than page views) through event triggers. We analyzed sequences of interactions with resources, including the ICPSR data catalog, variable index, data citations collected in the ICPSR Bibliography of Data-related Literature, and topical information about project archives. As part of our analysis, we calculated the total number of unique sessions and page views in the study period. Data in our study period fell between September 1, 2012, and 2016. ICPSR's website was updated and relaunched in September 2012 with new search functionality, including a Social Science Variables Database (SSVD) tool. ICPSR then reorganized its website and changed its analytics collection procedures in 2016, marking this as the cutoff date for our analysis. Data are relevant for two reasons. First, updates to the ICPSR website during the study period focused only on front-end design rather than the website's search functionality. Second, the core features of the website over the period we examined (e.g., faceted and variable search, standardized metadata, the use of controlled vocabularies, and restricted data applications) are shared with other major data archives, making it likely that the trends in user behavior we report are generalizable. 
    more » « less
  2. Social scientists increasingly share data so others can evaluate, replicate, and extend their research. To understand the process of data discovery as a precursor to data use, we study prospective users’ interactions with archived data. We gathered data for 98,000 user sessions initiated at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). Our data reflect four years (2012-16) of users’ interactions with archival resources, including a data catalog, study-level metadata, variables, and publications that cite nearly 10,000 datasets. We constructed a network of user interactions linking website landing (e.g., site entrances) to exit pages, from which we identified three types of paths that users take through the research data archive: direct, orienting, and scenic. We also interpreted points of failure (e.g., drop-offs) and recurring behaviors (e.g., sensemaking) that support or impede data discovery along search paths. We articulate strategies that users adopt as they navigate data search and suggest ways to enhance the accessibility of data, metadata, and the systems that organize each. 
    more » « less
  3. Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle- in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. Instead, table search tasks often arise in response to the need for retrieving new datasets or augment- ing existing ones, e.g., for data augmentation within data science or machine learning pipelines. Existing table repositories and bench- marks are limited in their ability to test retrieval methods for table search tasks. Thus, to close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. This novel dataset con- sists of two snapshots of the large-scale Wikipedia tables collection from 2013 and 2019 with two important additions: (1) a page and topic aware ground truth relevance judgment and (2) a large-scale DBpedia entity linking annotation. Moreover, we generate a novel set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource consists of 9,296 novel queries, 610,553 query- table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot. Similarly, on the 2019 snapshot, the resource consists of 2,560 queries, 958,214 relevance annotations, and 457,714 total tables. This makes our resource the largest annotated table- search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). We perform a user study among domain experts and prove that these annotators agree with the automatically generated relevance annotations. As a re- sult, we can re-evaluate some basic assumptions behind existing table search approaches identifying their shortcomings along with promising novel research directions. 
    more » « less
  4. null (Ed.)
    The COVID-19 pandemic has sparked un- precedented mobilization of scientists, gener- ating a deluge of papers that makes it hard for researchers to keep track and explore new directions. Search engines are designed for targeted queries, not for discovery of con- nections across a corpus. In this paper, we present SciSight, a system for exploratory search of COVID-19 research integrating two key capabilities: first, exploring associations between biomedical facets automatically ex- tracted from papers (e.g., genes, drugs, dis- eases, patient outcomes); second, combining textual and network information to search and visualize groups of researchers and their ties. SciSight1 has so far served over 15K users with over 42K page views and 13% returns 
    more » « less
  5. Abstract Many theories of human information behavior (HIB) assume that information objects are in text document format. This paper argues four important HIB theories are insufficient for describing users' search strategies for data because of assumptions about the attributes of objects that users seek. We first review and compare four HIB theories: Bates'berrypicking, Marchionni'selectronic information search, Dervin'ssense‐making, and Meho and Tibbo'ssocial scientist information‐seeking. All four theories assume that information‐seekers search for text documents. Next, we compare these theories to search behavior by analyzing Google Analytics data from the Inter‐university Consortium for Political and Social Research (ICPSR). Users took direct, scenic, and orienting paths when searching for data. We also interviewed ICPSR users (n = 20), and they said they needed dataset documentation and contextual information to find data. However, Dervin'ssense‐makingalone cannot explain the information‐seeking behaviors that we observed. Instead, what mattered most were object attributes determined by the type of information that users sought (i.e., data, not documents). We conclude by suggesting an alternative frame for building user‐centered data discovery tools. 
    more » « less