Researchers need to be able to find, access, and use data to participate in open science. To understand how users search for research data, we analyzed textual queries issued at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). We collected unique user queries from 988,475 user search sessions over four years (2012-16). Overall, we found that only 30% of site visitors entered search terms into the ICPSR website. We analyzed search strategies within these sessions by extending existing dataset search taxonomies to classify a subset of the 1,554 most popular queries. We identified five categories of commonly-issued queries: keyword-based (e.g., date, place, topic); name (e.g., study, series); identifier (e.g., study, series); author (e.g., institutional, individual); and type (e.g., file, format). While the dominant search strategy used short keywords to explore topics, directed searches for known items using study and series names were also common. We further distinguished exploratory browsing from directed search queries based on their page views, refinements, search depth, duration, and length. Directed queries were longer (i.e., they had more words), while sessions with exploratory queries had more refinements and associated page views. By comparing search interactions at ICPSR to other natural language interactions in similar web search contexts, we conclude that dataset search at ICPSR is underutilized. We envision how alternative search paradigms, such as those enabled by recommender systems, can enhance dataset search. 
                            Transforming Data Discovery Through Behavior Modeling and Recommendation - Google Analytics Trace Data
                        
                    
    
This dataset contains trace data describing user interactions with the Inter-university Consortium for Political and Social Research (ICPSR) website. We gathered site usage data from Google Analytics. We focused our analysis on user sessions, which are groups of interactions with resources (e.g., website pages) and events initiated by users. ICPSR tracks a subset of user interactions other than page views through event triggers. We analyzed sequences of interactions with resources, including the ICPSR data catalog, variable index, data citations collected in the ICPSR Bibliography of Data-related Literature, and topical information about project archives. As part of our analysis, we calculated the total number of unique sessions and page views in the study period, which ran from September 1, 2012, to 2016. ICPSR's website was updated and relaunched in September 2012 with new search functionality, including a Social Science Variables Database (SSVD) tool. ICPSR then reorganized its website and changed its analytics collection procedures in 2016, which marks the cutoff for our analysis. The data are relevant for two reasons. First, updates to the ICPSR website during the study period focused only on front-end design rather than the website's search functionality. Second, the core features of the website over the period we examined (e.g., faceted and variable search, standardized metadata, the use of controlled vocabularies, and restricted data applications) are shared with other major data archives, making it likely that the trends in user behavior we report are generalizable.
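As a rough illustration of the session-level counts described above, the following sketch tallies unique sessions and page views from a small list of trace rows. The row layout, field names, and sample values are assumptions made for this example only; they are not the dataset's actual schema or the authors' analysis code.

```python
from collections import defaultdict

# Hypothetical trace rows: (session_id, hit_type, resource).
# The layout and values are illustrative, not the real export format.
hits = [
    ("s1", "pageview", "/search/studies"),
    ("s1", "event", "download-click"),
    ("s1", "pageview", "/studies/0001"),
    ("s2", "pageview", "/variables"),
    ("s3", "pageview", "/search/studies"),
    ("s3", "pageview", "/bibliography"),
]


def summarize(rows):
    """Return (unique session count, total page views, page views per session)."""
    per_session = defaultdict(int)
    for session_id, hit_type, _resource in rows:
        # Events are kept in the session but do not count as page views.
        per_session[session_id] += hit_type == "pageview"
    return len(per_session), sum(per_session.values()), dict(per_session)


n_sessions, n_pageviews, per_session = summarize(hits)
print(n_sessions, n_pageviews)
```

Grouping by session first keeps events (interactions other than page views) attached to their session without inflating the page view totals.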
                            - Award ID(s):
- 2121789
- PAR ID:
- 10574919
- Publisher / Repository:
- ICPSR - Inter-university Consortium for Political and Social Research
- Date Published:
- Subject(s) / Keyword(s):
- information retrieval data archiving search web analytics
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            Social scientists increasingly share data so others can evaluate, replicate, and extend their research. To understand the process of data discovery as a precursor to data use, we study prospective users’ interactions with archived data. We gathered data for 98,000 user sessions initiated at a large social science data archive, the Inter-university Consortium for Political and Social Research (ICPSR). Our data reflect four years (2012-16) of users’ interactions with archival resources, including a data catalog, study-level metadata, variables, and publications that cite nearly 10,000 datasets. We constructed a network of user interactions linking website landing (e.g., site entrances) to exit pages, from which we identified three types of paths that users take through the research data archive: direct, orienting, and scenic. We also interpreted points of failure (e.g., drop-offs) and recurring behaviors (e.g., sensemaking) that support or impede data discovery along search paths. We articulate strategies that users adopt as they navigate data search and suggest ways to enhance the accessibility of data, metadata, and the systems that organize each.
- 
            Abstract: Many theories of human information behavior (HIB) assume that information objects are in text document format. This paper argues that four important HIB theories are insufficient for describing users' search strategies for data because of assumptions about the attributes of objects that users seek. We first review and compare four HIB theories: Bates' berrypicking, Marchionini's electronic information search, Dervin's sense‐making, and Meho and Tibbo's social scientist information‐seeking. All four theories assume that information‐seekers search for text documents. Next, we compare these theories to search behavior by analyzing Google Analytics data from the Inter‐university Consortium for Political and Social Research (ICPSR). Users took direct, scenic, and orienting paths when searching for data. We also interviewed ICPSR users (n = 20), and they said they needed dataset documentation and contextual information to find data. However, Dervin's sense‐making alone cannot explain the information‐seeking behaviors that we observed. Instead, what mattered most were object attributes determined by the type of information that users sought (i.e., data, not documents). We conclude by suggesting an alternative frame for building user‐centered data discovery tools.
- 
            This paper describes a machine learning approach for annotating and analyzing data curation work logs at ICPSR, a large social sciences data archive. The systems we studied track curation work and coordinate team decision-making at ICPSR. Archive staff use these systems to organize, prioritize, and document curation work done on datasets, making them promising resources for studying curation work and its impact on data reuse, especially in combination with data usage analytics. A key challenge, however, is classifying similar activities so that they can be measured and associated with impact metrics. This paper contributes: 1) a set of data curation activities; 2) a computational model for identifying curation actions in work log descriptions; and 3) an analysis of frequent data curation activities at ICPSR over time. We first propose a set of data curation actions to help us analyze the impact of curation work. We then use this set to annotate a set of data curation logs, which contain records of data transformations and project management decisions completed by archive staff. Finally, we train a text classifier to detect the frequency of curation actions in a large set of work logs. Our approach supports the analysis of curation work documented in work log systems as an important step toward studying the relationship between research data curation and data reuse.
- 
            Data users need relevant context and research expertise to effectively search for and identify relevant datasets. Leading data providers, such as the Inter‐university Consortium for Political and Social Research (ICPSR), offer standardized metadata and search tools to support data search. Metadata standards emphasize the machine‐readability of data and its documentation. There are opportunities to enhance dataset search by improving users' ability to learn about, and make sense of, information about data. Prior research has shown that context and expertise are two main barriers users face in effectively searching for, evaluating, and deciding whether to reuse data. In this paper, we propose a novel chatbot‐based search system, DataChat, that leverages a graph database and a large language model to provide novel ways for users to interact with and search for research data. DataChat complements data archives' and institutional repositories' ongoing efforts to curate, preserve, and share research data for reuse by making it easier for users to explore and learn about available research data.
 An official website of the United States government
