skip to main content


Title: Measuring the Importance of User-Generated Content to Search Engines
Search engines are some of the most popular and profitable intelligent technologies in existence. Recent research, however, has suggested that search engines may be surprisingly dependent on user-created content like Wikipedia articles to address user information needs. In this paper, we perform a rigorous audit of the extent to which Google leverages Wikipedia and other user-generated content to respond to queries. Analyzing results for six types of important queries (e.g. most popular, trending, expensive advertising), we observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types. More generally, our results provide empirical information to inform a nascent but rapidly-growing debate surrounding a highly consequential question: Do users provide enough value to intelligent technologies that they should receive more of the economic benefits from intelligent technologies?  more » « less
Award ID(s):
1815507 1707296 1526988
NSF-PAR ID:
10109461
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 2019 International AAAI Conference on Web and Social Media
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A growing body of work has highlighted the important role that Wikipedia's volunteer-created content plays in helping search engines achieve their core goal of addressing the information needs of hundreds of millions of people. In this paper, we report the results of an investigation into the incidence of Wikipedia links in search engine results pages (SERPs). Our results extend prior work by considering three U.S. search engines, simulating both mobile and desktop devices, and using a spatial analysis approach designed to study modern SERPs that are no longer just "ten blue links". We find that Wikipedia links are extremely common in important search contexts, appearing in 67-84% of desktop SERPs for common and trending queries, but less often for medical queries. Furthermore, we observe that Wikipedia links often appear in "Knowledge Panel" SERP elements and are in positions visible to users without scrolling, although Wikipedia appears less often and in less prominent positions on mobile devices. Our findings reinforce the complementary notions that (1) Wikipedia content and research has major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers. 
    more » « less
  2. Modern web search engines exploit users' search history to personalize search results, with a goal of improving their service utility on a per-user basis. But it is this very dimension that leads to the risk of privacy infringement and raises serious public concerns. In this work, we propose a client-centered intent-aware query obfuscation solution for protecting user privacy in a personalized web search scenario. In our solution, each user query is submitted with l additional cover queries and corresponding clicks, which act as decoys to mask users' genuine search intent from a search engine. The cover queries are sequentially sampled from a set of hierarchically organized language models to ensure the coherency of fake search intents in a cover search task. Our approach emphasizes the plausibility of generated cover queries, not only to the current genuine query but also to previous queries in the same task, to increase the complexity for a search engine to identify a user's true intent. We also develop two new metrics from an information theoretic perspective to evaluate the effectiveness of provided privacy protection. Comprehensive experiment comparisons with state-of-the-art query obfuscation techniques are performed on the public AOL search log, and the propitious results substantiate the effectiveness of our solution. 
    more » « less
  3. Analyzing queries from search engines and intelligent assistants is difficult. A key challenge is organizing queries into interpretable, context-preserving, representative, and flexible groups. We present structural templates, abstract queries that replace tokens with their linguistic feature forms, as a query grouping method. The templates allow analysts to create query groups with structural similarity at different granularities. We introduce Tempura, an interactive tool that lets analysts explore a query dataset with structural templates. Tempura summarizes a query dataset by selecting a representative subset of templates to show the query distribution. The tool also helps analysts navigate the template space by suggesting related templates likely to yield further explorations. Our user study shows that Tempura helps analysts examine the distribution of a query dataset, find labeling errors, and discover model error patterns and outliers. 
    more » « less
  4. Query understanding plays a key role in exploring users’ search intents. However, it is inherently challenging since it needs to capture semantic information from short and ambiguous queries and often requires massive task-specific labeled data. In recent years, pre-trained language models (PLMs) have advanced various natural language processing tasks because they can extract general semantic information from large-scale corpora. However, directly applying them to query understanding is sub-optimal because existing strategies rarely consider to boost the search performance. On the other hand, search logs contain user clicks between queries and urls that provide rich users’ search behavioral information on queries beyond their content. Therefore, in this paper, we aim to fill this gap by exploring search logs. In particular, we propose a novel graph-enhanced pre-training framework, GE-BERT, which leverages both query content and the query graph to capture both semantic information and users’ search behavioral information of queries. Extensive experiments on offline and online tasks have demonstrated the effectiveness of the proposed framework. 
    more » « less
  5. A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match with description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which not only considers the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well. 
    more » « less