Title: A Deeper Investigation of the Importance of Wikipedia Links to Search Engine Results
A growing body of work has highlighted the important role that Wikipedia's volunteer-created content plays in helping search engines achieve their core goal of addressing the information needs of hundreds of millions of people. In this paper, we report the results of an investigation into the incidence of Wikipedia links in search engine results pages (SERPs). Our results extend prior work by considering three U.S. search engines, simulating both mobile and desktop devices, and using a spatial analysis approach designed to study modern SERPs that are no longer just "ten blue links". We find that Wikipedia links are extremely common in important search contexts, appearing in 67-84% of desktop SERPs for common and trending queries, but less often for medical queries. Furthermore, we observe that Wikipedia links often appear in "Knowledge Panel" SERP elements and are in positions visible to users without scrolling, although Wikipedia appears less often and in less prominent positions on mobile devices. Our findings reinforce the complementary notions that (1) Wikipedia content and research have a major impact outside of the Wikipedia domain and (2) powerful technologies like search engines are highly reliant on free content created by volunteers.
Award ID(s):
1815507
PAR ID:
10404672
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Proceedings of the ACM on Human-Computer Interaction
Volume:
5
Issue:
CSCW1
ISSN:
2573-0142
Page Range / eLocation ID:
1 to 15
Sponsoring Org:
National Science Foundation
More Like this
  1. Search engines are some of the most popular and profitable intelligent technologies in existence. Recent research, however, has suggested that search engines may be surprisingly dependent on user-created content like Wikipedia articles to address user information needs. In this paper, we perform a rigorous audit of the extent to which Google leverages Wikipedia and other user-generated content to respond to queries. Analyzing results for six types of important queries (e.g. most popular, trending, expensive advertising), we observe that Wikipedia appears in over 80% of results pages for some query types and is by far the most prevalent individual content source across all query types. More generally, our results provide empirical information to inform a nascent but rapidly-growing debate surrounding a highly consequential question: Do users provide enough value to intelligent technologies that they should receive more of the economic benefits from intelligent technologies? 
  2. We present a new search method for mathematical formulas based on Operator Trees (OPTs) representing the application of operators to operands. Our method provides (1) a simple indexing scheme using OPT leaf-root paths, (2) practical matching of the K largest common subexpressions, and (3) scoring matched OPT subtrees by counting nodes corresponding to visible symbols, weighting operators lower than operands. Using the largest common subexpression (K=1), we outperform existing formula search engines for non-wildcard queries on the NTCIR-12 Wikipedia Formula Browsing Task. Stronger results are obtained when using additional subexpressions for scoring. Without parallelization or pruning, our system has practical execution times with low variance when compared to other state-of-the-art formula search engines. 
  3. Search engines are perceived as a reliable source for general information needs. However, finding the answer to medical questions using search engines can be challenging for an ordinary user. Content can be biased and results may present different opinions. In addition, interpreting medically related content can be difficult for users with no medical background. All of these can lead users to incorrect conclusions regarding health-related questions. In this work we address this problem from two perspectives. First, to gain insight into users' ability to correctly answer medical questions using search engines, we conduct a comprehensive user study. We show that for questions regarding medical treatment effectiveness, participants struggle to find the correct answer and are prone to overestimating treatment effectiveness. We analyze participants' demographic traits according to age and education level and show that this problem persists in all demographic groups. We then propose a semi-automatic machine learning approach to find the correct answer to queries on medical treatment effectiveness as it is viewed by the medical community. The model relies on the opinions presented in medical papers related to the queries, as well as features representing their impact. We show that, compared to human behaviour, our method is less prone to bias. We compare various configurations of our inference model and a baseline method that determines treatment effectiveness based solely on the opinion of medical papers. The results bolster our confidence that our approach can pave the way to developing automatic bias-free tools that can help mediate complex health-related content to users.
  4. A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match with description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which not only considers the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well. 
  5. Author Name Disambiguation (AND) is the task of clustering unique author names from publication records in scholarly or related databases. Although AND has been extensively studied and has served as an important preprocessing step for several tasks (e.g. calculating bibliometrics and scientometrics for authors), there are few publicly available tools for disambiguation in large-scale scholarly databases. Furthermore, most of the disambiguated data is embedded within the search engines of the scholarly databases, and existing application programming interfaces (APIs) have limited features and are often unavailable to users for various reasons. This makes it difficult for researchers and developers to use the data for various applications (e.g. author search) or research. Here, we design a novel, web-based, RESTful API for searching disambiguated authors, using the PubMed database as a sample application. We offer two types of queries, attribute-based and record-based, which serve different purposes. Attribute-based queries retrieve authors with the attributes available in the database. We study different search engines to find the most appropriate one for processing attribute-based queries. Record-based queries retrieve authors that are most likely to have written a query publication provided by a user. To accelerate record-based queries, we develop a novel algorithm that performs fast record-to-cluster matching. We show that our algorithm can accelerate the query by a factor of 4.01 compared to a baseline naive approach.