skip to main content


Title: Semantically-Enriched Search Engine for Geoportals: A Case Study with ArcGIS Online
Abstract. Many geoportals such as ArcGIS Online are established with the goal of improving geospatial data reusability and achieving intelligent knowledge discovery. However, according to previous research, most of the existing geoportals adopt Lucene-based techniques to achieve their core search functionality, which has a limited ability to capture the user’s search intentions. To better understand a user’s search intention, query expansion can be used to enrich the user’s query by adding semantically similar terms. In the context of geoportals and geographic information retrieval, we advocate the idea of semantically enriching a user’s query from both geospatial and thematic perspectives. In the geospatial aspect, we propose to enrich a query by using both place partonomy and distance decay. In terms of the thematic aspect, concept expansion and embedding-based document similarity are used to infer the implicit information hidden in a user’s query. This semantic query expansion framework is implemented as a semantically-enriched search engine using ArcGIS Online as a case study. A benchmark dataset is constructed to evaluate the proposed framework. Our evaluation results show that the proposed semantic query expansion framework is very effective in capturing a user’s search intention and significantly outperforms a well-established baseline – Lucene’s practical scoring function – with more than 3.0 increments in DCG@K (K=3,5,10).  more » « less
Award ID(s):
1936677
NSF-PAR ID:
10191829
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
AGILE: GIScience Series
Volume:
1
ISSN:
2700-8150
Page Range / eLocation ID:
1 to 17
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. One longstanding complication with Earth data discovery involves understanding a user’s search intent from the input query. Most of the geospatial data portals use keyword-based match to search data. Little attention has focused on the spatial and temporal information from a query or understanding the query with ontology. No research in the geospatial domain has investigated user queries in a systematic way. Here, we propose a query understanding framework and apply it to fill the gap by better interpreting a user’s search intent for Earth data search engines and adopting knowledge that was mined from metadata and user query logs. The proposed query understanding tool contains four components: spatial and temporal parsing; concept recognition; Named Entity Recognition (NER); and, semantic query expansion. Spatial and temporal parsing detects the spatial bounding box and temporal range from a query. Concept recognition isolates clauses from free text and provides the search engine phrases instead of a list of words. Name entity recognition detects entities from the query, which inform the search engine to query the entities detected. The semantic query expansion module expands the original query by adding synonyms and acronyms to phrases in the query that was discovered from Web usage data and metadata. The four modules interact to parse a user’s query from multiple perspectives, with the goal of understanding the consumer’s quest intent for data. As a proof-of-concept, the framework is applied to oceanographic data discovery. It is demonstrated that the proposed framework accurately captures a user’s intent. 
    more » « less
  2. Knowledge graph question answering aims to identify answers of the query according to the facts in the knowledge graph. In the vast majority of the existing works, the input queries are considered perfect and can precisely express the user’s query intention. However, in reality, input queries might be ambiguous and elusive which only contain a limited amount of information. Directly answering these ambiguous queries may yield unwanted answers and deteriorate user experience. In this paper, we propose PReFNet which focuses on answering ambiguous queries with pseudo relevance feedback on knowledge graphs. In order to leverage the hidden (pseudo) relevance information existed in the results that are initially returned from a given query, PReFNet treats the top-k returned candidate answers as a set of most relevant answers, and uses variational Bayesian inference to infer user’s query intention. To boost the quality of the inferred queries, a neighborhood embedding based VGAE model is used to prune inferior inferred queries. The inferred high quality queries will be returned to the users to help them search with ease. Moreover, all the high-quality candidate nodes will be re-ranked according to the inferred queries. The experiment results show that our proposed method can recommend high-quality query graphs to users and improve the question answering accuracy. 
    more » « less
  3. Wren, Jonathan (Ed.)
    Abstract Motivation Substance abuse constitutes one of the major contemporary health epidemics. Recently, the use of social media platforms has garnered interest as a novel source of data for drug addiction epidemiology. Often however, the language used in such forums comprises slang and jargon. Currently, there are no publicly available resources to automatically analyse the esoteric language-use in the social media drug-use sub-culture. This lacunae introduces critical challenges for interpreting, sensemaking and modeling of addiction epidemiology using social media. Results Drug-Use Insights (DUI) is a public and open-source web application to address the aforementioned deficiency. DUI is underlined by a hierarchical taxonomy encompassing 108 different addiction related categories consisting of over 9,000 terms, where each category encompasses a set of semantically related terms. These categories and terms were established by utilizing thematic analysis in conjunction with term embeddings generated from 7,472,545 Reddit posts made by 1,402,017 redditors. Given post(s) from social media forums such as Reddit and Twitter, DUI can be used foremost to identify constituent terms related to drug use. Furthermore, the DUI categories and integrated visualization tools can be leveraged for semantic- and exploratory analysis. To the best of our knowledge, DUI utilizes the largest number of substance use and recovery social media posts used in a study and represents the first significant online taxonomy of drug abuse terminology. Availability The DUI web server and source code are available at: http://haddock9.sfsu.edu/insight/ Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  4. Query understanding plays a key role in exploring users’ search intents. However, it is inherently challenging since it needs to capture semantic information from short and ambiguous queries and often requires massive task-specific labeled data. In recent years, pre-trained language models (PLMs) have advanced various natural language processing tasks because they can extract general semantic information from large-scale corpora. However, directly applying them to query understanding is sub-optimal because existing strategies rarely consider to boost the search performance. On the other hand, search logs contain user clicks between queries and urls that provide rich users’ search behavioral information on queries beyond their content. Therefore, in this paper, we aim to fill this gap by exploring search logs. In particular, we propose a novel graph-enhanced pre-training framework, GE-BERT, which leverages both query content and the query graph to capture both semantic information and users’ search behavioral information of queries. Extensive experiments on offline and online tasks have demonstrated the effectiveness of the proposed framework. 
    more » « less
  5. An abundance of biomedical data is generated in the form of clinical notes, reports, and research articles available online. This data holds valuable information that requires extraction, retrieval, and transformation into actionable knowledge. However, this information has various access challenges due to the need for precise machine-interpretable semantic metadata required by search engines. Despite search engines' efforts to interpret the semantics information, they still struggle to index, search, and retrieve relevant information accurately. To address these challenges, we propose a novel graph-based semantic knowledge-sharing approach to enhance the quality of biomedical semantic annotation by engaging biomedical domain experts. In this approach, entities in the knowledge-sharing environment are interlinked and play critical roles. Authorial queries can be posted on the "Knowledge Cafe," and community experts can provide recommendations for semantic annotations. The community can further validate and evaluate the expert responses through a voting scheme resulting in a transformed "Knowledge Cafe" environment that functions as a knowledge graph with semantically linked entities. We evaluated the proposed approach through a series of scenarios, resulting in precision, recall, F1-score, and accuracy assessment matrices. Our results showed an acceptable level of accuracy at approximately 90%. The source code for "Semantically" is freely available at: https://github.com/bukharilab/Semantically 
    more » « less