Abstract Motivation: Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields. Results: We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution’s benefits to biomedical search. Availability and implementation: A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.
MouseScholar: Evaluating an Image+Text Search System for Biocuration
Biocuration is the process of analyzing biological or biomedical articles to organize biological data into data repositories using taxonomies and ontologies. Due to the expanding number of articles and the relatively small number of biocurators, automation is desired to improve the workflow of assessing which articles are worth curating. As figures convey essential information, automatically integrating images may improve curation. In this work, we instantiate and evaluate a first-of-its-kind hybrid image+text document search system for biocuration. The system, MouseScholar, leverages an image modality taxonomy derived in collaboration with biocurators, together with figure segmentation and classifier components as a back end, and a streamlined front-end interface to search and present document results. We formally evaluated the system with ten biocurators on a mouse genome informatics biocuration dataset and collected feedback. The results demonstrate the benefits of blending text and image information when presenting scientific articles for biocuration.
- Award ID(s): 2320261
- PAR ID: 10536556
- Publisher / Repository: IEEE Xplore
- Date Published:
- ISBN: 979-8-3503-3748-8
- Page Range / eLocation ID: 1473-1480
- Subject(s) / Keyword(s): document search biocuration
- Format(s): Medium: X
- Location: Istanbul, Turkiye
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract In the biomedical domain, taxonomies organize the acquisition modalities of scientific images in hierarchical structures. Such taxonomies leverage large sets of correct image labels and provide essential information about the importance of a scientific publication, which could then be used in biocuration tasks. However, the hierarchical nature of the labels, the overhead of processing images, the absence or incompleteness of labelled data and the expertise required to label this type of data impede the creation of useful datasets for biocuration. From a multi‐year collaboration with biocurators and text‐mining researchers, we derive an iterative visual analytics and active learning (AL) strategy to address these challenges. We implement this strategy in a system called BI‐LAVA—Biocuration with Hierarchical Image Labelling through Active Learning and Visual Analytics. BI‐LAVA leverages a small set of image labels, a hierarchical set of image classifiers and AL to help model builders deal with incomplete ground‐truth labels, target a hierarchical taxonomy of image modalities and classify a large pool of unlabelled images. BI‐LAVA's front end uses custom encodings to represent data distributions, taxonomies, image projections and neighbourhoods of image thumbnails, which help model builders explore an unfamiliar image dataset and taxonomy and correct and generate labels. An evaluation with machine learning practitioners shows that our mixed human–machine approach successfully supports domain experts in understanding the characteristics of classes within the taxonomy, as well as validating and improving data quality in labelled and unlabelled collections.
-
Query Biased Summarization (QBS) aims to produce a summary of the documents retrieved against a query to reduce the human effort of inspecting the full-text content of a document. Typical summarization approaches extract a document text snippet that has term overlap with the query and show it to the searcher. While snippets show relevant information in a document, to the best of our knowledge, no existing summarization system shows which relevant concepts are missing from a document. Our study focuses on reducing user effort in finding relevant documents by exposing users to omitted relevant information. To this end, we use a classical approach, DSPApprox, to find terms or phrases relevant to a query. Then we identify which terms or phrases are missing in a document, present them in a search interface, and ask crowd workers to judge document relevance based on snippets and missing information. Experimental results show both benefits and limitations of this approach.
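The missing-information step described in this abstract can be illustrated with a minimal sketch. It assumes the query-relevant terms are already available as a list (`relevant_terms` is a hypothetical input standing in for the output of DSPApprox, which is not implemented here), and it uses simple whole-token matching, which is an assumption of this sketch and not necessarily the matching used in the study.

```python
import re


def missing_terms(document: str, relevant_terms: list[str]) -> list[str]:
    """Return the relevant terms or phrases that do not occur in the document.

    `relevant_terms` is a hypothetical input: in the study it would come from
    a term-selection method such as DSPApprox, which this sketch does not implement.
    """
    # Normalize the document to lowercase alphanumeric tokens separated by single spaces.
    normalized = " ".join(re.findall(r"[a-z0-9]+", document.lower()))
    padded = f" {normalized} "
    missing = []
    for term in relevant_terms:
        key = " ".join(re.findall(r"[a-z0-9]+", term.lower()))
        if not key:
            continue  # skip terms with no alphanumeric content
        # Whole-token (phrase) match: the padded spaces prevent partial-word hits.
        if f" {key} " not in padded:
            missing.append(term)
    return missing


doc = "Coronavirus vaccines reduce severe disease."
terms = ["vaccines", "severe disease", "clinical trial"]
print(missing_terms(doc, terms))
```

A search interface could show the returned list next to the snippet so that a judge sees both what the document contains and what it omits.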
-
This paper introduces a novel generative encoder (GE) framework for generative imaging and image processing tasks like image reconstruction, compression, denoising, inpainting, deblurring, and super-resolution. GE unifies the generative capacity of GANs and the stability of AEs in an optimization framework instead of stacking GANs and AEs into a single network or combining their loss functions as in existing literature. GE provides a novel approach to visualizing relationships between latent spaces and the data space. The GE framework is made up of a pre-training phase and a solving phase. In the former, a GAN with generator $G$ capturing the data distribution of a given image set, and an AE network with encoder $E$ that compresses images following the distribution estimated by $G$, are trained separately, resulting in two latent representations of the data, denoted as the generative and encoding latent spaces respectively. In the solving phase, given a noisy image $x = \mathcal{P}(x^*)$, where $x^*$ is the unknown target image and $\mathcal{P}$ is an operator adding additive, multiplicative, or convolutional noise, or equivalently given such an image $x$ in the compressed domain, i.e., given $m = E(x)$, the two latent spaces are unified by solving the optimization problem $z^* = \underset{z}{\mathrm{argmin}} \|E(G(z))-m\|_2^2+\lambda\|z\|_2^2$, and the image $x^*$ is recovered in a generative way via $\hat{x} := G(z^*)\approx x^*$, where $\lambda>0$ is a hyperparameter. The unification of the two spaces allows improved performance against corresponding GAN and AE networks while visualizing interesting properties in each latent space.
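To make the solving-phase objective concrete, the following toy sketch minimizes the squared error between E(G(z)) and m plus the λ-weighted squared norm of z, in the special case where the generator and encoder are linear maps. This is an illustrative assumption only: in the paper G and E are trained networks, and the random matrices, their dimensions, and the closed-form ridge solution used here are stand-ins, not the authors' method.

```python
import numpy as np


def solve_latent(G: np.ndarray, E: np.ndarray, m: np.ndarray, lam: float) -> np.ndarray:
    """Minimize ||E G z - m||^2 + lam * ||z||^2 for LINEAR maps G and E.

    With linear maps the objective is a ridge regression in z, so the minimizer
    solves the normal equations (M^T M + lam I) z = M^T m with M = E G.
    """
    M = E @ G                      # composite map from latent z to encoded code
    k = M.shape[1]                 # latent dimension
    return np.linalg.solve(M.T @ M + lam * np.eye(k), M.T @ m)


# Toy setup (all values hypothetical): latent dim 3, image dim 8, code dim 4.
rng = np.random.default_rng(0)
G = rng.normal(size=(8, 3))        # stand-in for a trained generator
E = rng.normal(size=(4, 8))        # stand-in for a trained encoder
z_true = rng.normal(size=3)
x_true = G @ z_true                # unknown target image x*
m = E @ x_true                     # observation in the compressed domain

lam = 1e-8
z_star = solve_latent(G, E, m, lam)
x_hat = G @ z_star                 # recovered image, x_hat = G(z*)
print(np.linalg.norm(x_hat - x_true))
```

With nonlinear networks no closed form exists, and the same objective would instead be minimized iteratively, for example by gradient descent on z.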
-
The abundance of scientific articles published and indexed in publicly accessible repositories has spurred the research and development of automated information extraction systems. The output of such systems can be used to assemble large networks capturing the understanding of mechanistic pathways and their interactions as represented in the underlying body of research. We describe a system designed to help researchers search, visualize and interact with biological networks derived via information extraction tools. As input, the system takes a dataset of biological and biochemical interactions automatically generated by an information extraction system and provides an interface designed to search, visualize and interact with the data. The usage paradigm consists of identifying a starting point for a search, then using the data’s network structure by incrementally exploring the immediate neighborhood of the elements displayed by the system.

