Quality-Aware Human-Machine Text Extraction for Biocollections using Ensembles of OCRs

Alzuru, Icaro; Stephens, Rhiannon; Matsunaga, Andrea; Tsugawa, Mauricio; Flemons, Paul; Fortes, Jose A.B.

doi:10.1109/eScience.2019.00020

Citation Details

Quality-Aware Human-Machine Text Extraction for Biocollections using Ensembles of OCRs

Information Extraction (IE) from imaged text is affected by the output quality of the text-recognition process. Misspelled or missing text may propagate errors or even preclude IE. Low confidence in automated methods is the reason why some IE projects rely exclusively on human work (crowdsourcing). That is the case of biological collections (biocollections), where the metadata (Darwin-core Terms) found in digitized labels are transcribed by citizen scientists. In this paper, we present an approach to reduce the number of crowdsourcing tasks required to obtain the transcription of the text found in biocollections' images. By using an ensemble of Optical Character Recognition (OCR) engines - OCRopus, Tesseract, and the Google Cloud OCR - our approach identifies the lines and characters that have a high probability of being correct. This reduces the need for crowdsourced transcription to be done for only low confidence fragments of text. The number of lines to transcribe is also reduced through hybrid human-machine crowdsourcing where the output of the ensemble of OCRs is used as the first "human" transcription of the redundant crowdsourcing process. Our approach was tested in six biocollections (2,966 images), reducing the number of crowdsourcing tasks by 76% (58% due to lines accepted by the ensemble of OCRs and about 18% due to accelerated convergence when using hybrid crowdsourcing). The automatically extracted text presented a character error rate of 0.001 (0.1%). more »

Award ID(s):: 1535086

PAR ID:: 10159714

Author(s) / Creator(s):: Alzuru, Icaro; Stephens, Rhiannon; Matsunaga, Andrea; Tsugawa, Mauricio; Flemons, Paul; Fortes, Jose A.B.

Date Published:: 2019-09-01

Journal Name:: 2019 15th International Conference on Science (eScience), San Diego, CA, USA

Page Range / eLocation ID:: 116 to 125

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
https://doi.org/10.1109/eScience.2019.00020

More Like this