Historical data sources, like medical records or biological collections, consist of unstructured, heterogeneous content: handwritten text, fonts of different sizes and types, and text overlapped with lines, images, stamps, and sketches. The information these documents provide is valuable, both from a historical perspective and because we can learn from it. Automatic digitization of such historical documents is a complex machine-learning process that usually produces poor results, requiring costly interventions by experts, who have to transcribe and interpret the content. This paper describes hybrid (human- and machine-intelligent) workflows for scientific data extraction, combining machine-learning and crowdsourcing software elements. Our results demonstrate that the mix of human and machine processes has advantages in data-extraction time and quality when compared to a machine-only workflow. More specifically, we show that OCRopus and Tesseract, two widely used open-source Optical Character Recognition (OCR) tools, improve their accuracy by more than 42% when text areas are cropped by humans prior to OCR, while the total time can increase or decrease depending on the OCR engine selected. The digitization of 400 images, with Entomology, Bryophyte, and Lichen specimens, is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd-cropped labels (hybrid), processing crowd-cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus.
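The crop-then-recognize step described above can be sketched generically. This is a minimal illustration, not the authors' implementation: `boxes` stands in for the crowd-supplied crop coordinates, and `ocr_fn` for whichever engine (e.g. a Tesseract or OCRopus wrapper) is plugged in.

```python
import numpy as np

def ocr_cropped_regions(image, boxes, ocr_fn):
    """Run OCR on crowd-cropped regions instead of the full specimen image.

    image:  2-D numpy array (grayscale page scan)
    boxes:  list of (top, left, bottom, right) crops supplied by the crowd
    ocr_fn: any engine callable, e.g. a wrapper around Tesseract or OCRopus
    """
    texts = []
    for top, left, bottom, right in boxes:
        crop = image[top:bottom, left:right]  # isolate one label or field
        texts.append(ocr_fn(crop))
    return "\n".join(texts)
```

In the hybrid workflows, `boxes` would come from crowdsourcing tasks rather than from an automatic detector, which is what drives the reported accuracy gain.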
A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition
The automatic evaluation and extraction of financial documents is a key process for business efficiency. Most of the extraction relies on Optical Character Recognition (OCR), whose outcome depends on the quality of the document image. The image data fed to automated systems can be of unreliable quality: inherently low-resolution, or downsampled and compressed by a transmitting program. In this paper, we illustrate a novel Gaussian Process (GP) upsampling model for improving the OCR process and extraction by upsampling low-resolution documents.
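This is not the paper's model, only a generic sketch of GP-based image upsampling in numpy under simple assumptions (squared-exponential kernel, fixed length scale): treat the known low-resolution pixels as training points and predict intensities on a finer coordinate grid.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential kernel between coordinate sets A (n,2) and B (m,2).
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_upsample(img, factor=2, length_scale=1.0, noise=1e-4):
    """Upsample a grayscale image by GP regression on pixel coordinates."""
    h, w = img.shape
    ys, xs = np.mgrid[0:h, 0:w]
    X_train = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    y_train = img.ravel().astype(float)
    # Fine grid sampled at sub-pixel spacing.
    fy, fx = np.mgrid[0:h:1 / factor, 0:w:1 / factor]
    X_test = np.column_stack([fy.ravel(), fx.ravel()])
    # GP posterior mean: K_* (K + sigma^2 I)^{-1} y
    K = rbf_kernel(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train, length_scale)
    mean = K_star @ np.linalg.solve(K, y_train)
    return mean.reshape(h * factor, w * factor)
```

The length scale and noise level are illustrative hyperparameters; a real document-upsampling model would tune them (and likely a richer kernel) to the image statistics.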
- Award ID(s):
- 1908834
- PAR ID:
- 10280214
- Date Published:
- Journal Name:
- International Symposium on Visual Computing
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Capturing document images with hand-held devices in unstructured environments is a common practice nowadays. However, "casual" photos of documents are usually unsuitable for automatic information extraction, mainly due to physical distortion of the document paper, as well as varying camera positions and illumination conditions. In this work, we propose DewarpNet, a deep-learning approach for document image unwarping from a single image. Our insight is that the 3D geometry of the document not only determines the warping of its texture but also causes the illumination effects. Therefore, our novelty resides in the explicit modeling of the 3D shape of the document paper in an end-to-end pipeline. Also, we contribute the largest and most comprehensive dataset for document image unwarping to date, Doc3D. This dataset features multiple ground-truth annotations, including 3D shape, surface normals, UV map, and albedo image. Training with Doc3D, we demonstrate state-of-the-art performance for DewarpNet with extensive qualitative and quantitative evaluations. Our network also significantly improves OCR performance on captured document images, decreasing the character error rate by 42% on average. Both the code and the dataset are released.
-
Information Extraction (IE) from imaged text is affected by the output quality of the text-recognition process. Misspelled or missing text may propagate errors or even preclude IE. Low confidence in automated methods is the reason why some IE projects rely exclusively on human work (crowdsourcing). That is the case for biological collections (biocollections), where the metadata (Darwin Core terms) found in digitized labels are transcribed by citizen scientists. In this paper, we present an approach to reduce the number of crowdsourcing tasks required to obtain the transcription of the text found in biocollection images. By using an ensemble of Optical Character Recognition (OCR) engines (OCRopus, Tesseract, and the Google Cloud OCR), our approach identifies the lines and characters that have a high probability of being correct, limiting crowdsourced transcription to low-confidence fragments of text. The number of lines to transcribe is further reduced through hybrid human-machine crowdsourcing, where the output of the OCR ensemble is used as the first "human" transcription in the redundant crowdsourcing process. Our approach was tested on six biocollections (2,966 images), reducing the number of crowdsourcing tasks by 76% (58% due to lines accepted by the OCR ensemble and about 18% due to accelerated convergence when using hybrid crowdsourcing). The automatically extracted text presented a character error rate of 0.001 (0.1%).
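The line-acceptance idea behind the ensemble can be sketched as follows. This is a simplified stand-in, not the authors' pipeline: it assumes the engines' outputs are already aligned by line index and uses exact agreement as the confidence signal, where the real system also scores individual characters.

```python
from collections import Counter

def triage_lines(engine_outputs, min_votes=2):
    """Split OCR lines into auto-accepted text vs a crowdsourcing queue.

    engine_outputs: list of per-engine line lists, aligned by line index.
    A line is accepted when at least `min_votes` engines emit the same text;
    the rest is left for crowdsourced transcription.
    """
    accepted, needs_review = [], []
    for idx, candidates in enumerate(zip(*engine_outputs)):
        text, votes = Counter(candidates).most_common(1)[0]
        if votes >= min_votes:
            accepted.append((idx, text))
        else:
            needs_review.append((idx, candidates))
    return accepted, needs_review
```

Everything placed in `needs_review` would become a crowdsourcing task, which is where the reported 76% reduction comes from: most lines never reach the crowd.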
-
Recent supervised point cloud upsampling methods are restricted by the size of training data and are limited in terms of covering all object shapes. Besides the challenges faced due to data acquisition, the networks also struggle to generalize to unseen records. In this paper, we present an internal point cloud upsampling approach at a holistic level, referred to as "Zero-Shot" Point Cloud Upsampling (ZSPU). Our approach is data agnostic and relies solely on the internal information provided by a particular point cloud, without patching, in both self-training and testing phases. This single-stream design significantly reduces the training time by learning the relation between low-resolution (LR) point clouds and their high (original) resolution (HR) counterparts. This association will then provide super-resolution (SR) outputs when original point clouds are loaded as input. ZSPU achieves competitive or superior quantitative and qualitative performance on benchmark datasets when compared with other upsampling methods.
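The self-supervision setup the abstract describes, learning from a single cloud's own LR/HR relation, can be illustrated with a hypothetical helper (not the authors' code): the original cloud serves as the HR target and a random subsample plays the LR input.

```python
import numpy as np

def make_self_pairs(points, keep=0.5, rng=None):
    """Build one (LR, HR) training pair from a single point cloud.

    points: (n, 3) array; the original cloud is the HR target.
    keep:   fraction of points retained for the LR input.
    """
    rng = np.random.default_rng(rng)
    n = len(points)
    idx = rng.choice(n, size=int(n * keep), replace=False)
    return points[idx], points  # (LR input, HR target)
```

A network trained on such pairs from one cloud can then be fed the original cloud itself to produce a denser, super-resolved output, without any external training set.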
-
Generative models learned via deep learning can be used as priors in under-determined inverse problems, including imaging from a sparse set of measurements. In this paper, we present MrSARP, a novel hierarchical deep generative model for SAR imagery that can jointly synthesize SAR images of a target at different resolutions. MrSARP is trained in conjunction with a critic that jointly scores multi-resolution images to decide whether they are realistic images of a target at different resolutions. We show how this deep generative model can be used to retrieve the high-spatial-resolution image from low-resolution images of the same target. The cost function of the generator is modified to improve its ability to retrieve the input parameters for a given set of resolution images. We evaluate the model's performance on simulated data using three standard error metrics for super-resolution, and compare it to upsampling- and sparsity-based image super-resolution approaches.