Title: Information Extraction from Text Regions with Complex Tabular Structure.
Recent innovations in layout analysis of document images have significantly improved our ability to identify text and non-text regions. However, extracting information from within text regions remains quite challenging because a text region may have a complex structure. In this paper, we present a new dataset with complex tabular structure and propose new methods to robustly retrieve information from such complex text regions.
Award ID(s):
1823616
PAR ID:
10209529
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Conference on Neural Information Processing Systems
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Text information in scanned documents becomes accessible only when extracted and interpreted by a text recognizer. For a recognizer to work successfully, it must have detailed location information about the regions of the document images that it is asked to analyze. It needs to focus on page regions with text, skipping non-text regions that include illustrations or photographs. However, text recognizers do not work as logical analyzers. Logical layout analysis automatically determines the function of a document text region, that is, it labels each region as a title, paragraph, caption, and so on, and thus is an essential part of a document understanding system. In the past, rule-based algorithms have been used to conduct logical layout analysis, using data sets of limited size. Here we instead focus on supervised learning methods for logical layout analysis. We describe LABA, a system based on multiple support vector machines that performs logical Layout Analysis of scanned Book pages in Arabic. The system detects the function of a text region based on the analysis of various image features and a voting mechanism. For a baseline comparison, we implemented an older but state-of-the-art neural network method. We evaluated LABA using a data set of scanned pages from illustrated Arabic books and obtained high recall and precision values. We also found that the F-measure of LABA is higher than that of the state-of-the-art method for five of the six tested classes.
  2. Despite decades of effort, the morphological structure of the Milky Way remains hidden behind dust extinction, small-number statistics, and complicated datasets. HII regions, the volumes of ionized gas surrounding recently formed massive stars, are a classic tracer of spiral arms in galaxies. Over the past decade, the HII Region Discovery Surveys have nearly tripled the number of known Galactic HII regions. With the new Galaxy-wide flux-limited sample of Milky Way HII regions, we are poised to revolutionize our understanding of spiral structure across the Galactic disk. Traditional methods of fitting Galactic structure models to the three-dimensional positions of these nebulae are impossible, however, since most Galactic HII regions lack accurate distance determinations. We are developing a novel machine learning approach that uses simulation-based inference to fit complex models of Galactic structure to the complicated position-position-velocity HII region dataset, thereby removing the need for accurate distances. Using simulated observations, we demonstrate the efficacy of this new technique and its potential to reveal the structure of spiral arms across the Milky Way.
  3. State-of-the-art text spotting systems typically aim to detect isolated words or word-by-word text in images of natural scenes and ignore the semantic coherence within a region of text. However, when interpreted together, seemingly isolated words may be easier to recognize. On this basis, we propose a novel "semantic-based text recognition" (STR) deep learning model that reads text in images with the help of understanding context. STR consists of several modules. We introduce the Text Grouping and Arranging (TGA) algorithm to connect and order isolated text regions. A text-recognition network interprets isolated words. Benefiting from semantic information, a sequence-to-sequence network model efficiently corrects inaccurate and uncertain phrases produced earlier in the STR pipeline. We present experiments on two new distinct datasets that contain scanned catalog images of interior designs and photographs of protesters with hand-written signs, respectively. Our results show that our STR model outperforms a baseline method that uses state-of-the-art single-word recognition techniques on both datasets. STR yields a high accuracy rate of 90% on the catalog images and 71% on the more difficult protest images, suggesting its generality in recognizing text. 
  4. Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. Existing research has generally focused on generating SQL statements from text queries, while the broader challenge lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains additional query types that are limited in prior text-to-SQL datasets, notably temporal-related queries. Our dataset is sourced from a smart building’s IoT ecosystem exploring sensor read and network traffic data. Second, our dataset allows two-stage processing, where the returned data (network traffic) from a generated SQL query can be categorized as malicious or not. Our results show that joint training to query and infer information about the data improves overall text-to-SQL performance, nearly matching that of substantially larger models. We also show that current large language models (e.g., GPT-3.5) struggle to infer new information about returned data (i.e., they are bad at tabular data understanding); thus, our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs.
  5. Given limited seismic coverage of the lowermost mantle, less than one-fourth of the core-mantle boundary (CMB) has been surveyed for the presence of ultra-low velocity zones (ULVZs). Investigations that sample the CMB with new geometries are therefore important to further our understanding of ULVZ origins and their potential connection to other deep Earth processes. Using core-reflected ScP waves recorded by the recently deployed Transantarctic Mountains Northern Network in Antarctica, our study aims to expand ULVZ investigations in the southern hemisphere. Our dataset samples the CMB in the vicinity of New Zealand, providing coverage between an area to the northeast, where ULVZ structure has been previously identified, and another region to the south, where prior evidence for a ULVZ was inconclusive. This area is of particular interest because the data sample across the boundary of the Pacific Large Low Shear Velocity Province (LLSVP). The Weddell Sea region near Antarctica is also well sampled, providing new information on a region that has not been previously studied. A correlative scheme between 1-D synthetic seismograms and the observed ScP data demonstrates that ULVZs are required in both study regions. Modeling uncertainties limit our ability to conclusively determine ULVZ characteristics but also likely indicate more complex 3-D structure. Given that ULVZs are detected within, along the edge of, and far from the Pacific LLSVP, our results support the hypothesis that ULVZs are compositionally distinct from the surrounding mantle. ULVZs may be ubiquitous along the CMB; however, they may be thinner in many regions than can be resolved by current methods. Mantle convection currents may sweep the ULVZs into thicker piles in some areas, pushing these anomalies toward the boundaries of LLSVPs.