Title: A Large Dataset of Historical Japanese Documents with Complex Layouts
Deep learning-based approaches for automatic document layout analysis and content extraction have the potential to unlock rich information trapped in historical documents on a large scale. One major hurdle is the lack of large datasets for training robust models; in particular, little training data exists for Asian languages. To this end, we present HJDataset, a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders of layout elements. The dataset is constructed through a combination of human and machine effort: a semi-rule-based method extracts the layout elements, and the results are checked by human inspectors. The resulting large-scale dataset is used to provide baseline performance analyses for text region detection using state-of-the-art deep learning models, and we demonstrate the usefulness of the dataset on real-world document digitization tasks.
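For readers who want to explore such annotations programmatically, the following is a minimal sketch of iterating over COCO-style layout annotations with the pycocotools package; the file path is hypothetical, and the assumption that the annotations are distributed in COCO format is ours, not a claim made in the abstract.

    # Minimal sketch: iterate over layout annotations in a COCO-format file.
    # The path "hjdataset/annotations/train.json" is hypothetical.
    from pycocotools.coco import COCO

    coco = COCO("hjdataset/annotations/train.json")

    # The seven layout element types would appear as COCO categories.
    for cat in coco.loadCats(coco.getCatIds()):
        print(cat["id"], cat["name"])

    # Each annotation carries a bounding box ([x, y, w, h]) and, where
    # provided, a segmentation mask for the content region.
    img_id = coco.getImgIds()[0]
    for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
        print(ann["category_id"], ann["bbox"])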
Award ID(s):
1823616
NSF-PAR ID:
10209528
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE/CVF Conference on Computer Vision and Pattern Recognition
Page Range / eLocation ID:
548-559
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and newly emerging categories, instead of the laborious supervised setting, in this paper we focus on the minimally supervised setting, which aims to categorize documents effectively with only a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, and label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus's heterogeneous data sources and enables joint optimization of network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases: a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and the two modules mutually enhance each other by co-training with pooled pseudo labels (see the sketch below). We test our model on two real-world datasets. On a challenging e-commerce product categorization dataset with 683 categories, given only three seed documents per category, our framework achieves an accuracy of about 92%, significantly outperforming all compared methods and coming within 2% of a supervised BERT model trained on about 50K labeled documents.
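    The co-training loop with pooled pseudo labels can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the two modules are stand-ins with scikit-learn-style fit/predict_proba interfaces, and the agreement rule (averaging confidences against a fixed threshold) is ours, not the paper's exact procedure.

        # Hypothetical co-training sketch: two modules pseudo-label the
        # unlabeled pool and mutually enhance each other.
        import numpy as np

        def co_train(text_model, network_model, X_text, X_net,
                     seed_idx, seed_labels, n_rounds=5, threshold=0.9):
            pool_idx, pool_labels = list(seed_idx), list(seed_labels)
            for _ in range(n_rounds):
                # Train both modules on the current pseudo-labeled pool.
                text_model.fit(X_text[pool_idx], pool_labels)
                network_model.fit(X_net[pool_idx], pool_labels)
                # Pool pseudo labels: average the modules' confidences and
                # promote unlabeled documents the ensemble is sure about.
                conf = (text_model.predict_proba(X_text)
                        + network_model.predict_proba(X_net)) / 2
                preds = conf.argmax(axis=1)
                for i in np.where(conf.max(axis=1) > threshold)[0]:
                    if i not in pool_idx:
                        pool_idx.append(int(i))
                        pool_labels.append(int(preds[i]))
            return text_model, network_model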
  2.
    Along with textual content, visual features play an essential role in the semantics of visually rich documents. Information extraction (IE) tasks perform poorly on these documents if these visual cues are not taken into account. In this paper, we present Artemis, a visually aware, machine-learning-based IE method for heterogeneous visually rich documents. Artemis represents a visual span in a document by jointly encoding its visual and textual context for IE tasks. Our main contribution is twofold. First, we develop a deep-learning model that identifies the local context boundary of a visual span with minimal human labeling. Second, we describe a deep neural network that encodes the multimodal context of a visual span into a fixed-length vector by taking its textual and layout-specific features into account; it identifies the visual span(s) containing a named entity by leveraging this learned representation, followed by an inference task. We evaluate Artemis on four heterogeneous datasets from different domains over a suite of information extraction tasks. Results show that it outperforms state-of-the-art text-based methods by up to 17 points in F1-score.
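    The abstract's idea of encoding a visual span's textual and layout context into a fixed-length vector can be sketched as follows; the dimensions, the LSTM text encoder, and the use of a normalized bounding box as the layout feature are illustrative assumptions, not Artemis's actual architecture.

        # Hypothetical multimodal span encoder: fuse text and layout features.
        import torch
        import torch.nn as nn

        class SpanEncoder(nn.Module):
            def __init__(self, vocab_size, text_dim=128, layout_dim=4, out_dim=256):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, text_dim)
                self.text_enc = nn.LSTM(text_dim, text_dim, batch_first=True)
                self.proj = nn.Linear(text_dim + layout_dim, out_dim)

            def forward(self, token_ids, layout_feats):
                # token_ids: (batch, seq_len) word ids of the span's context;
                # layout_feats: (batch, 4), e.g. a normalized box [x0, y0, x1, y1].
                _, (h, _) = self.text_enc(self.embed(token_ids))
                fused = torch.cat([h[-1], layout_feats], dim=-1)
                return torch.tanh(self.proj(fused))  # fixed-length span vector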
  3. Obeid, I.; Selesnick, I. (Eds.)
    The Neural Engineering Data Consortium at Temple University has been providing key data resources to support the development of deep learning technology for electroencephalography (EEG) applications [1-4] since 2012. We currently have over 1,700 subscribers to our resources and have been providing data, software and documentation from our web site [5] since 2012. In this poster, we introduce additions to our resources that have been developed within the past year to facilitate software development and big data machine learning research. Major resources released in 2019 include:
    ● Data: The most current release of our open source EEG data is v1.2.0 of TUH EEG, which adds 3,874 sessions and 1,960 patients from mid-2015 through 2016.
    ● Software: We recently released a package, PyStream, that demonstrates how to correctly read an EDF file and access samples of the signal. This software demonstrates how to properly decode channels based on their labels and how to implement montages. Most existing open source packages for reading EDF files do not directly address the problem of channel labels [6].
    ● Documentation: We have released two documents that describe our file formats and data representations: (1) electrodes and channels [6], which describes how to map channel labels to physical locations of the electrodes and includes a description of every channel label appearing in the corpus; and (2) annotation standards [7], which describes our annotation file format and how to decode the data structures used to represent the annotations.
    Additional significant updates to our resources include:
    ● NEDC TUH EEG Seizure (v1.6.0): This release expands the training dataset from 4,597 files to 4,702. Calibration sequences have been manually annotated and added to our existing documentation, and numerous corrections were made to existing annotations based on user feedback.
    ● IBM TUSZ Pre-Processed Data (v1.0.0): A preprocessed version of the TUH Seizure Detection Corpus using two methods [8], both based on a sliding-window FFT (STFT). The first method uses the FFT log magnitudes directly. The second method normalizes the FFT values across frequency buckets, computes correlation coefficients, and calculates the eigenvalues of the resulting correlation matrix; the eigenvalues and the upper triangle of the correlation matrix are used to generate features.
    ● NEDC TUH EEG Artifact Corpus (v1.0.0): This corpus was developed to support modeling of non-seizure signals for problems such as seizure detection, and we have been using the data to build better background models. Five artifact events have been labeled: (1) eye movements (EYEM), (2) chewing (CHEW), (3) shivering (SHIV), (4) electrode pop, electrostatic artifacts, and lead artifacts (ELPP), and (5) muscle artifacts (MUSC). The data is cross-referenced to TUH EEG v1.1.0 so that patient numbers, sessions, etc. can be matched.
    ● NEDC Eval EEG (v1.3.0): In this release of our standardized scoring software, the False Positive Rate (FPR) definition of the Time-Aligned Event Scoring (TAES) metric has been updated [9]. The standard definition is the number of false positives divided by the number of false positives plus the number of true negatives: #FP / (#FP + #TN).
    We also recently introduced the ability to download our data from an anonymous rsync server. The rsync command [10] synchronizes a remote directory with a local one, copying the selected folder from the server to the desktop. It is available as part of most, if not all, Linux and Mac distributions (unfortunately, there is no acceptable port of this command for Windows). To use the rsync command to download content from our website, both a username and a password are needed; an automated registration process on our website grants both. A typical rsync command to access our data is: rsync -auxv nedc_tuh_eeg@www.isip.piconepress.com:~/data/tuh_eeg/ . (the trailing "." names the local destination directory). Rsync is a more robust option for downloading data; we have also experimented with Google Drive and Dropbox, but these technologies are not suitable for such large amounts of data. All of the resources described in this poster are open source and freely available at https://www.isip.piconepress.com/projects/tuh_eeg/downloads/. We will demonstrate how to access and utilize these resources during the poster presentation and collect community feedback on the most needed additions to enable significant advances in machine learning performance.
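    The two preprocessing methods in the IBM TUSZ release can be sketched roughly as follows; the window and hop sizes are illustrative, and the exact normalization used in the released pipeline may differ.

        # Method 1: sliding-window FFT (STFT) log magnitudes.
        # Method 2: normalize across frequency buckets, then take the
        # eigenvalues and upper triangle of the correlation matrix.
        import numpy as np

        def stft_log_magnitude(x, win=250, hop=125):
            frames = np.stack([x[i:i + win]
                               for i in range(0, len(x) - win + 1, hop)])
            spec = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))
            return np.log(spec + 1e-8)           # (n_frames, n_freq_bins)

        def correlation_eig_features(log_mag):
            z = (log_mag - log_mag.mean(axis=0)) / (log_mag.std(axis=0) + 1e-8)
            corr = np.corrcoef(z, rowvar=False)  # correlate frequency buckets
            eigvals = np.linalg.eigvalsh(corr)
            upper = corr[np.triu_indices_from(corr, k=1)]
            return np.concatenate([eigvals, upper])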
    Recent advances in Augmented Reality (AR) devices, and their maturity as a technology, offer new modalities for interaction between learners and their learning environments. Such capabilities are particularly important for learning that involves hands-on activities, where there is a compelling need to: (a) make connections between knowledge elements that have been taught at different times, (b) apply principles and theoretical knowledge in a concrete experimental setting, (c) understand the limitations of what can be studied via models and via experiments, (d) cope with increasing shortages in teaching-support staff and instructional material at the intersection of disciplines, and (e) improve student engagement in their learning. AR devices integrated into training and education systems can be used effectively to deliver just-in-time informatics that augment physical workspaces and learning environments with virtual artifacts. We present a system that demonstrates a solution to a critical registration problem and enables a multi-disciplinary team to develop pedagogical content without the need for extensive coding. The most popular approach for developing AR applications is to build a game using a standard game engine such as UNITY or UNREAL. These engines offer a powerful environment for developing a large variety of games and an exhaustive library of digital assets. In contrast, the framework we offer supports a limited range of human-environment interactions that are suitable and effective for training and education. Our system offers four important capabilities: annotation, navigation, guidance, and operator safety. These capabilities are presented and described in detail. The above framework motivates a change of focus from game development to AR content development. While game development is an intensive activity that involves extensive programming, AR content development is a multi-disciplinary activity that requires contributions from a large team of graphics designers, content creators, domain experts, pedagogy experts, and learning evaluators. We have demonstrated that such a multi-disciplinary team of experts working with our framework can use popular content creation tools to design and develop the virtual artifacts required for the AR system. These artifacts can be archived in a standard relational database and hosted on robust cloud-based backend systems for scale-up (a small illustrative sketch follows below). The AR content creators can own their content and use Non-Fungible Tokens to sequence the presentations, either to improve pedagogical novelty or to personalize the learning.
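    As a small illustration of archiving AR artifacts in a relational database, the sketch below uses Python's built-in sqlite3 module; the schema (table and column names) is invented for illustration and is not the authors' actual design.

        # Hypothetical schema for archiving AR artifacts with a presentation order.
        import sqlite3

        conn = sqlite3.connect("ar_content.db")
        conn.execute("""
            CREATE TABLE IF NOT EXISTS artifacts (
                id INTEGER PRIMARY KEY,
                lesson TEXT NOT NULL,      -- learning module the artifact belongs to
                kind TEXT NOT NULL,        -- annotation | navigation | guidance | safety
                asset_uri TEXT NOT NULL,   -- pointer to the asset in cloud storage
                sequence_no INTEGER        -- presentation order for personalization
            )
        """)
        conn.execute(
            "INSERT INTO artifacts (lesson, kind, asset_uri, sequence_no)"
            " VALUES (?, ?, ?, ?)",
            ("circuits-lab-1", "annotation", "s3://bucket/resistor_label.glb", 1),
        )
        conn.commit()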