NSF PAR Search | NSF Public Access Repository

Summarization assessment methodology for multiple corpora using queries and classification for functional evaluation

https://doi.org/10.3233/ICA-220680

Wolyn, Sam; Simske, Steven J. (June 2022, Integrated Computer-Aided Engineering)

Extractive summarization is an important natural language processing approach used for document compression, improved reading comprehension, key phrase extraction, indexing, query set generation, and other analytics approaches. Extractive summarization has specific advantages over abstractive summarization in that it preserves style, specific text elements, and compound phrases that might be more directly associated with the text. In this article, the relative effectiveness of extractive summarization is considered on two widely different corpora: (1) a set of works of fiction (100 total, mainly novels) available from Project Gutenberg, and (2) a large set of news articles (3000) for which a ground-truthed summarization (gold standard) is provided by the authors of the news articles. Both sets were evaluated using 5 different Python Sumy algorithms and compared quantitatively to randomly generated summarizations. Two functional approaches to assessing the efficacy of summarization are introduced: using a query set on both the original documents and their summaries, and using document classification on a 12-class set to compare among different summarization approaches. The results, unsurprisingly, show considerable differences consistent with the different nature of these two data sets. The LSA and Luhn summarization approaches were most effective on the database of fiction, while all five summarization approaches were similarly effective on the database of articles. Overall, the Luhn approach was deemed the most generally relevant among those tested.
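As a rough illustration of what "extractive" means in the abstract above, the following minimal sketch scores each sentence by the average corpus frequency of its non-stopword terms and returns the top-scoring sentences verbatim, in their original order. It is a simplified, frequency-based stand-in loosely inspired by Luhn-style scoring; it is not the Sumy implementation the authors evaluated, and the stopword list, the naive sentence splitter, and the names `split_sentences`, `tokenize`, and `summarize` are assumptions made for this example.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real systems use larger lists).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on", "for", "was", "that"}


def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    # Lowercase word tokens with stopwords removed.
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS]


def summarize(text, n_sentences=2):
    """Return the n highest-scoring sentences, unchanged, in document order."""
    sentences = split_sentences(text)
    # Term frequencies over the whole document.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        words = tokenize(sentence)
        # Average term frequency, so long sentences are not favored by length alone.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank sentence indices by score; ties keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]


if __name__ == "__main__":
    sample = "Cats chase mice. Cats sleep often. Dogs bark."
    print(summarize(sample, n_sentences=2))
```

Because the output consists only of sentences copied from the input, style and compound phrases are preserved, which is the property the abstract attributes to extractive methods.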
Full Text Available
Engineering of an artificial intelligence safety data sheet document processing system for environmental, health, and safety compliance

https://doi.org/10.1145/3469096.3474933

Fenton, Kevin; Simske, Steven (August 2021, DocEng '21: Proceedings of the 21st ACM Symposium on Document Engineering)
Chemical Safety Data Sheets (SDS) are the primary method by which chemical manufacturers communicate the ingredients and hazards of their products to the public. These SDSs are used for a wide variety of purposes ranging from environmental calculations to occupational health assessments to emergency response measures. Although a few companies have provided direct digital data transfer platforms using XML or equivalent schemata, the vast majority of chemical ingredient and hazard communication to product users still occurs through millions of PDF documents that are largely loaded into downstream user databases through manual data entry. This research focuses on the reverse engineering of SDS document types to adapt to various layouts and the harnessing of meta-algorithmic and neural network approaches to provide a means of moving industrial institutions towards a digital universal SDS processing methodology. The complexities of SDS documents, including the lack of format standardization, text and image combinations, and multi-lingual translation needs, together limit the accuracy and precision of optical character recognition tools. The approach in this document is to translate entire SDSs from thousands of chemical vendors, each with distinct formatting, to machine-encoded text with a high degree of accuracy and precision. Then the system will "read" and assess these documents as a human would; that is, ensuring that the documents are compliant, determining whether chemical formulations have changed, ensuring reported values are within expected thresholds, and comparing them to similar products for more environmentally friendly alternatives.
Full Text Available
Potentials of blockchain technologies for supply chain collaboration: a conceptual framework

https://doi.org/10.1108/IJLM-02-2020-0098

Rejeb, Abderahman; Keogh, John G.; Simske, Steven J.; Stafford, Thomas; Treiblmaier, Horst (February 2021, The International Journal of Logistics Management)
Purpose: The purpose of this study is to investigate the potentials of blockchain technologies (BC) for supply chain collaboration (SCC). Design/methodology/approach: Building on a narrative literature review and analysis of seminal SCC research, BC characteristics are integrated into a conceptual framework consisting of seven key dimensions: information sharing, resource sharing, decision synchronization, goal congruence, incentive alignment, collaborative communication and joint knowledge creation. The relevance of each category is briefly assessed. Findings: BC technologies can impact collaboration between transaction partners in modern supply chains (SCs) by streamlining information sharing processes, by supporting decision and reward models and by strengthening communicative relationships with SC partners. BC promises important future capabilities in SCs by facilitating auditability, improving accountability, enhancing data and information transparency and improving trust in B2B relationships. The technology also promises to strengthen collaboration and to overcome vulnerabilities related to moral hazard and shortcomings found in legacy technologies. Research limitations/implications: The paper is mainly focused on the potentials of BC technologies on SCC as envisioned in the current academic literature. Hence, there is a need to validate the theoretical inferences with other approaches such as expert interviews and empirical tests. This study is of use to practitioners and decision-makers seeking to engage in BC-collaborative SC models. Originality/value: The value of this paper lies in its call for an increased focus on the possibilities of BC technologies to support SCC. This study also contributes to the literature by filling the knowledge gap of how BC potentially impacts SC management.
Full Text Available
Differentiating Digital Printing Through Physical and Chemical Analyses

Sousa Ribeiro, Ana C.; Kellar, Jon J.; Crawford, Grant A.; Simske, Steven J.; Petersen, Jacob B. (October 2020, NIP & Digital Fabrication Conference, Printing for Fabrication Online 2020 Final Program and Proceedings)
Over the past decade, the trade of counterfeit goods has increased. This has been enabled by advancements in low-cost digital printing methods (e.g., inkjet and laserjet) that are an asset for counterfeit production. However, each printing method produces characteristic printed features that can be used to identify not only the printing method but also the specific make and model of printer. This knowledge can be used to determine whether or not the analyzed item is counterfeit. During the first phase of this research, chemical and physical analyses were performed on printed documents and ink samples for two types of digital printing: inkjet and laserjet. The results showed that it is possible to identify the digital method used to print a document by its unique features. Physical analysis revealed that the laserjet prints have a higher image quality, characterized by sharper feature edge quality, brighter image area, and a thicker ink layer (10-micron average thickness), than inkjet documents. Chemical analysis showed that the inkjet and laserjet inks could easily be distinguished by identifying the various ink components. Inkjet inks included (among others) water and ethylene glycol, while laserjet inks contained styrene, methacrylate, and sulfide compounds.
Full Text Available
A Method for Estimating Driving Factors of Illicit Trade Using Node Embeddings and Clustering

https://doi.org/10.1007/978-3-030-49076-8_22

González Ordiano, Jorge Ángel; Finn, Lisa; Winterlich, Anthony; Moloney, Gary; Simske, Steven (April 2020, Mexican Conference on Pattern Recognition)

The trade in illegal goods and services, also known as illicit trade, is expected to drain 4.2 trillion dollars from the world economy and put 5.4 million jobs at risk by 2022. These estimates reflect the importance of combating illicit trade, as it poses a danger to individuals and undermines governments. To do so, however, we have to first understand the factors that influence this type of trade. Therefore, we present in this article a method that uses node embeddings and clustering to compare a country-based illicit supply network to other networks that represent other types of country relationships (e.g., free trade agreements, language). The results offer initial clues on the factors that might be driving the illicit trade between countries.
Full Text Available
On the Analysis of Illicit Supply Networks Using Variable State Resolution-Markov Chains

https://doi.org/10.1007/978-3-030-50146-4_38

González Ordiano, Jorge Ángel; Finn, Lisa; Winterlich, Anthony; Moloney, Gary; Simske, Steven (April 2020, International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems)

The trade in illicit items, such as counterfeits, not only leads to the loss of large sums of private and public revenue, but also poses a danger to individuals, undermines governments, and--in the most extreme cases--finances criminal organizations. It is estimated that in 2013 trade in illicit items accounted for 2.5% of global commerce. To combat illicit trade, it is necessary to understand its illicit supply networks. Therefore, we present in this article an approach that is able to find an optimal description of an illicit supply network using a series of Variable State Resolution-Markov Chains. The new method is applied to a real-world dataset stemming from the Global Product Authentication Service of Micro Focus International. The results show how an illicit supply network might be analyzed with the help of this method.
Full Text Available
The CNN-Corpus: A Large textual Corpus for Single-Document Extractive Summarization

https://doi.org/10.1145/3342558.3345388

Lins, Rafael Dueire; Oliveira, Hilario; Cabral, Luciano; Batista, Jamilson; Tenorio, Bruno; Ferreira, Rafael; Lima, Rinaldo; de França Pereira e Silva, Gabriel; Simske, Steven J (October 2019, Proceedings of the ACM Symposium on Document Engineering)

This paper details the features and the methodology adopted in the construction of the CNN-corpus, a test corpus for single document extractive text summarization of news articles. The current version of the CNN-corpus encompasses 3,000 texts in English, and each of them has an abstractive and an extractive summary. The corpus allows quantitative and qualitative assessments of extractive summarization strategies.
Full Text Available
The CNN-Corpus in Spanish: a Large Corpus for Extractive Text Summarization in the Spanish Language

https://doi.org/10.1145/3342558.3345423

Lins, Rafael Dueire; Oliveira, Hilario; Cabral, Luciano; Batista, Jamilson; Tenorio, Bruno; Salcedo, Diego A; Ferreira, Rafael; Lima, Rinaldo; de França Pereira e Silva, Gabriel; Simske, Steven J (January 2019, Proceedings of the ACM Symposium on Document Engineering)

This paper details the development and features of the CNN-corpus in Spanish, possibly the largest test corpus for single-document extractive text summarization in the Spanish language. Its current version encompasses 1,117 well-written texts in Spanish, each of which has an abstractive and an extractive summary. The development methodology adopted allows good-quality qualitative and quantitative assessments of summarization strategies for tools developed in the Spanish language.
Full Text Available
