skip to main content


Title: Semantic knowledge management system for design documentation with heterogeneous data using machine learning
Design documentation is presumed to contain massive amounts of valuable information and expert knowledge that is useful for learning from the past successes and failures. However, the current practice of documenting design in most industries does not result in big data that can support a true digital transformation of enterprise. Very little information on concepts and decisions in early product design has been digitally captured, and the access and retrieval of them via taxonomy-based knowledge management systems are very challenging because most rule-based classification and search systems cannot concurrently process heterogeneous data (text, figures, tables, references). When experts retire or leave a design unit, industry often cannot benefit from past knowledge for future product design, and is left to reinvent the wheel repeatedly. In this work, we present AI-based Natural Language Processing (NLP) models which are trained for contextually representing technical documents containing texts, figures and tables, to do a semantic search for the retrieval of relevant data across large corpora of documents. By connecting textual and non-textual data through the use of an associative database, the semantic search question-answering system we developed can provide more comprehensive answers in the context of users’ questions. For the demonstration and assessment of this model, the semantic search question-answering system is applied to the Intergovernmental Panel on Climate Change (IPCC) Special Report 2019, which is more than 600 pages long and difficult to read and understand, even by most experts. Users can input custom queries relating to climate change concerns and receive evidence from the report that is contextually meaningful. We expect this method can transform current repositories of design documentation of heterogeneous data forms into structured knowledge-bases which can return relevant information efficiently as well as can evolve to embody manageable big data for the true digital transformation of design.  more » « less
Award ID(s):
1854833
NSF-PAR ID:
10340890
Author(s) / Creator(s):
; ; ;
Editor(s):
Anwer, Nabil
Date Published:
Journal Name:
32nd CIRP Design Conference (CIRP Design 2022) - Design in a changing world
Volume:
109
Page Range / eLocation ID:
95-100
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Nearly every artifact of the modern engineering design process is digitally recorded and stored, resulting in an overwhelming amount of raw data detailing past designs. Analyzing this design knowledge and extracting functional information from sets of digital documents is a difficult and time-consuming task for human designers. For the case of textual documentation, poorly written superfluous descriptions filled with jargon are especially challenging for junior designers with less domain expertise to read. If the task of reading documents to extract functional requirements could be automated, designers could actually benefit from the distillation of massive digital repositories of design documentation into valuable information that can inform engineering design. This paper presents a system for automating the extraction of structured functional requirements from textual design documents by applying state of the art Natural Language Processing (NLP) models. A recursive method utilizing Machine Learning-based question-answering is developed to process design texts by initially identifying the highest-level functional requirement, and subsequently extracting additional requirements contained in the text passage. The efficacy of this system is evaluated by comparing the Machine Learning-based results with a study of 75 human designers performing the same design document analysis task on technical texts from the field of Microelectromechanical Systems (MEMS). The prospect of deploying such a system on the sum of all digital engineering documents suggests a future where design failures are less likely to be repeated and past successes may be consistently used to forward innovation.

     
    more » « less
  2. An abundance of biomedical data is generated in the form of clinical notes, reports, and research articles available online. This data holds valuable information that requires extraction, retrieval, and transformation into actionable knowledge. However, this information has various access challenges due to the need for precise machine-interpretable semantic metadata required by search engines. Despite search engines' efforts to interpret the semantics information, they still struggle to index, search, and retrieve relevant information accurately. To address these challenges, we propose a novel graph-based semantic knowledge-sharing approach to enhance the quality of biomedical semantic annotation by engaging biomedical domain experts. In this approach, entities in the knowledge-sharing environment are interlinked and play critical roles. Authorial queries can be posted on the "Knowledge Cafe," and community experts can provide recommendations for semantic annotations. The community can further validate and evaluate the expert responses through a voting scheme resulting in a transformed "Knowledge Cafe" environment that functions as a knowledge graph with semantically linked entities. We evaluated the proposed approach through a series of scenarios, resulting in precision, recall, F1-score, and accuracy assessment matrices. Our results showed an acceptable level of accuracy at approximately 90%. The source code for "Semantically" is freely available at: https://github.com/bukharilab/Semantically 
    more » « less
  3. An abundance of biomedical data is generated in the form of clinical notes, reports, and research articles available online. This data holds valuable information that requires extraction, retrieval, and transformation into actionable knowledge. However, this information has various access challenges due to the need for precise machine-interpretable semantic metadata required by search engines. Despite search engines' efforts to interpret the semantics information, they still struggle to index, search, and retrieve relevant information accurately. To address these challenges, we propose a novel graph-based semantic knowledge-sharing approach to enhance the quality of biomedical semantic annotation by engaging biomedical domain experts. In this approach, entities in the knowledge-sharing environment are interlinked and play critical roles. Authorial queries can be posted on the "Knowledge Cafe," and community experts can provide recommendations for semantic annotations. The community can further validate and evaluate the expert responses through a voting scheme resulting in a transformed "Knowledge Cafe" environment that functions as a knowledge graph with semantically linked entities. We evaluated the proposed approach through a series of scenarios, resulting in precision, recall, F1-score, and accuracy assessment matrices. Our results showed an acceptable level of accuracy at approximately 90%. The source code for "Semantically" is freely available at: https://github.com/bukharilab/Semantically 
    more » « less
  4. Physical and digital documents often contain visually rich information. With such information, there is no strict order- ing or positioning in the document where the data values must appear. Along with textual cues, these documents often also rely on salient visual features to define distinct semantic boundaries and augment the information they disseminate. When performing information extraction (IE), traditional techniques fall short, as they use a text-only representation and do not consider the visual cues inherent to the layout of these documents. We propose VS2, a generalized approach for information extraction from heterogeneous visually rich documents. There are two major contributions of this work. First, we propose a robust segmentation algorithm that de- composes a visually rich document into a bag of visually iso- lated but semantically coherent areas, called logical blocks. Document type agnostic low-level visual and semantic fea- tures are used in this process. Our second contribution is a distantly supervised search-and-select method for identify- ing the named entities within these documents by utilizing the context boundaries defined by these logical blocks. Ex- perimental results on three heterogeneous datasets suggest that the proposed approach significantly outperforms its text-only counterparts on all datasets. Comparing it against the state-of-the-art methods also reveal that VS2 performs comparably or better on all datasets. 
    more » « less
  5. Abstract Motivation

    Figures in biomedical papers communicate essential information with the potential to identify relevant documents in biomedical and clinical settings. However, academic search interfaces mainly search over text fields.

    Results

    We describe a search system for biomedical documents that leverages image modalities and an existing index server. We integrate a problem-specific taxonomy of image modalities and image-based data into a custom search system. Our solution features a front-end interface to enhance classical document search results with image-related data, including page thumbnails, figures, captions and image-modality information. We demonstrate the system on a subset of the CORD-19 document collection. A quantitative evaluation demonstrates higher precision and recall for biomedical document retrieval. A qualitative evaluation with domain experts further highlights our solution’s benefits to biomedical search.

    Availability and implementation

    A demonstration is available at https://runachay.evl.uic.edu/scholar. Our code and image models can be accessed via github.com/uic-evl/bio-search. The dataset is continuously expanded.

     
    more » « less