

Title: LDA v. LSA: A Comparison of Two Computational Text Analysis Tools for the Functional Categorization of Patents
One means of supporting design-by-analogy (DbA) in practice is giving designers efficient access to source analogies as inspiration for solving problems. The patent database has been used for many DbA support efforts, as it is a preexisting repository of catalogued technology. Latent Semantic Analysis (LSA) has been shown to be an effective computational text processing method for extracting meaningful similarities between patents for useful functional exploration during DbA. However, this has only been shown to be useful at a small scale (100 patents). Considering the vastness of the patent database and realistic exploration at a large scale, it is important to consider how these computational analyses change with orders of magnitude more data. We present an analysis of 1,000 randomly selected mechanical patents, comparing the ability of LSA and Latent Dirichlet Allocation (LDA) to categorize patents into meaningful groups. Resulting implications for large(r) scale data mining of patents for DbA support are detailed.
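
The comparison described above can be prototyped with off-the-shelf tools. The sketch below is illustrative only and is not the authors' pipeline: it treats LSA as TF-IDF followed by truncated SVD and LDA as a count-based topic model, then clusters either representation into candidate functional groups. The corpus, parameter values, and function names are hypothetical placeholders.

    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation
    from sklearn.cluster import KMeans

    def lsa_embedding(patent_texts, n_dims=50):
        # LSA: TF-IDF term-document matrix reduced with truncated SVD.
        tfidf = TfidfVectorizer(stop_words="english", max_features=5000)
        return TruncatedSVD(n_components=n_dims, random_state=0).fit_transform(
            tfidf.fit_transform(patent_texts))

    def lda_embedding(patent_texts, n_topics=50):
        # LDA: raw term counts modeled as per-document topic mixtures.
        counts = CountVectorizer(stop_words="english", max_features=5000)
        return LatentDirichletAllocation(n_components=n_topics, random_state=0).fit_transform(
            counts.fit_transform(patent_texts))

    def functional_groups(embedding, n_groups=10):
        # Cluster either representation into candidate functional categories.
        return KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(embedding)

    # Usage (hypothetical): given a list of 1,000 patent abstracts `patents`, compare
    # functional_groups(lsa_embedding(patents)) with functional_groups(lda_embedding(patents))
    # against expert-assigned functional categories.
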
Award ID(s):
1663204
NSF-PAR ID:
10055536
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
International Conference on Case-Based Reasoning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Design-by-analogy (DbA) is an important method for innovation that has gained much attention due to its history of leading to successful and novel design solutions. The method uses a repository of existing design solutions in which designers can recognize and retrieve analogical inspirations. Yet exploring for analogical inspiration has been a laborious task for designers. This work presents a computational methodology driven by a topic modeling technique called non-negative matrix factorization (NMF). NMF is widely used in the text mining field for its ability to discover topics within documents based on their semantic content. In the proposed methodology, NMF is performed iteratively to build hierarchical repositories of design solutions, with which designers can explore clusters of analogical stimuli. This methodology has been applied to a repository of mechanical design-related patents, processed to contain only component-, behavior-, or material-based content, to test whether unique and valuable attribute-based analogical inspiration can be discovered from the different representations of patent data. The hierarchical repositories have been visualized, and a case study has been conducted to test the effectiveness of the analogical retrieval process of the proposed methodology. Overall, this paper demonstrates that the exploration-based computational methodology may provide designers enhanced control over design repositories to retrieve analogical inspiration for DbA practice.
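    A minimal sketch of the basic NMF step this abstract describes, assuming scikit-learn; it is not the paper's implementation, and the corpus, topic count, and helper name are placeholders. Each row of the factor H names a topic by its strongest terms, and re-running the factorization within one topic yields the hierarchical structure the authors mention.

        import numpy as np
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import NMF

        def nmf_topics(patent_texts, n_topics=20, top_terms=10):
            # Factor the TF-IDF term-document matrix X into W (documents x topics)
            # and H (topics x terms); the largest entries of each H row name the topic.
            vec = TfidfVectorizer(stop_words="english", max_features=5000)
            X = vec.fit_transform(patent_texts)
            model = NMF(n_components=n_topics, init="nndsvd", random_state=0)
            doc_topic = model.fit_transform(X)
            terms = np.array(vec.get_feature_names_out())
            topic_terms = [terms[np.argsort(row)[::-1][:top_terms]].tolist()
                           for row in model.components_]
            return doc_topic, topic_terms

        # Hierarchical use, as the abstract describes: assign each patent to its
        # strongest topic (doc_topic.argmax(axis=1)), then re-run nmf_topics() on the
        # texts within one topic to split that cluster into finer-grained sub-topics.
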
  2. This paper presents an exploration-based computational methodology to aid the analogical retrieval process in design-by-analogy practice. The computational methodology, driven by Nonnegative Matrix Factorization (NMF), iteratively builds hierarchical repositories of design solutions within which clusters of design analogies can be explored by designers. In this work, the methodology has been applied to a large repository of mechanical design-related patents, processed to contain only component-, behavior-, or material-based content, to demonstrate that unique and valuable attribute-based analogical inspiration can be discovered from different representations of patent data. For explorative purposes, the hierarchical repositories have been visualized with a three-dimensional hierarchical structure and a two-dimensional bar graph structure, which can be used interchangeably for retrieving analogies. This paper demonstrates that the exploration-based computational methodology provides designers enhanced control over design repositories, empowering them to retrieve analogical inspiration for design-by-analogy practice.
  3. Abstract: Research summary

    To what extent do firms rely on basic science in their R&D efforts? Several scholars have sought to answer this and related questions, but progress has been impeded by the difficulty of matching unstructured references in patents to published papers. We introduce an open‐access dataset of references from the front pages of patents granted worldwide to scientific papers published since 1800. Each patent‐paper linkage is assigned a confidence score, which is characterized in a random sample by false negatives versus false positives. All matches are available for download at http://relianceonscience.org. We outline several avenues for strategy research enabled by these new data.

    Managerial summary

    To what extent do firms rely on basic science in their R&D efforts? Several scholars have sought to answer this and related questions, but progress has been impeded by the difficulty of matching unstructured references in patents to published papers. We introduce an open‐access dataset of references from the front pages of patents granted worldwide to scientific papers published since 1800. Each patent‐paper linkage is assigned a confidence score, and we check a random sample of these confidence scores by hand in order to estimate both coverage (i.e., of the matches we should have found, what percentage did we find) and accuracy (i.e., of the matches we found, what percentage are correct). We outline several avenues for strategy research enabled by these new data.

     
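    A toy illustration of the coverage and accuracy estimates described above, assuming a hand-checked random sample of candidate patent-paper matches; the counts below are invented for the example.

        def coverage_and_accuracy(found_correct, found_total, true_links_in_sample):
            accuracy = found_correct / found_total            # of matches found, % correct
            coverage = found_correct / true_links_in_sample   # of true links, % found
            return coverage, accuracy

        # e.g., if hand-checking shows 90 of 100 reported matches are correct,
        # and the sample actually contains 120 true patent-paper links:
        cov, acc = coverage_and_accuracy(90, 100, 120)
        print(f"coverage={cov:.2f}, accuracy={acc:.2f}")   # coverage=0.75, accuracy=0.90
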
  4. Learning global features by aggregating information over multiple views has been shown to be effective for 3D shape analysis. For view aggregation in deep learning models, pooling has been applied extensively. However, pooling leads to a loss of the content within views and of the spatial relationships among views, which limits the discriminability of learned features. We propose 3DViewGraph to resolve this issue; it learns 3D global features by more effectively aggregating unordered views with attention. Specifically, unordered views taken around a shape are regarded as view nodes on a view graph. 3DViewGraph first learns a novel latent semantic mapping to project low-level view features into meaningful latent semantic embeddings in a lower-dimensional space, which is spanned by latent semantic patterns. Then, the content and spatial information of each pair of view nodes are encoded by a novel spatial pattern correlation, where the correlation is computed among latent semantic patterns. Finally, all spatial pattern correlations are integrated with attention weights learned by a novel attention mechanism. This further increases the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguity. We show that 3DViewGraph outperforms state-of-the-art methods on three large-scale benchmarks.

     
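    The attention-weighted aggregation idea in this abstract can be illustrated with a few lines of linear algebra. The sketch below is a loose analogy, not the 3DViewGraph architecture: it projects placeholder view features into a latent space, scores views by pairwise correlation, and aggregates them with softmax attention weights. All dimensions and inputs are random stand-ins.

        import numpy as np

        rng = np.random.default_rng(0)

        # Placeholder inputs: feature vectors for V unordered views of one shape and a
        # projection into a K-dimensional latent space (both random stand-ins here).
        V, D, K = 12, 256, 32
        view_feats = rng.normal(size=(V, D))
        projection = rng.normal(size=(D, K)) * 0.01

        def softmax(x):
            x = x - x.max()
            e = np.exp(x)
            return e / e.sum()

        embeddings = view_feats @ projection       # each view mapped to the latent space
        correlation = embeddings @ embeddings.T    # pairwise similarity between view embeddings
        attention = softmax(correlation.mean(axis=1))                   # one weight per view
        global_feature = (attention[:, None] * embeddings).sum(axis=0)  # weighted aggregation
        print(global_feature.shape)                # (K,) global descriptor for the shape
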
  5. Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Eds.)
    The goal of this work was to design a low-cost computing facility that can support the development of an open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes, and a 1M-image database requires over a petabyte (PB) of disk space. Doing meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage.

    To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture that distributes a filesystem across multiple machines. These enhancements, which are entirely based on open source software, have extended the capabilities of our cluster and increased its cost-effectiveness.

    Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4], and Slurm's built-in mechanisms for handling them were a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be used to configure and enable support for resources beyond the ones provided by the traditional HPC scheduler (e.g., memory, wall-clock time), and GPUs are among the GRES types that Slurm supports [5]. In addition to tracking resources, Slurm strictly enforces resource allocation. This becomes very important as the computational demands of jobs increase, so that each job has all the resources it needs and does not take resources from other jobs. It is a common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs, attempting to establish a context on all of them. Slurm is able to affect the hardware discovery process of these jobs, which enables a number of them to run alongside each other even if the GPUs are in exclusive-process mode.

    To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We utilized a number of open source tools to create a single filesystem, which can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into four 60-disk chassis using 8 TB drives. To support these disks, we have two server units, each equipped with Intel Xeon CPUs and 128 GB of RAM. At the filesystem level, we have implemented a multi-layer solution that: (1) connects the disks together into a single filesystem/mountpoint using ZFS (the Zettabyte File System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using Gluster [7]. ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem that takes advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality.

    Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these and provides the means to connect them together over the network using distributed (similar to RAID 0, but without striping individual files) and mirrored (similar to RAID 1) configurations [8]. By implementing these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily, supporting the most computationally intensive endeavors by scaling horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent price/performance ratio [1].
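    As a minimal sketch of the GPU scheduling discussed above (not the Neuronix configuration), a job can request GPUs through Slurm's GRES mechanism; the partition name, resource amounts, and script path below are placeholders, and the snippet assumes sbatch is available on a Slurm head node.

        import subprocess

        def submit_gpu_job(script_path, gpus=2, mem_gb=32, hours=4, partition="gpu"):
            cmd = [
                "sbatch",
                f"--partition={partition}",
                f"--gres=gpu:{gpus}",        # GRES request: Slurm tracks and enforces GPU use
                f"--mem={mem_gb}G",
                f"--time={hours:02d}:00:00",
                script_path,
            ]
            # Returns Slurm's "Submitted batch job <id>" line on success.
            return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

        # Example (placeholder script path): submit_gpu_job("train_pathology_model.sh", gpus=4)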