Title: Supporting Program Comprehension through Fast Query response in Large-Scale Systems
Software traceability supports various engineering activities, including Program Comprehension; however, it can be challenging and arduous to complete in large industrial projects. Researchers have proposed automated traceability techniques to create, maintain, and leverage trace links. Computationally intensive techniques, such as repository mining and deep learning, have shown the capability to deliver accurate trace links. However, trusted, automated tracing at industrial scale has not yet been achieved because of practical performance challenges. This paper evaluates high-performance solutions for deploying effective, computationally expensive traceability algorithms in large-scale industrial projects and leverages the generated trace links to answer Program Comprehension queries. We comparatively evaluate four different platforms for supporting industrial-scale tracing solutions capable of tackling software projects with millions of artifacts. We demonstrate that tracing solutions built using big data frameworks scale well for large projects and that our Spark implementation outperforms relational database, graph database (GraphDB), and plain Java implementations. These findings contradict earlier results, which suggested that GraphDB solutions should be adopted for large-scale tracing problems.
Award ID(s):
1649448 1901059
Journal Name:
2020 IEEE/ACM 28th International Conference on Program Comprehension (ICPC)
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Software traceability establishes associations between diverse software artifacts such as requirements, design, code, and test cases. Due to the non-trivial costs of manually creating and maintaining links, many researchers have proposed automated approaches based on information retrieval techniques. However, many globally distributed software projects produce software artifacts written in two or more languages. The use of intermingled languages reduces the efficacy of automated tracing solutions. In this paper, we first analyze and discuss patterns of intermingled language use across multiple projects, and then evaluate several different tracing algorithms including the Vector Space Model (VSM), Latent Semantic Indexing (LSI), Latent Dirichlet Allocation (LDA), and various models that combine mono- and cross-lingual word embeddings with the Generative Vector Space Model (GVSM). Based on an analysis of 14 Chinese-English projects, our results show that best performance is achieved using mono-lingual word embeddings integrated into GVSM with machine translation as a preprocessing step. 
  2. In many regulated domains, traceability is established across diverse artifacts such as requirements, design, code, test cases, and hazards -- either manually or with the help of supporting tools, and the resulting trace links are used to support activities such as impact analysis, compliance verification, and safety inspections. Automated tracing techniques need to leverage the semantics of underlying artifacts in order to establish more accurate trace links and to provide explanations of links that have been created in either a manual or automated fashion. To support this, we propose an automated technique which leverages source code, project artifacts and an external domain corpus to generate a domain-specific concept model. We then use the generated concept model to improve traceability results and to provide explanations of the results. Our approach overcomes existing problems with deep-learning traceability algorithms, as it does not require a training set of existing trace links. Finally, as an initial proof-of-concept, we apply our semantically-guided approach to the Dronology project, and show that it improves over other tracing techniques that do not use a concept model. 
  3. Software traceability establishes a network of connections between diverse artifacts such as requirements, design, and code. However, given the cost and effort of creating and maintaining trace links manually, researchers have proposed automated approaches using information retrieval techniques. Current approaches focus almost entirely upon generating links between pairs of artifacts and have not leveraged the broader network of interconnected artifacts. In this paper we investigate the use of intermediate artifacts to enhance the accuracy of the generated trace links, focusing on paths consisting of source, target, and intermediate artifacts. We propose and evaluate combinations of techniques for computing semantic similarity, scaling scores across multiple paths, and aggregating results from multiple paths. We report results from five projects, including one large industrial project. We find that leveraging intermediate artifacts improves the accuracy of end-to-end trace retrieval across all datasets and accuracy metrics. After further analysis, we discover that leveraging intermediate artifacts is only helpful when a project's artifacts share a common vocabulary, which tends to occur in refinement and decomposition hierarchies of artifacts. Given our hybrid approach that integrates both direct and transitive links, we observed little to no loss of accuracy when intermediate artifacts lacked a shared vocabulary with source or target artifacts.
  4. Previous work has established both the importance and difficulty of establishing and maintaining adequate software traceability. While it has been shown to support essential maintenance and evolution tasks, recovering traceability links between related software artifacts is a time-consuming and error-prone task. As such, substantial research has been done to reduce this barrier to adoption by at least partially automating traceability link recovery. In particular, recent work has shown that supervised machine learning can be effectively used for automating traceability link recovery, as long as there is sufficient data (i.e., labeled traceability links) to train a classification model. Unfortunately, the amount of data required by these techniques is a serious limitation, given that most software systems rarely have traceability information to begin with. In this paper we address this limitation of previous work and propose an approach based on active learning, which substantially reduces the amount of training data needed by supervised classification approaches for traceability link recovery while maintaining similar performance.
  5. Software traceability establishes and leverages associations between diverse development artifacts. Researchers have proposed the use of deep learning trace models to link natural language artifacts, such as requirements and issue descriptions, to source code; however, their effectiveness has been restricted by the availability of labeled data and by efficiency at runtime. In this study, we propose a novel framework called Trace BERT (T-BERT) to generate trace links between source code and natural language artifacts. To address data sparsity, we leverage a three-step training strategy to enable trace models to transfer knowledge from a closely related Software Engineering challenge, which has a rich dataset, to produce trace links with much higher accuracy than has previously been achieved. We then apply the T-BERT framework to recover links between issues and commits in Open Source Projects. We comparatively evaluated the accuracy and efficiency of three BERT architectures. Results show that a Single-BERT architecture generated the most accurate links, while a Siamese-BERT architecture produced comparable results with significantly less execution time. Furthermore, by learning and transferring knowledge, all three models in the framework outperform classical IR trace models. On the three evaluated real-world OSS projects, the best T-BERT consistently outperformed the VSM model, with an average improvement of 60.31% measured using Mean Average Precision (MAP). RNN severely underperformed on these projects due to insufficient training data, while T-BERT overcame this problem by using pretrained language models and transfer learning.