Title: TestSage: Regression Test Selection for Large-Scale Web Service Testing
Regression testing is an important but expensive activity in software development. Among the various types of tests, web service tests are usually among the most expensive (due to network communication), yet they are widely adopted in commercial software development. Regression test selection (RTS) aims to reduce the number of tests that need to be rerun by running only the tests that are affected by code changes. Although a large number of RTS techniques have been proposed in the past few decades, these techniques have not been adopted for large-scale web service testing. This is because most existing RTS techniques either require a direct code dependency between tests and the code under test or cannot be applied to large-scale systems with enough efficiency. In this paper, we present a novel RTS technique, TestSage, that performs RTS for web service tests on large-scale commercial software. With a small overhead, TestSage is able to collect fine-grained (function-level) dependencies between tests and services under test that do not directly depend on each other. TestSage has also been successfully applied to large, complex systems with over a million functions. We conducted experiments with TestSage on a large-scale backend service at Google. Experimental results show that TestSage reduces testing time by 34% when running all AEC (Analysis, Execution and Collection) phases, and by 50% when running without the collection phase. TestSage has been integrated with an internal testing framework at Google and runs day-to-day at the company.
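The core idea of function-level RTS described in the abstract can be illustrated with a minimal sketch. This is not TestSage's implementation: the function `select_tests`, the dependency map, and all test and function names below are hypothetical, and the sketch assumes that a previous run has already recorded which functions each test exercised.

```python
from typing import Dict, List, Set


def select_tests(
    deps: Dict[str, Set[str]],
    changed: Set[str],
    all_tests: List[str],
) -> List[str]:
    """Return the tests that should be rerun for a set of changed functions.

    deps maps a test name to the functions it exercised in an earlier run
    (hypothetical data; a real system would collect this by instrumentation).
    """
    selected = []
    for test in all_tests:
        covered = deps.get(test)
        # A test with no recorded dependencies (e.g. a new test) is always
        # selected, so that the selection stays conservative.
        if covered is None or covered & changed:
            selected.append(test)
    return selected


# Hypothetical example data.
deps = {
    "test_login": {"auth.check", "db.get_user"},
    "test_search": {"index.query"},
}
changed = {"db.get_user"}
all_tests = ["test_login", "test_search", "test_new"]

print(select_tests(deps, changed, all_tests))
```

In this sketch, `test_search` is skipped because none of the functions it exercised were changed, while `test_new` is run because no dependency information exists for it yet.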
Award ID(s):
1763906 1718903
Publication Date:
Journal Name:
IEEE Conference on Software Testing, Validation and Verification (ICST)
Page Range or eLocation-ID:
430 to 440
Sponsoring Org:
National Science Foundation
More Like this
  1. Regression testing - running available tests after each project change - is widely practiced in industry. Despite its widespread use and importance, regression testing is a costly activity. Regression test selection (RTS) optimizes regression testing by selecting only tests affected by project changes. RTS has been extensively studied and several tools have been deployed in large projects. However, work on RTS over the last decade has mostly focused on languages with abstract computing machines (e.g., JVM). Meanwhile, development practices (e.g., frequency of commits, testing frameworks, compilers) in C++ projects have dramatically changed, and how we should design and implement RTS tools, and what the benefits of those tools are, is unknown. We present a design and implementation of an RTS technique, dubbed RTS++, that targets projects written in C++, which compile to LLVM IR and use the Google Test testing framework. RTS++ uses static analysis of a function call graph to select tests. RTS++ integrates with many existing build systems, including AutoMake, CMake, and Make. We evaluated RTS++ on 11 large open-source projects, totaling 3,811,916 lines of code. To the best of our knowledge, this is the largest evaluation of an RTS technique for C++. We measured the benefits of RTS++ compared to running all available tests (i.e., retest-all). Our results show that RTS++ reduces the number of executed tests and end-to-end testing time by 88% and 61% on average.
  2. Regression testing - rerunning tests at each code version to detect newly-broken functionality - is important and widely practiced. But regression testing is costly due to the large number of tests and high frequency of code changes. Regression test selection (RTS) optimizes regression testing by rerunning only a subset of tests that can be affected by code changes. Researchers showed that RTS based on dynamic and static program analysis can save substantial testing time for (medium-sized) open-source projects. Simultaneously, practitioners showed that RTS based on machine learning (ML) is lightweight and works well on very large software repositories, e.g., in Facebook’s monorepository. We combine analysis-based RTS and ML-based RTS by using ML-based RTS to choose a subset of tests selected by analysis-based RTS. To do so, we first design several novel ML-based RTS techniques that leverage mutation analysis to obtain a training set for learning the impact of code changes on test outcomes. Then, we empirically evaluate, using 10 projects, the benefits of combining various ML models with analysis-based RTS. We also compare combining the techniques with using each technique individually. Combining ML-based RTS with two analysis-based RTS techniques - Ekstazi and STARTS - selects 25.34% and 21.44% fewer tests.
  3. Motivation: The question of what combination of attributes drives the adoption of a particular software technology is critical to developers. It determines both those technologies that receive wide support from the community and those which may be abandoned, thus rendering developers' investments worthless. Aim and Context: We model software technology adoption by developers and provide insights on specific technology attributes that are associated with better visibility among alternative technologies. Thus, our findings have practical value for developers seeking to increase the adoption rate of their products. Approach: We leverage social contagion theory and statistical modeling to identify, define, and empirically test measures that are likely to affect software adoption. More specifically, we leverage a large collection of open source repositories to construct a software dependency chain for a specific set of R language source-code files. We formulate logistic regression models, where developers' software library choices are modeled, to investigate the combination of technological attributes that drive adoption among competing data frame (a core concept for data science languages) implementations in the R language: tidy and data.table. To describe each technology, we quantify key project attributes that might affect adoption (e.g., response times to raised issues, overall deployments, number of open defects, knowledge base) and also characteristics of developers making the selection (performance needs, scale, and their social network). Results: We find that a quick response to raised issues, a larger number of overall deployments, and a larger number of high-score StackExchange questions are associated with higher adoption. Decision makers tend to adopt the technology that is closer to them in the technical dependency network and in author collaboration networks while meeting their performance needs.
To gauge the generalizability of the proposed methodology, we investigate the spread of two popular web JavaScript frameworks, Angular and React, and discuss the results. Future work: We hope that our methodology encompassing social contagion that captures both rational and irrational preferences and the elucidation of key measures from large collections of version control data provides a general path toward increasing visibility, driving better informed decisions, and producing more sustainable and widely adopted software.
  4. Configuration changes are among the dominant causes of failures of large-scale software system deployment. Given the velocity of configuration changes, typically at the scale of hundreds to thousands of times daily in modern cloud systems, checking these configuration changes is critical to prevent failures due to misconfigurations. Recent work has proposed configuration testing, Ctest, a technique that tests configuration changes together with the code that uses the changed configurations. Ctest can automatically generate a large number of ctests that can effectively detect misconfigurations, including those that are hard to detect by traditional techniques. However, running ctests can take a long time to detect misconfigurations. Inspired by traditional test-case prioritization (TCP) that aims to reorder test executions to speed up detection of regression code faults, we propose to apply TCP to reorder ctests to speed up detection of misconfigurations. We extensively evaluate a total of 84 traditional and novel ctest-specific TCP techniques. The experimental results on five widely used cloud projects demonstrate that TCP can substantially speed up misconfiguration detection. Our study provides guidelines for applying TCP to configuration testing in practice.
  5. The Tweet Collection Management (TWT) Team aims to ingest 5 billion tweets, clean this data, analyze the metadata present, extract key information, classify tweets into categories, and finally, index these tweets into Elasticsearch to browse and query. The main deliverable of this project is a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects. As a starting point, we focused on two development goals: (1) hashtag-based and (2) username-based search for tweets. For IR1, we completed extraction of two fields within our sample collection: hashtags and username. Sample code for TwiRole, a user-classification program, was investigated for use in our project. We were able to sample from multiple collections of tweets, spanning topics like COVID-19 and hurricanes. Initial work encompassed using a sample collection, provided via Google Drive. An NFS-based persistent storage was later involved to allow access to larger collections. In total, we have developed 9 services to extract key information like username, hashtags, geo-location, and keywords from tweets. We have also developed services to allow for parsing and cleaning of raw API data, and backup of data in an Apache Parquet filestore. All services are Dockerized and added to the GitLab Container Registry. The services are deployed in the CS cloud cluster to integrate services into the full search engine workflow. A service is created to convert WARC files to JSON for reading archive files into the application. Unit testing of services is complete and end-to-end tests have been conducted to improve system robustness and avoid failure during deployment. The TWT team has indexed 3,200 tweets into the Elasticsearch index. Future work could involve parallelization of the extraction of metadata, an alternative feature-flag approach, advanced geo-location inference, and adoption of the DMI-TCAT format.
Key deliverables include a data body that allows for search, sort, filter, and visualization of raw tweet collections and metadata analysis; a running software application for searching tweets and for viewing Twitter collections from Digital Library Research Laboratory (DLRL) event archive projects; and a user guide to assist those using the system.