This content will become publicly available on April 4, 2026

Title: “Can You Mimic Me? Exploring the Use of Android Record & Replay Tools in Debugging”
Android User Interface (UI) testing has emerged as an important and prevalent research topic due to the ubiquity of apps and the unique challenges faced by developers in this software domain. One popular topic of research that aims to facilitate both manual and automated UI testing and debugging processes is record and replay (R&R) tools. These tools allow for the recording of UI actions to facilitate the execution of test scenarios and the replay of various types of bugs. R&R tools typically support three main settings: (i) UI regression testing via R&R of feature-based execution scenarios, (ii) R&R of non-crashing functional bugs (e.g., in crowdsourced settings), and (iii) R&R of crashing bugs. Despite the progress made in research related to R&R tools, prior work examined the effectiveness of these tools only in disparate or fragmented settings. As such, the research community currently lacks a comprehensive examination of the effectiveness of existing tools across their common use cases and the potential key limitations that emerge. We address this gap in knowledge by conducting a thorough empirical study on using R&R tools to manually record and replay feature-based user scenarios, non-crashing failures, and crashing bugs. Additionally, we explore the possibility of using R&R tools in conjunction with automated input generation (AIG) tools to automatically record and replay crashing bugs. Our study context includes one industrial and three academic R&R tools, 34 user scenarios from 17 apps, 90 non-crashing failures from 42 Android apps, and 31 crashing bugs from 17 Android apps. Our results illustrate that 17% of user scenarios, 38% of non-crashing failures, and 44% of crashing bugs cannot be reliably recorded and replayed, with the most prevalent reasons for non-replayability being action interval resolution, API incompatibilities, and limitations in Android tooling. Our findings reveal important research directions for R&R tools to facilitate their practical application and adoption.
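To make the R&R mechanism concrete, below is a minimal sketch of the kernel-level event capture that several Android R&R tools build on, using the real `adb shell getevent`/`sendevent` utilities. It is an illustration of the technique, not any of the tools studied in the paper, and the event-device path handling and timing logic are simplified assumptions.

```python
# Minimal record-and-replay sketch using Android's getevent/sendevent.
# Assumes `adb` is on PATH and a single device is connected; input device
# paths (e.g., /dev/input/event2) vary per device and Android version.
import re
import subprocess
import time

def record(duration_s: float, out_path: str) -> None:
    """Capture raw kernel input events for duration_s seconds."""
    with open(out_path, "w") as out:
        proc = subprocess.Popen(
            ["adb", "shell", "getevent", "-t"],  # -t prefixes each event with a timestamp
            stdout=out, stderr=subprocess.DEVNULL,
        )
        time.sleep(duration_s)
        proc.terminate()

def replay(trace_path: str) -> None:
    """Re-inject recorded events, preserving inter-event delays."""
    line_re = re.compile(
        r"\[\s*(?P<ts>[\d.]+)\]\s+(?P<dev>/dev/input/event\d+):"
        r"\s+(?P<type>\w+)\s+(?P<code>\w+)\s+(?P<value>\w+)"
    )
    prev_ts = None
    for line in open(trace_path):
        m = line_re.match(line)
        if not m:
            continue
        ts = float(m["ts"])
        if prev_ts is not None:
            time.sleep(ts - prev_ts)  # best-effort reproduction of action intervals
        prev_ts = ts
        # getevent prints hex fields; sendevent expects decimal integers.
        t, c, v = (int(m[g], 16) for g in ("type", "code", "value"))
        subprocess.run(["adb", "shell", "sendevent", m["dev"], str(t), str(c), str(v)])
```

Note that replaying each event through a separate adb round trip distorts inter-event timing, which is one concrete way the action-interval-resolution problem reported above can arise.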
Award ID(s):
2235136
PAR ID:
10585524
Author(s) / Creator(s):
Publisher / Repository:
Proceedings of the 12th International Conference on Mobile Software Engineering and Systems
Date Published:
Subject(s) / Keyword(s):
UI testing, mobile testing, trace analysis, Android, debugging
Format(s):
Medium: X
Location:
International Conference on Mobile Software Engineering and Systems 2025
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Due to the importance of Android app quality assurance, many Android UI testing tools have been developed by researchers over the years. However, recent studies show that these tools typically achieve low code coverage on popular industrial apps. In fact, given a reasonable amount of run time, most state-of-the-art tools cannot even outperform a simple tool, Monkey, on popular industrial apps with large codebases and sophisticated functionalities. Our motivating study finds that these tools perform two types of operations, UI Hierarchy Capturing (capturing information about the contents on the screen) and UI Event Execution (executing UI events, such as clicks), often inefficiently, using UIAutomator, a component of the Android framework. In total, these two types of operations consume on average 70% of the given test time. Based on this finding, to improve the effectiveness of Android testing tools, we propose TOLLER, a tool consisting of infrastructure enhancements to the Android operating system. TOLLER injects itself into the same virtual machine as the app under test, giving TOLLER direct access to the app's runtime memory. TOLLER is thus able to directly (1) access UI data structures, and thus capture contents on the screen without the overhead of invoking the Android framework services or remote procedure calls (RPCs), and (2) invoke UI event handlers without needing to execute the UI events. Compared with the often-used UIAutomator, TOLLER reduces the average time usage of UI Hierarchy Capturing and UI Event Execution operations by up to 97% and 95%, respectively. We integrate TOLLER with existing state-of-the-art/practice Android UI testing tools and achieve average relative code coverage improvements ranging from 11.8% to 70.1%. We also find that TOLLER-enhanced tools trigger 1.4x to 3.6x as many distinct crashes as their original versions without TOLLER enhancement. These improvements are substantial enough to change the relative competitiveness of the tools under empirical comparison. Our findings highlight the practicality of TOLLER and raise the community's awareness of the significance of infrastructure support, beyond the community's existing heavy focus on algorithms.
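    For context, the baseline operation TOLLER optimizes looks roughly like the following sketch, which times a UI Hierarchy Capturing step performed through the standard `uiautomator dump` service over adb; the parsing and timing code is an illustrative assumption, not part of TOLLER.

```python
# Baseline UI Hierarchy Capturing via the UIAutomator framework service,
# timed to illustrate the per-capture overhead TOLLER is designed to avoid.
# Assumes adb is on PATH; /sdcard/window_dump.xml is the tool's default path.
import subprocess
import time
import xml.etree.ElementTree as ET

def capture_hierarchy() -> ET.Element:
    """Dump the current UI tree over adb and parse it."""
    subprocess.run(
        ["adb", "shell", "uiautomator", "dump", "/sdcard/window_dump.xml"],
        check=True, capture_output=True,
    )
    xml = subprocess.run(
        ["adb", "shell", "cat", "/sdcard/window_dump.xml"],
        check=True, capture_output=True, text=True,
    ).stdout
    return ET.fromstring(xml)

start = time.perf_counter()
root = capture_hierarchy()
elapsed = time.perf_counter() - start
clickable = [n.get("resource-id") for n in root.iter("node")
             if n.get("clickable") == "true"]
print(f"captured {len(clickable)} clickable nodes in {elapsed:.2f}s")
```

    Each such capture crosses process boundaries through framework services and RPCs; reading the UI data structures in-process, as TOLLER does, removes exactly that round trip.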
  2.
    Despite over a decade of research, it is still challenging for mobile UI testing tools to achieve satisfactory effectiveness, especially on industrial apps with rich features and large code bases. Our experiences suggest that existing mobile UI testing tools are prone to exploration tarpits, where a tool gets stuck with a small fraction of app functionalities for an extensive amount of time. For example, a tool logs out of an app at an early stage without being able to log back in, and from then on the tool is stuck exploring the app's pre-login functionalities (i.e., an exploration tarpit) instead of its main functionalities. While tool vendors/users can manually hardcode rules for the tools to avoid specific exploration tarpits, these rules can hardly generalize, being fragile in the face of changing testing environments and fast app iterations. To identify and resolve exploration tarpits, we propose VET, a general approach, including a supporting system, that works with a given Android UI testing tool on a given app under test (AUT). VET runs the tool on the AUT for some time and records UI traces, based on which VET identifies exploration tarpits by recognizing their patterns in the UI traces. VET then pinpoints the actions (e.g., clicking logout) or the screens that lead to or exhibit exploration tarpits. In subsequent test runs, VET guides the testing tool to prevent or recover from exploration tarpits. In our evaluation with state-of-the-art Android UI testing tools on popular industrial apps, VET identifies exploration tarpits that cost up to 98.6% of the testing time budget. These exploration tarpits reveal not only limitations in UI exploration strategies but also defects in tool implementations. VET automatically addresses the identified exploration tarpits, enabling each evaluated tool to achieve higher code coverage and improved crash-triggering capabilities.
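    The trace-pattern idea can be illustrated with a hedged sketch (this is not VET's actual algorithm): flag any long window of the UI trace in which exploration cycles among very few distinct screens. The window size, the distinct-screen threshold, and the screen-ID abstraction are all assumptions.

```python
# Sketch of tarpit detection over a UI trace: flag long windows in which
# the testing tool cycles among very few distinct screens.
from collections import Counter

def find_tarpits(trace, window=200, max_distinct=3):
    """trace: list of screen IDs, one per executed UI event.
    Returns (start, end, screens) spans that look like tarpits."""
    tarpits = []
    i = 0
    while i + window <= len(trace):
        seen = Counter(trace[i:i + window])
        if len(seen) <= max_distinct:
            # Extend the span while the tool stays on the same screens.
            j = i + window
            while j < len(trace) and trace[j] in seen:
                j += 1
            tarpits.append((i, j, set(seen)))
            i = j
        else:
            i += 1
    return tarpits

# Hypothetical output for a tool that logs out at event 500 and never recovers:
# find_tarpits(trace) -> [(500, 10000, {'LoginScreen', 'SplashScreen'})]
```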
  3. Continuous integration (CI) has become a popular method for automating code changes, testing, and software project delivery. However, sufficient testing prior to code submission is crucial to prevent build breaks. Additionally, testing must provide developers with quick feedback on code changes, which requires fast testing times. While regression test selection (RTS) has been studied to improve the cost-effectiveness of regression testing for lower-level tests (i.e., unit tests), it has not been applied to the testing of user interfaces (UIs) in application domains such as mobile apps. Testing at the UI level requires different techniques, such as impact analysis and automated test execution. In this paper, we examine the use of RTS in CI settings for UI testing across various open-source mobile apps. Our analysis focuses on using Frequency Analysis to understand the need for RTS, Cost Analysis to evaluate the cost of impact analysis and test case selection algorithms, and Test Reuse Analysis to determine the reusability of UI test sequences for automation. The insights from this study will guide practitioners and researchers in developing advanced RTS techniques that can be adapted to CI environments for mobile apps.
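    A minimal sketch of the selection step such a UI-level RTS technique would perform is shown below; the coverage map and change list are assumed outputs of an impact-analysis phase, not artifacts of this paper's toolchain.

```python
# Regression test selection sketch: keep only the UI test sequences whose
# previously recorded coverage intersects the classes changed in a commit.
from typing import Dict, List, Set

def select_tests(
    coverage: Dict[str, Set[str]],   # test name -> classes it exercised
    changed: Set[str],               # classes modified in this commit
) -> List[str]:
    selected = [t for t, cov in coverage.items() if cov & changed]
    # Fall back to the full suite if nothing maps (e.g., build-file edits).
    return selected or list(coverage)

coverage = {
    "test_checkout": {"CartActivity", "PaymentFragment"},
    "test_browse":   {"CatalogActivity"},
}
print(select_tests(coverage, changed={"PaymentFragment"}))  # ['test_checkout']
```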
  4. As mobile apps are used extensively in our daily lives, their responsiveness has become an important factor that can negatively impact the user experience. The long response time of a mobile app can have a variety of causes, including soft hang bugs or prolonged user interface APIs (UI-APIs). While hang bugs have been researched extensively before, our investigation of UI-APIs in today's mobile OSes finds that the recursive construction of the UI view hierarchy can often be time-consuming, due to the complexity of today's UI views. To accelerate UI processing, such complex views can be pre-processed and cached before the user even visits them. However, pre-caching every view in a mobile app is infeasible due to the incurred overheads in time, energy, and cache space. In this paper, we propose MAPP, a framework for Mobile App Predictive Pre-caching. MAPP has two main modules, 1) UI view prediction based on deep learning and 2) UI-API pre-caching, which coordinate to improve the responsiveness of mobile apps. MAPP adopts a per-user and per-app prediction model that is tailored based on the analysis of collected user traces, such as location, time, or the sequence of previously visited views. A dynamic feature ranking and model selection algorithm is designed to judiciously filter out less relevant features, improving prediction accuracy with less computation overhead. MAPP is evaluated with 61 real-world traces from 18 volunteers over 30 days, showing that it can shorten the response time of mobile apps by 59.84% on average with an average cache hit rate of 92.55%.
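    As a deliberately simplified stand-in for MAPP's prediction module (MAPP itself uses deep learning over richer features such as location and time), the sketch below trains a first-order Markov model on a view-visit trace and returns the top-k next views to pre-cache.

```python
# Simplified next-view prediction: a first-order Markov model over a
# single user's view-visit trace, used to pick top-k views to pre-cache.
from collections import Counter, defaultdict

class NextViewPredictor:
    def __init__(self):
        self.transitions = defaultdict(Counter)

    def fit(self, trace):
        """Count observed view-to-view transitions in the trace."""
        for cur, nxt in zip(trace, trace[1:]):
            self.transitions[cur][nxt] += 1

    def predict(self, current_view, k=2):
        """Return up to k most likely next views to pre-cache."""
        return [v for v, _ in self.transitions[current_view].most_common(k)]

model = NextViewPredictor()
model.fit(["Home", "Feed", "Detail", "Feed", "Detail", "Home", "Feed"])
print(model.predict("Feed"))  # ['Detail']
```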
  5. Wearable devices, such as smartwatches and fitness trackers, are growing in popularity, creating a need for application developers to adapt or extend a UI, typically from a smartphone, onto these devices. Wearables generally have a smaller form factor than a phone; thus, porting an app to the watch necessarily involves reworking the UI. An open problem is identifying best practices for adapting UIs to wearable devices. This paper contributes a study and data set on the state of practice in UI adaptation for wearables. We automatically extract UI designs from a set of 101 popular Android apps that have both a phone and a watch version, and manually label how each UI element, as well as each screen in the app, is translated from the phone to the wearable. The paper identifies trends in adaptation strategies and presents design guidelines. We expect that the UI adaptation strategies identified in this paper can have wide-ranging impacts for future research and for identifying best practices in this space, such as grounding future user studies that evaluate which strategies improve user satisfaction, or automatically adapting UIs.
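    A hedged sketch of the kind of automated UI extraction such a study depends on: inventory the widget classes in a phone dump and a watch dump (both producible with `uiautomator dump`) and report which element types were dropped in the port. The file names and the class-level comparison are illustrative assumptions, not the paper's actual pipeline.

```python
# Compare widget-class inventories between a phone and a watch UI dump.
import xml.etree.ElementTree as ET
from collections import Counter

def widget_inventory(dump_path: str) -> Counter:
    """Count UI element classes in a uiautomator XML dump."""
    root = ET.parse(dump_path).getroot()
    return Counter(n.get("class") for n in root.iter("node"))

phone = widget_inventory("phone_dump.xml")   # hypothetical input files
watch = widget_inventory("watch_dump.xml")
dropped = {cls: n for cls, n in phone.items() if cls not in watch}
print("element classes dropped on the watch:", dropped)
```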