skip to main content


Title: An infrastructure approach to improving effectiveness of Android UI testing tools
Due to the importance of Android app quality assurance, many Android UI testing tools have been developed by researchers over the years. However, recent studies show that these tools typically achieve low code coverage on popular industrial apps. In fact, given a reasonable amount of run time, most state-of-the-art tools cannot even outperform a simple tool, Monkey, on popular industrial apps with large codebases and sophisticated functionalities. Our motivating study finds that these tools perform two types of operations, UI Hierarchy Capturing (capturing information about the contents on the screen) and UI Event Execution (executing UI events, such as clicks), often inefficiently using UIAutomator, a component of the Android framework. In total, these two types of operations use on average 70% of the given test time. Based on this finding, to improve the effectiveness of Android testing tools, we propose TOLLER, a tool consisting of infrastructure enhancements to the Android operating system. TOLLER injects itself into the same virtual machine as the app under test, giving TOLLER direct access to the app’s runtime memory. TOLLER is thus able to directly (1) access UI data structures, and thus capture contents on the screen without the overhead of invoking the Android framework services or remote procedure calls (RPCs), and (2) invoke UI event handlers without needing to execute the UI events. Compared with the often-used UIAutomator, TOLLER reduces average time usage of UI Hierarchy Capturing and UI Event Execution operations by up to 97% and 95%, respectively. We integrate TOLLER with existing state-of-the-art/practice Android UI testing tools and achieve the range of 11.8% to 70.1% relative code coverage improvement on average. We also find that TOLLER-enhanced tools are able to trigger 1.4x to 3.6x distinct crashes compared with their original versions without TOLLER enhancement. These improvements are so substantial that they also change the relative competitiveness of the tools under empirical comparison. Our findings highlight the practicality of TOLLER as well as raising the community awareness of infrastructure support’s significance beyond the community’s existing heavy focus on algorithms.  more » « less
Award ID(s):
1816615
NSF-PAR ID:
10299903
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
30th ACM SIGSOFT International Symposium on Software Testing and Analysis
Page Range / eLocation ID:
165 to 176
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Despite over a decade of research, it is still challenging for mobile UI testing tools to achieve satisfactory effectiveness, especially on industrial apps with rich features and large code bases. Our experiences suggest that existing mobile UI testing tools are prone to exploration tarpits, where the tools get stuck with a small fraction of app functionalities for an extensive amount of time. For example, a tool logs out an app at early stages without being able to log back in, and since then the tool gets stuck with exploring the app's pre-login functionalities (i.e., exploration tarpits) instead of its main functionalities. While tool vendors/users can manually hardcode rules for the tools to avoid specific exploration tarpits, these rules can hardly generalize, being fragile in face of diverted testing environments and fast app iterations. To identify and resolve exploration tarpits, we propose VET, a general approach including a supporting system for the given specific Android UI testing tool on the given specific app under test (AUT). VET runs the tool on the AUT for some time and records UI traces, based on which VET identifies exploration tarpits by recognizing their patterns in the UI traces. VET then pinpoints the actions (e.g., clicking logout) or the screens that lead to or exhibit exploration tarpits. In subsequent test runs, VET guides the testing tool to prevent or recover from exploration tarpits. From our evaluation with state-of-the-art Android UI testing tools on popular industrial apps, VET identifies exploration tarpits that cost up to 98.6% testing time budget. These exploration tarpits reveal not only limitations in UI exploration strategies but also defects in tool implementations. VET automatically addresses the identified exploration tarpits, enabling each evaluated tool to achieve higher code coverage and improve crash-triggering capabilities. 
    more » « less
  2. Core features (functionalities) of an app can often be accessed and invoked in several ways, i.e., through alternative sequences of user-interface (UI) interactions. Given the manual effort of writing tests, developers often only consider the typical way of invoking features when creating the tests (i.e., the “sunny day scenario”). However, the alternative ways of invoking a feature are as likely to be faulty. These faults would go undetected without proper tests. To reduce the manual effort of creating UI tests and help developers more thoroughly examine the features of apps, we presentRoute, an automated tool for feature-based UI test augmentation for Android apps.Routefirst takes a UI test and the app under test as input. It then applies novel heuristics to find additional high-quality UI tests, consisting of both inputs and assertions, that verify the same feature as the original test in alternative ways. Application ofRouteon several dozen tests for popular apps on Google Play shows that for 96% of the existing tests,Routewas able to generate at least one alternative test. Moreover, the fault detection effectiveness of augmented test suites in our experiments showed substantial improvements of up to 39% over the original test suites.

     
    more » « less
  3. Writing and maintaining UI tests for mobile apps is a time-consuming and tedious task. While decades of research have produced auto- mated approaches for UI test generation, these approaches typically focus on testing for crashes or maximizing code coverage. By contrast, recent research has shown that developers prefer usage-based tests, which center around specific uses of app features, to help support activities such as regression testing. Very few existing techniques support the generation of such tests, as doing so requires automating the difficult task of understanding the semantics of UI screens and user inputs. In this paper, we introduce Avgust, which automates key steps of generating usage-based tests. Avgust uses neural models for image understanding to process video recordings of app uses to synthesize an app-agnostic state-machine encoding of those uses. Then, Avgust uses this encoding to synthesize test cases for a new target app. We evaluate Avgust on 374 videos of common uses of 18 popular apps and show that 69% of the tests Avgust generates successfully execute the desired usage, and that Avgust’s classifiers outperform the state of the art. 
    more » « less
  4. The test suites of an Android app should take advantage of different types of tests including end-to-end tests, which validate user flows, and unit tests, which provide focused executions for debugging. App developers have two main options when creating unit tests: create unit tests that run on a device (either physical or emulated) or create unit tests that run on a development machine’s Java Virtual Machine (JVM). Unit tests that run on a device are not really focused, as they use the full implementation of the Android framework. Moreover, they are fairly slow to execute, requiring the Android system as the runtime. Unit tests that run on the JVM, instead, are more focused and run more efficiently but require developers to suitably handle the coupling between the app under test and the Android framework. To help developers in creating focused unit tests that run on the JVM, we propose a novel technique called ARTISAN based on the idea of test carving. The technique (i) traces the app execution during end-to-end testing on Android devices, (ii) identifies focal methods to test, (iii) carves the necessary preconditions for testing those methods, (iv) creates suitable test doubles for the Android framework, and (v) synthesizes executable unit tests that can run on the JVM. We evaluated ARTISAN using 152 end-to-end tests from five apps and observed that ARTISAN can generate unit tests that cover a significant portion of the code exercised by the end-to-end tests (i.e., 45% of the starting statement coverage on average) and does so in a few minutes. 
    more » « less
  5. Mobile devices are becoming the default platform for multimedia content consumption. Such a thriving business ecosystem has drawn interests from content distributors to develop apps that can reach a large number of audience. The business-edge of content delivery apps crucially relies on being able to effectively arbitrate the purchase and delivery of contents, and govern the access of contents with respect to usage control policies, on a plethora of consumer devices. Content protection on mobile platforms, especially in the absence of Trusted Execution Environment (TEE), is a challenging endeavor where developers often have to resort to ad-hoc deterrence-based defenses. This work evaluates the effectiveness of content protection mechanisms embraced by vendors of content delivery apps, with respect to a hierarchy of adversaries with varying real-world capabilities. Our analysis of 141 vulnerable apps uncovered that, in many cases, due to developers’ unjustified trust assumptions about the underlying technologies, adversaries can obtain unauthorized and unrestricted access to contents of apps, sometimes without even needing to reverse engineer the deterrence-based defenses. Some weaknesses in the apps can also severely impact app users’ security and privacy. All our "findings have been responsibly disclosed to the corresponding app vendors. 
    more » « less