<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Tyche: Making Sense of Property-Based Testing Effectiveness</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>10/11/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10585033</idno>
					<idno type="doi">10.1145/3654777.3676407</idno>
					
					<author>Harrison Goldstein</author><author>Jeffrey Tao</author><author>Zac Hatfield-Dodds</author><author>Benjamin C Pierce</author><author>Andrew Head</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Software developers increasingly rely on automated methods to assess the correctness of their code. One such method is property-based testing (PBT), wherein a test harness generates hundreds or thousands of inputs and checks the outputs of the program on those inputs using parametric properties. Though powerful, PBT induces a sizable gulf of evaluation: developers need to put in nontrivial effort to understand how well the different test inputs exercise the software under test. To bridge this gulf, we propose Tyche, a user interface that supports sensemaking around the effectiveness of property-based tests. Guided by a formative design exploration, our design of Tyche supports developers with interactive, configurable views of test behavior with tight integrations into modern developer testing workflow. These views help developers explore global testing behavior and individual test inputs alike. To accelerate the development of powerful, interactive PBT tools, we define a standard for PBT test reporting and integrate it with a widely used PBT library. A self-guided online usability study revealed that Tyche’s visualizations help developers to more accurately assess software testing effectiveness.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ABSTRACT</head><p>Software developers increasingly rely on automated methods to assess the correctness of their code. One such method is property-based testing (PBT), wherein a test harness generates hundreds or thousands of inputs and checks the outputs of the program on those inputs using parametric properties. Though powerful, PBT induces a sizable gulf of evaluation: developers need to put in nontrivial effort to understand how well the different test inputs exercise the software under test. To bridge this gulf, we propose Tyche, a user interface that supports sensemaking around the effectiveness of property-based tests. Guided by a formative design exploration, our design of Tyche supports developers with interactive, configurable views of test behavior with tight integrations into modern developer testing workflow. These views help developers explore global testing behavior and individual test inputs alike. To accelerate the development of powerful, interactive PBT tools, we define a standard for PBT test reporting and integrate it with a widely used PBT library. A self-guided online usability study revealed that Tyche's visualizations help developers to more accurately assess software testing effectiveness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Software developers work hard to build systems that behave as intended. But software is rarely 100% correct when first implemented, so developers also write tests to validate their work, detect bugs, and check that bugs stay fixed. Traditionally, these tests take the form of manually written "example-based" tests, where developers write out specific sample inputs together with expected outputs; but this process is labor-intensive and can miss bugs in cases the programmer did not think to check. Instead, some programmers have adopted automated techniques to supplement or replace example-based tests. One such technique is property-based testing (PBT), which automatically samples many program inputs from a random distribution and checks, for each one, that the system's behavior satisfies a set of developer-provided properties. Used well, this leads to testing that is more thorough and less laborious; indeed, PBT has proven effective in identifying subtle bugs in a wide variety of real-world settings, including telecommunications software <ref type="bibr">[3]</ref>, replicated file and key-value stores <ref type="bibr">[6,</ref><ref type="bibr">41]</ref>, automotive software <ref type="bibr">[4]</ref>, and other complex systems <ref type="bibr">[40]</ref>.</p><p>Of course, automation comes with tradeoffs, and PBT is no exception. In PBT, there are often too many randomly generated test inputs for a developer to understand at once, creating a gulf of evaluation <ref type="bibr">[67]</ref> for test suite quality. Indeed, in a recent study of the human factors of PBT <ref type="bibr">[25]</ref>, developers reported having difficulty understanding what was really being tested.</p><p>For example, suppose a developer is testing some mathematical function using randomly generated floating-point numbers. The developer might have a variety of questions about their test suite quality. 
They might ask if the distribution is broad enough (e.g., is it stuck between 0 and 1), or too broad (e.g., does it span all possible floats, even if the function can only take positive ones). Or they may wonder if the distribution misses corner cases like 0 or -1. Perhaps most importantly, they may want to know if the data generator produces too many malformed or invalid test inputs (e.g., NaN) that cannot be used for testing at all. State-of-the-art PBT frameworks do not give developers adequate tools for answering these kinds of questions: any of these erroneous situations could go unnoticed because necessary information is not apparent to the user. As a result, developers may not realize that their tests are not thoroughly exercising some important system behaviors.</p><p>This gulf of evaluation presents an opportunity to rethink user interfaces for testing. HCI has made strides in helping developers make sense of large amounts of structured program data, whether by revealing patterns that manifest in many programs <ref type="bibr">[20,</ref><ref type="bibr">22,</ref><ref type="bibr">32,</ref><ref type="bibr">93]</ref> or comparing the behavior of program variants <ref type="bibr">[83,</ref><ref type="bibr">86,</ref><ref type="bibr">96]</ref>. As developers adopt PBT, it is critical to tackle the related problem of helping programmers understand a summary of hundreds or thousands of executions of a single test.</p><p>To address this problem, we propose Tyche, <ref type="foot">1</ref> an interface that supports sensemaking and exploration for distributions of test inputs. Tyche's design was inspired by a review of recent PBT usability research and refined through iterative design with the help of expert PBT users; this refinement identified design principles that clarify the information needs of PBT users. 
Tyche provides users with an ensemble of visualizations, and, while each individual visualization is well-understood by UI researchers, the specific combination of visualizations is novel and fine-tuned to the PBT setting. Tyche's visualizations provide high-level insight about the distribution of test inputs as well as various aspects of test efficiency (see Figure <ref type="figure">1</ref>). Tyche also supports visualization and rapid drill-down into input data, taking advantage of existing hooks in PBT libraries.<note place="foot" n="1">Named after the Greek goddess of randomness.</note></p><p>To understand whether Tyche actually changes how developers understand their tests, we conducted a 40-participant self-guided, online study. In this study, participants were asked to view test distributions and rank them according to their power to identify program defects. Compared to using standard tools, Tyche helped developers make better judgments about their test distributions.</p><p>To encourage broad adoption of Tyche, we define OpenPBTStats, a standard format for reporting results of PBT. When a PBT framework generates data in this format, its results can be viewed in Tyche and perhaps other interfaces supporting the same standard in the future. We integrated OpenPBTStats into the main branch of Hypothesis <ref type="bibr">[55]</ref>, the most widely-used PBT framework, showing the way forward for other frameworks.</p><p>After discussing background ( &#167;2) and related work ( &#167;3), we offer the following contributions:</p><p>&#8226; We articulate design considerations for Tyche, motivated by a formative study with experienced PBT users. ( &#167;4) &#8226; We detail the design of Tyche, an interface that helps developers evaluate the quality of their testing with an ensemble of visualizations fine-tuned to PBT with lightweight affordances to support exploration. 
( &#167;5) &#8226; We define the OpenPBTStats format for collecting and reporting PBT data to help standardize testing evaluation across different PBT frameworks. ( &#167;6) &#8226; We evaluate Tyche in an online study, demonstrating that Tyche guides developers to significantly better assessments of test suite effectiveness. ( &#167;7) We conclude with directions for future work ( &#167;8), including other automated testing disciplines that can benefit from Tyche and related interfaces.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BACKGROUND</head><p>We begin by describing property-based testing and reviewing what is known about its usability and contexts of use.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Property-Based Testing</head><p>In traditional unit testing, developers think up examples that demonstrate the ways a function is supposed to behave, writing one test for each envisioned behavior. For example, this unit test checks that inserting a key "k" and value 0 into a red-black tree <ref type="bibr">[90]</ref> and then looking up "k" results in the value 0:</p><p>def test_insert_example():
    t = Empty()
    assert lookup(insert("k", 0, t), "k") == 0</p><p>If one wanted to test more thoroughly, they could painstakingly write dozens of tests like this for many different example trees, keys, and values. Property-based testing offers an alternative, succinct way to express many tests at once:</p><p>@given(trees(), integers(), integers())
def test_insert_lookup(t, k, v):
    assume(is_red_black_tree(t))
    assert lookup(insert(k, v, t), k) == v</p><p>This test is written in Hypothesis <ref type="bibr">[55]</ref>, a popular PBT library in Python. It randomly generates triples of trees, keys, and values, and for each triple, checks a parameterized property that resembles a unit-test assertion: that the inserted value is in the tree. This single test specification represents a massive collection of concrete individual tests, and using it can lead to more thorough testing (compared to a unit test suite), since the random generator may produce examples the user had not thought of.</p></div>
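To make the property above concrete, the following is a self-contained sketch of the testing loop it induces, using a plain binary search tree in place of a red-black tree and Python's random module in place of Hypothesis's generators (Empty, Node, insert, lookup, and gen_tree are all hypothetical stand-ins, not Hypothesis code):

```python
import random

class Empty:
    """An empty tree."""

class Node:
    def __init__(self, left, key, value, right):
        self.left, self.key, self.value, self.right = left, key, value, right

def insert(key, value, t):
    # Persistent BST insert (no rebalancing, unlike a real red-black tree).
    if isinstance(t, Empty):
        return Node(Empty(), key, value, Empty())
    if key < t.key:
        return Node(insert(key, value, t.left), t.key, t.value, t.right)
    if key > t.key:
        return Node(t.left, t.key, t.value, insert(key, value, t.right))
    return Node(t.left, key, value, t.right)  # overwrite existing key

def lookup(t, key):
    if isinstance(t, Empty):
        return None
    if key < t.key:
        return lookup(t.left, key)
    if key > t.key:
        return lookup(t.right, key)
    return t.value

def gen_tree(size=5):
    # Naive random generator: build a tree from a few random insertions.
    t = Empty()
    for _ in range(random.randint(0, size)):
        t = insert(random.randint(-10, 10), random.randint(-10, 10), t)
    return t

# The core PBT loop: many random inputs, one parametric assertion.
for _ in range(1000):
    t, k, v = gen_tree(), random.randint(-10, 10), random.randint(-10, 10)
    assert lookup(insert(k, v, t), k) == v
```

A single parametric assertion, checked against a thousand sampled inputs, replaces a thousand hand-written unit tests.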
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">PBT Process and Pitfalls</head><p>At its core, the practice of PBT involves three distinct steps: defining executable properties, constructing random input generators, <ref type="foot">2</ref> and reviewing the result of checking these properties against a large number of sampled inputs; challenges can arise at any of these stages. There is significant technical research into each of the first two stages <ref type="bibr">[26, 48, 50, 52, etc.]</ref>. We are focused here on the third stage: helping developers review the results of testing, in part to support the (often iterative) process of refining and improving the generators constructed in the second step. For instance, in the example above, a developer might accidentally write a trees() generator that only produces the Empty() tree, in which case their property will be checked only against a single test input (over and over). Or, if the generator's strategy is not quite so broken but still too na&#239;ve, it might fail to produce very many trees that actually pass the assume(is_red_black_tree(t)) guard.</p><p>In cases like these, developers need to remember that, although all their tests are succeeding, this does not necessarily mean their code is correct <ref type="bibr">[53]</ref>: they may need to improve their generators to start seeing failing tests. Unfortunately, with conventional PBT tools, developers may feel they don't have easy access to this knowledge <ref type="bibr">[25]</ref>. While the programming languages community is continually developing better techniques for generating well-distributed inputs <ref type="bibr">[28, 52, 62, 80, etc.]</ref>, developers still need to be able to check that the generators they are using are actually fit for the job.</p></div>
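The discard pitfall is easy to quantify. The sketch below (all names hypothetical; sortedness stands in for the much stricter red-black invariant) counts how many outputs of a naive generator would survive an assume-style validity guard:

```python
import random

def gen_list():
    # Naive generator: random lists, ignoring the property's precondition.
    return [random.randint(0, 9) for _ in range(random.randint(0, 6))]

def is_valid(xs):
    # Stand-in precondition, analogous to is_red_black_tree above.
    return xs == sorted(xs)

trials = 10_000
valid = sum(is_valid(gen_list()) for _ in range(trials))
print(f"{valid}/{trials} generated inputs passed the validity guard")
```

With this generator, most inputs of nontrivial length are discarded, so the property is checked mainly against very short (and thus less interesting) lists.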
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">RELATED WORK</head><p>In this section we situate our work on Tyche within the larger area of programming tools research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Current Affordances</head><p>What support do PBT frameworks provide today for developers to inspect test input distributions? We surveyed the state of practice in the most popular PBT frameworks (by GitHub stars) across six different languages: Python <ref type="bibr">[55]</ref>, TypeScript / JavaScript <ref type="bibr">[17]</ref>, Rust <ref type="bibr">[19]</ref>, Scala <ref type="bibr">[66]</ref>, Java <ref type="bibr">[37]</ref>, and Haskell <ref type="bibr">[12]</ref>. These frameworks provide users with the following kinds of information. (A detailed comparison of framework features can be found in Appendix A.)</p><p>Raw Examples. All of these frameworks could print generated inputs to the terminal. Some (3/6) provided a flag or option to do so; the others did not provide this feature natively, although users might simply print examples to the terminal themselves.</p><p>Number of Tests Run vs. Discarded. Many frameworks (4/6) report how many examples were run vs. discarded (because they did not pass a quality filter). Sometimes (2/6), this information is hidden behind a command line flag.</p><p>Event / Label Aggregation. Many frameworks (4/6) could report aggregates of user-defined features of the examples, e.g., lengths of generated lists. Information about such features typically appeared in a simple textual list or table, as in this example from QuickCheck <ref type="bibr">[81]</ref>:</p><p>7% length of input is 7
6% length of input is 3
5% length of input is 4
...</p><p>where this output conveys that among the generated lists for some test run, 7% were 7 elements long, 6% were 3 elements long, etc.</p><p>Time. One framework reported how long the test run took.</p><p>Warnings. 
One framework provided warnings about test distributions, in particular warning users when their generators produced a very high proportion of discarded examples.</p><p>The affordances for evaluation in existing frameworks are situationally useful, but inconsistently implemented and incomplete. In &#167;5 and &#167;6 we discuss how Tyche improves on the state of the art.</p></div>
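Label aggregation of this kind requires no special framework machinery; a few lines of Python reproduce the QuickCheck-style report shown above (the generator and the label text are illustrative):

```python
import random
from collections import Counter

n = 1000
labels = Counter()
for _ in range(n):
    xs = [random.random() for _ in range(random.randint(0, 10))]
    labels[f"length of input is {len(xs)}"] += 1  # one label per input

# Report the most common labels as percentages, QuickCheck-style.
for label, count in labels.most_common(3):
    print(f"{100 * count / n:.0f}% {label}")
```

The limitation discussed in this paper is not computing such aggregates but presenting them: a flat textual list like this is easy to produce and easy to overlook.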
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Interactive Tools for Testing</head><p>Some of the earliest research on improved interfaces for testing focused on spreadsheets. Rothermel et al. <ref type="bibr">[72]</ref> proposed a model of testing called "what you see is what you test" (WYSIWYT), wherein users "test" their spreadsheet by checking values that they see and marking them as correct. This approach has appeared in many domains of programming, including visual dataflow <ref type="bibr">[45]</ref> and screen transition languages <ref type="bibr">[7]</ref>. Complementary to WYSIWYT are features that encourage programmers' curiosity <ref type="bibr">[91]</ref>, for instance by detecting and calling attention to likely anomalies <ref type="bibr">[61,</ref><ref type="bibr">91]</ref>. Many of the testing tools developed by the HCI community have sought to accelerate manual testing with rich, explorable traces of program behavior <ref type="bibr">[8,</ref><ref type="bibr">15,</ref><ref type="bibr">57,</ref><ref type="bibr">58,</ref><ref type="bibr">68]</ref>. These tools instrument a program, record its behavior during execution, and then provide visualizations of data and augmentations to source code to help programmers pinpoint what is going wrong in their code. Tools can also help programmers create automated tests from user demonstrations. For instance, Sikuli Test <ref type="bibr">[10]</ref> lets application developers create automated tests of interfaces by demonstrating a usage flow with the interface and then entering assertions of what interface elements should or should not be on the screen at the end of the flow.</p><p>Recent research has explored new ways to bring users into the loop of randomized testing. One research system, NaNoFuzz <ref type="bibr">[14]</ref>, shows programmers examples of program outputs and helps them to notice problematic results like NaN or crash failures. 
NaNoFuzz is superficially the closest comparison available for Tyche, but the two serve different, complementary purposes. NaNoFuzz's strengths reside in calling attention to failures; Tyche's strengths reside in exposing patterns in input distributions. One could imagine a user leveraging both in concert during the testing process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Making Sense of Program Executions</head><p>In a broad sense, Tyche's aim is to help developers reason about the behavior of a program across many executions. This problem has been explored by the HCI community. Tools have been developed to reveal the behavior of a program over many synthesized examples <ref type="bibr">[95]</ref>, and of an expression over many loops <ref type="bibr">[34,</ref><ref type="bibr">44,</ref><ref type="bibr">54,</ref><ref type="bibr">77]</ref>. The problem of understanding input distributions has been of interest in the area of AI interpretability, where tools have been built to support inspection of input distributions and corresponding outputs (e.g., <ref type="bibr">[35,</ref><ref type="bibr">36]</ref>). Tyche's aim is to tailor data views and exploration mechanisms to tightly fit the concerns and context of randomized testing with professional-grade software and potentially-complex inputs (e.g., logs, trees).</p><p>Prior work has sought to help programmers make sense of similarities and differences across sets of programs. Some of these tools cluster programs on the basis of syntax, semantics, or structure <ref type="bibr">[21,</ref><ref type="bibr">23,</ref><ref type="bibr">32,</ref><ref type="bibr">94]</ref>. Others highlight differences in the source and/or behavior of program variants <ref type="bibr">[73,</ref><ref type="bibr">83,</ref><ref type="bibr">86,</ref><ref type="bibr">96]</ref>. Tyche itself does some lightweight clustering of test cases (in this case, input examples), and affordances for program differencing could be brought to Tyche to help programmers pinpoint where some instantiations of a property fail and others succeed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Formal Methods in the Editor</head><p>Property-based testing can be seen as a kind of lightweight formal method <ref type="bibr">[92]</ref>, in that it allows programmers to specify precisely the behavior of their program and then verify that the specification is satisfied. Tyche joins a family of research projects that bring formal methods into the interactive editing experience, whether to support repetitive edits <ref type="bibr">[56,</ref><ref type="bibr">65]</ref>, code search <ref type="bibr">[63]</ref>, program synthesis <ref type="bibr">[16,</ref><ref type="bibr">70,</ref><ref type="bibr">87,</ref><ref type="bibr">95]</ref>, or bidirectional editing of programs and outputs <ref type="bibr">[33]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">FORMATIVE RESEARCH</head><p>Our design for Tyche is informed by formative research into the user experience of PBT. Below, we describe our methods for formative research ( &#167;4.1), followed by a crystallization of user needs ( &#167;4.2) and a set of design considerations for Tyche ( &#167;4.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Methods</head><p>To better understand what developers need when evaluating their PBT distributions, we drew on our recently published related work and then iterated with design feedback from users.</p><p>4.1.1 Review of related work. Our baseline understanding of user needs relating to evaluating testing effectiveness came from our recent study <ref type="bibr">[25]</ref> on the human factors of PBT.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">Iterative design feedback.</head><p>As we developed Tyche, we continually sought and integrated feedback on its design from experienced users of PBT. We recruited 5 such users through X (formerly Twitter) and our personal networks. We refer to them as P1-P5.</p><p>For each of these users, we conducted a 1-hour observation and interview session. Each session was split into two parts. In the first part, participants showed us PBT tests they had written, described those tests, and answered questions about how they evaluate (or could evaluate) whether those tests are effective. In the second part, participants installed our then-current prototype and used it to explore the effectiveness of their own tests. <ref type="foot">3</ref> Study sessions were staggered throughout the design process. We altered the design to incorporate feedback after each session.</p><p>Initial prototype. All Tyche prototypes were developed as VSCode <ref type="bibr">[60]</ref> extensions. All prototypes focused on providing visual summaries of PBT data in a web view pane in the editor. The very first prototype was informed by observations from a previous study from some of the authors <ref type="bibr">[25]</ref> and from our experiences using and building PBT tools. It was published at UIST 2023 as a demo <ref type="bibr">[24]</ref>, and summarized the following aspects of test data:</p><p>&#8226; Number of Unique Inputs. New PBT users are sometimes surprised that their test harness produces duplicate data. Knowing how many unique inputs were tested is therefore one important signal of the test harness's efficiency. &#8226; Proportion of Valid Inputs. As discussed in &#167;2.2, PBT test harnesses sometimes discard data that does not satisfy necessary preconditions. Users need to know how much of the data is discarded and how much is kept. 
&#8226; Size Distribution. Users need to keep track of the size of each individual program input used for testing. It is commonly believed in the PBT community that software can be tested well by exhaustive sets of small inputs (i.e., the small scope hypothesis <ref type="bibr">[2]</ref>), and alternatively, that large tests have a combinatorial advantage <ref type="bibr">[39]</ref> in finding more bugs. <ref type="foot">4</ref> Whichever viewpoint a tester subscribes to, it is important to know the sizes of inputs.</p><p>Analysis. Interviews were automatically transcribed by video conferencing software, <ref type="foot">5</ref> and analyzed via thematic analysis <ref type="bibr">[5]</ref>.</p></div>
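The three summary statistics the initial prototype showed are cheap to compute from a run's raw samples. A sketch, with a hypothetical generator and validity predicate (tuples are used for hashability):

```python
import random
from collections import Counter

def gen_input():
    # Hypothetical generator of small integer tuples.
    return tuple(random.randint(0, 3) for _ in range(random.randint(0, 4)))

def is_valid(xs):
    # Stand-in precondition, as in the assume(...) guard of Section 2.2.
    return xs == tuple(sorted(xs))

samples = [gen_input() for _ in range(1000)]

unique_inputs = len(set(samples))                            # unique inputs
valid_fraction = sum(map(is_valid, samples)) / len(samples)  # proportion valid
size_distribution = Counter(len(xs) for xs in samples)       # size histogram

print(unique_inputs, round(valid_fraction, 2), dict(size_distribution))
```

The interesting design question, which the rest of the paper takes up, is not computing these numbers but presenting them so that problems are noticed at a glance.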
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Testing Goals and Strategies</head><p>The first result of our formative research was a clarification of PBT users' goals and strategies when they were attempting to determine the effectiveness of their tests. One might imagine that testing effectiveness could be measured by the proportion of bugs found, but this is a fantastical measure: if we had it, we would know what all the bugs are and wouldn't need to do any testing! As we found in our study sessions, developers pay attention to proxy metrics to gain confidence in their test suites. Ideally, PBT tools will surface these metrics. Here, we discuss the various metrics that developers paid attention to and how they measured them.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Test Input Distribution.</head><p>Participants reported checking that their distributions covered potential edge cases like x = 0, x = -1, or x = Integer.MAX_VALUE (P2), which are widely understood as bug-triggering values. They also checked that their distributions covered regions in the input space like x = 0, x &lt; 0, and x &gt; 0 (P2, P4, P5); this kind of coverage is similar to notions of "combinatorial coverage" discussed in the literature <ref type="bibr">[27,</ref><ref type="bibr">49]</ref>.</p><p>Multiple participants (P1, P3, P4) wanted to know that their test data was realistic. Their justification was that the most important bugs are the ones that users were likely to hit. Another participant (P5) wanted their test data to be uniformly distributed across a space of values. They thought that this would make it easier for them to estimate the probability that there was still a bug in the program. Whether to test with realistic or uniform distributions is a topic of debate in the literature, with some tools favoring uniformity <ref type="bibr">[11,</ref><ref type="bibr">62]</ref> and others realism <ref type="bibr">[78]</ref>. In either case, developers should be able to see the shape of the distribution.</p><p>Participants used a combination of strategies to review these proxy metrics of test quality. Some (P1, P3, P5) read through the list of examples. Others (P2, P5) described using evaluation tools already present in their PBT framework of choice; one participant used events in Hypothesis and another used labels in QuickCheck, both to understand coverage of attributes of interest (e.g., how often does a particular variant of an enumerated type appear). As we show in &#167;3, while some PBT frameworks provide views of distributions of user-defined input features, they are difficult to interpret at a glance and can easily get drowned out among other terminal messages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Coverage of the System Under Test.</head><p>Three participants (P2, P4, P5) mentioned that coverage of the system under test (e.g., line coverage, branch coverage, etc.) was an important proxy metric. Two participants (P2, P4) reported actually measuring code coverage via code instrumentation, although P2 did point out the potential limitations of code coverage (calling it "necessary but not sufficient"). This view is supported by the literature, which suggests that coverage alone does not guarantee testing success <ref type="bibr">[47]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Test Performance.</head><p>Finally, two participants (P1, P2) discussed timing performance as an important proxy for testing effectiveness. They argued that they have a limited time in which to run tests (e.g., because the tests run every time a new commit is pushed or even every time a file is saved), so faster tests (more examples per second) will exercise the system better. They measured performance with the help of tools built into the PBT framework.</p><p>Besides these metrics, participants also expressed being more confident in their tests when they understood them (P3), when they had failed previously (P3), and when a sufficiently large (for some definition of large) number of examples had been executed (P1, P4).</p></div>
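The examples-per-second proxy that these participants described can be measured directly by timing the test loop; this sketch uses a trivial stand-in generator and property rather than any particular framework's built-in timer:

```python
import random
import time

n = 10_000
start = time.perf_counter()
for _ in range(n):
    x = random.random()    # stand-in generator
    assert 0.0 <= x < 1.0  # trivial stand-in property
elapsed = time.perf_counter() - start

print(f"{n / elapsed:.0f} examples per second")
```

Holding wall-clock budget fixed (e.g., one CI run), a higher examples-per-second rate means more of the input space gets exercised per commit.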
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Design Considerations</head><p>Our formative research further clarified what is required of usable tools for understanding PBT effectiveness. We describe our learnings here as five design considerations for Tyche:</p><p>Visual Feedback. Our goal to provide better visual feedback from testing arose from our prior work and was validated by participants. Most participants (P1, P2, P3, P5) appreciated the interface's visual charts, stating that the visual charts are "a lot easier to digest than" the built-in statistics printed by Hypothesis (P2). The previous section ( &#167;4.2) clarifies the specific proxy metrics that developers were interested in visualizing.</p><p>Workflow Integration. Our initial prototype was built to have tight editor integration. It could be installed into VSCode in one step, and updated live as code changed. But, while some participants validated this choice (P1 and P2), another said they were "not always a big fan of extensions" because they use a non-VSCode IDE at work (P4). For that participant, an editor extension actually discourages use. We therefore refocused on workflow integration instead of editor integration, and re-architected our design so that it could plug into other editing workflows.</p><p>Customizability. Participants found that the default set of visualizations was a good start (P1, P2, P3, P5), but they also suggested a slew of other visualizations that they thought might improve their testing evaluation. Many of these visualizations (e.g., code coverage (P1, P3, P5) and timing (P1); see &#167;5.2) were integrated into Tyche. What we could not do was add views that summarized the interesting attributes of each person's data: every testing domain was different. 
Thus, tools should be customizable so developers can acquire visual feedback for the information that is important in their testing domain.</p><p>Details on Demand. Almost all participants (P1, P2, P3, P4) expressed a desire to dig deeper into the visualizations they were presented. When a visualization did not immediately look as expected, participants wanted to inspect the underlying data to see where their assumptions had failed. This means that Tyche should provide ways for developers to look deeper into the details of the data that is being displayed by the visual interfaces.</p><p>Standardization. Participants used PBT in multiple programming languages, including Python (P1, P2, P3, P4), Java (P4), and Haskell (P5). We posit that to improve the testing experience for all of these languages and their PBT frameworks without significant duplicated effort, Tyche needs to standardize the way it communicates with PBT frameworks. Since PBT frameworks largely implement the same test loop, despite superficial implementation differences, this standardization seems technically feasible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">SYSTEM</head><p>In this section, we describe the design of Tyche, addressing the considerations we described in &#167;4.3. We describe the interaction model that we imagine for Tyche ( &#167;5.1), Tyche's visual displays that answer PBT users' questions ( &#167;5.2), and integrations with PBT frameworks that support easy configuration of displays ( &#167;5.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Interaction Model</head><p>We envision user interactions with Tyche to follow roughly the steps outlined in Figure <ref type="figure">2</ref>.</p><p>(1) At the start of the loop, the developer runs their tests, and the test framework (e.g., Hypothesis) collects relevant data into an OpenPBTStats log (we discuss the details of this format in &#167;6). (2) Once the data has been logged, the user sees Tyche render an interface with a variety of visualizations (see &#167;5.2). (3) The user interacts with the interface. This may be as simple as seeing a visualization and immediately noticing that something is wrong, but they may also explore the views to seek details about surprising results or generate hypotheses about what might need to change in their test suite. If the user is happy with the quality of the test suite at this point, they may finish their testing session. (4) Finally, the user can customize their Tyche visualizations or make changes to their test (e.g., random generation strategies or Hypothesis parameters) before reentering the loop.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2</head><p>Figure 2: The Tyche interaction loop: (1) run tests to generate a data file, (2) the Tyche view renders, (3) explore testing effectiveness, (4) change the test suite and add events.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Visual Feedback</head><p>The Tyche interface presents the user with a novel ensemble of visualizations that are fine-tuned to the PBT setting and enriched with lightweight affordances to support exploration. We describe these visualizations in the context of the kinds of questions they answer for developers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">How many meaningful tests were run?</head><p>Perhaps the most important thing for a developer to know about a test run is how many meaningful examples were tested. Tyche communicates this information through the "Sample Breakdown" chart:</p><p>The chart communicates a high-level understanding of how many test inputs were sampled versus how many were "valid" to test with. Ideally, the entire chart would be taken up by the dark green "Unique / Passed" bar. If the "Invalid" bars are a large portion of the chart's height or the "Duplicate" bars are a large portion of the width, the developer can see that it might be worth investing time in a generation strategy that is better calibrated for the property at hand. (If any tests had failed, there would be two more horizontal bars with the label "Failed. ")</p><p>The use of a mosaic chart <ref type="bibr">[29]</ref> here allows Tyche to communicate information about validity and uniqueness in a single chart. We chose this chart after feedback from study participants suggested that seeing validity and uniqueness metrics separately made it hard to tell when and how they overlapped.</p></div>
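To make the chart's semantics concrete, the tallying behind a sample breakdown can be sketched in a few lines of Python. The record fields follow the OpenPBTStats names described in &#167;6.2, but the helper itself is our illustration, not Tyche's actual implementation:

```python
from collections import Counter

def sample_breakdown(examples):
    """Tally examples into the mosaic chart's cells: valid vs. invalid
    (status "discarded") crossed with unique vs. duplicate inputs."""
    seen = set()
    counts = Counter()
    for ex in examples:
        validity = "invalid" if ex["status"] == "discarded" else "valid"
        novelty = "duplicate" if ex["representation"] in seen else "unique"
        seen.add(ex["representation"])
        counts[(validity, novelty)] += 1
    return counts

examples = [
    {"status": "passed", "representation": "Tree(B, 1)"},
    {"status": "passed", "representation": "Tree(B, 1)"},   # a duplicate
    {"status": "discarded", "representation": "Tree(R, 2)"},
]
counts = sample_breakdown(examples)
print(counts)
```

A large count in the ("invalid", "unique") or ("valid", "duplicate") cells is exactly the signal that the generation strategy deserves attention.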
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">How are test inputs distributed?</head><p>After checking the high-level breakdown of test inputs, the next questions in the user's mind will likely be about subtler aspects of the distribution of inputs used to test their property. Since test inputs are often structured objects (e.g., trees, event logs, etc.), it is difficult to observe their distribution directly: what would it even mean to plot a distribution of trees? Instead, the developer can visualize features of the distribution by plotting numerical or categorical data extracted from their test inputs.</p><p>For example, the following chart shows a distribution of sizes projected from a distribution of red-black trees:</p><p>Charts like these give developers windows into their distributions that are much easier to interpret than either the raw examples or the statistics reported by frameworks like Hypothesis: the chart above, for example, shows that the distribution skews quite small (actually, most trees are size 0!), which would likely lead to poor testing performance in practice.</p><p>Distributions for categorical features (e.g., whether the value at the root of a red-black tree is positive or not) are displayed in a different format:</p><p>Categorical feature charts can be especially useful for helping developers understand whether there are portions of the broader input space that their tests are currently missing. In this case, the developer may want to check on why so few roots are positive-in fact, it is because an empty tree does not have a positive root, and the distribution is full of empty trees! Our formative research suggested that just these two kinds of charts covered all of the kinds of projections that developers cared about. 
In fact, participants seemed concerned that adding more kinds of feature charts could be distracting; they felt they may waste time trying to find data to plot for the sake of using the charts, rather than plotting the few signals that would actually help with their understanding. In &#167;6.3 we describe how developers can design their own visualizations outside of Tyche if needed.</p></div>
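As a sketch of what such projections look like in code (the tuple encoding of red-black trees and both feature functions are our own hypothetical choices, not part of Tyche):

```python
from collections import Counter

# Hypothetical red-black tree encoding: None, or (color, left, key, right).
def size(t):
    return 0 if t is None else 1 + size(t[1]) + size(t[3])

def root_positive(t):
    return t is not None and t[2] > 0

trees = [None, None, ("B", None, 5, None), ("B", ("R", None, -1, None), 3, None)]

# Numerical feature: a histogram of tree sizes.
size_hist = Counter(size(t) for t in trees)
# Categorical feature: whether the key at the root is positive.
root_hist = Counter("positive" if root_positive(t) else "not positive" for t in trees)
print(size_hist, root_hist)
```

Note how the empty trees dominate both projections at once, mirroring the observation in the text that a skew toward size 0 also explains the shortage of positive roots.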
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">How did the tests execute overall?</head><p>The previous visualizations show information about test inputs, but developers may also have higher-level questions about what happened during testing. For example, early users, including formative research participants, asked for ways to visualize code coverage for their properties. Tyche provides the following coverage chart:</p><p>This Tyche chart shows the total code coverage achieved over the course of the test run. Note that this example is from a very small codebase, so there were really only a few disjoint paths to cover. Big jumps (around the 1st and 75th inputs) indicate inputs that meaningfully improved code coverage, whereas plateaus indicate periods where no new code was reached. As discussed in &#167;4, code coverage is an incomplete way to measure testing success, but knowing that the first 70+ test inputs all covered the same lines suggests that the generation strategy may spend too long exploring a particular type of input.</p><p>Tyche also provides charts with timing feedback, again answering a high-level question about execution that was requested by formative research participants:</p><p>The chart above shows that a majority of inputs execute quite quickly (less than 0.002 seconds) but that some take twice or three times that long. For the most expensive tests, the red area, signifying the time it takes to generate trees, is the largest. While users did request this chart, it is not clear how useful it is on its own (see &#167;7.1). However, the timing data can be used to corroborate and expand on information from other charts. For example, notice how the timing breakdown above actually mirrors the size chart from the previous section. The combination of these charts suggests that larger trees take much longer to generate, a trade-off that a developer should be aware of.</p><p>
The main way a user reaches the example view is by clicking on one of the selectable bars of the sample breakdown or feature distribution charts. The user can dig into the data to answer questions about why a chart looks a certain way (e.g., if they want to explore why so few of the red-black tree's root nodes are positive). Secondarily, the example view can be used to search for particular examples to make sure they appear as test inputs (e.g., important corner cases that indicate thorough testing).</p></div>
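The cumulative-coverage computation behind the coverage chart is straightforward to illustrate; this is a sketch under our own assumptions about the data shape (per-input sets of covered lines, e.g. drawn from each example's optional coverage field), not Tyche's implementation:

```python
def coverage_curve(per_input_coverage):
    """Cumulative count of distinct covered lines after each test input.
    Plateaus mean no new code was reached; jumps mark informative inputs."""
    covered = set()
    curve = []
    for lines in per_input_coverage:
        covered.update(lines)
        curve.append(len(covered))
    return curve

# Hypothetical per-input coverage sets for four test inputs.
runs = [{"rbt.py:10", "rbt.py:11"}, {"rbt.py:10"}, {"rbt.py:10"}, {"rbt.py:20"}]
print(coverage_curve(runs))  # [2, 2, 2, 3]
```

A long flat prefix in the returned curve corresponds exactly to the "first 70+ inputs covered the same lines" situation described above.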
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Reactivity and Customizability</head><p>The visualizations provided by Tyche are reactive and customizable, allowing them to integrate neatly into the developer's workflow as dictated by our design considerations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1">Reactivity.</head><p>Reactivity has been incorporated into an astonishing variety of programming tools. It is a common feature of many modern developer tools; two modern examples are Create React App <ref type="bibr">[13]</ref>, which reloads a web app on each source change, and pytest-watch <ref type="bibr">[71]</ref>, one of many testing harnesses that live-reruns tests upon code changes. When run as a VSCode extension, Tyche automatically refreshes the view when the user's tests re-run. When used in conjunction with a test suite watcher (e.g., pytest-watch, which reruns Hypothesis tests when the test file is saved), this yields an end-to-end experience with "level 3 liveness" on Tanimoto's hierarchy of levels of liveness <ref type="bibr">[84]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.2">Customizability.</head><p>In step (4) of the Tyche loop, the user can tweak their testing code in ways that change the visualizations that are shown the next time around the loop.<ref type="foot">6</ref></p><p>Assumptions. As discussed in &#167;2 with the red-black tree example, developers often express assumptions about what inputs are valid for their property. Concretely, this happens via the Hypothesis assume function; for example:</p><p>def test_insert_lookup(t, k, v):
    assume(is_red_black_tree(t))
    assert lookup(insert(k, v, t), k) == v</p><p>The assume function filters out any tree that does not satisfy the provided Boolean check: in this case, that the generated tree is a valid red-black tree. In the sample breakdown, inputs that break assumptions are shown as "Invalid."</p><p>Events. Hypothesis lets programmers define custom "events" that are triggered when something interesting happened during property execution. For example, the programmer might write:</p><p>if some_condition:
    event("hit_condition")</p><p>and then Hypothesis would output "hit_condition: 42%." To support richer visual displays of features, we extended the Hypothesis API (with the support of the Hypothesis developers) to allow events to include "payloads" that correspond to the numerical and categorical features in the feature charts above. Adding an event to the above property gives:</p><p>def test_insert_lookup(t, k, v):
    event("size", payload=size(t))
    assume(is_red_black_tree(t))
    assert lookup(insert(k, v, t), k) == v</p><p>These user events correspond to feature charts: the one shown here generates the size chart shown in the previous section.</p><p>By reusing Hypothesis's existing idioms for assumptions and events, Tyche hooks into existing developer workflows and makes them more powerful.</p></div>
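The semantics of assume and event can be illustrated with a small, framework-independent stand-in. Everything here (Discard, run_property, and the sample property) is hypothetical scaffolding for exposition; real Hypothesis works differently internally:

```python
class Discard(Exception):
    """Raised to skip an input that fails an assumption."""

def run_property(inputs, prop):
    """Run prop on each input, recording a status and any event payloads,
    loosely mirroring the per-example records a PBT framework emits."""
    log = []
    for x in inputs:
        features = {}
        def event(name, payload):
            features[name] = payload
        try:
            prop(x, event)
            status = "passed"
        except Discard:
            status = "discarded"
        except AssertionError:
            status = "failed"
        log.append({"status": status, "features": features})
    return log

def prop(xs, event):
    event("size", len(xs))
    if len(xs) == 0:          # stand-in for Hypothesis's assume(len(xs) > 0)
        raise Discard()
    assert sorted(xs)[0] == min(xs)

log = run_property([[3, 1], []], prop)
print(log)
```

The key point is that an event payload is recorded even for inputs that are later discarded, which is what lets feature charts explain why inputs fail assumptions.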
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">IMPLEMENTATION</head><p>In this section, we outline the implementation of the Tyche interface. We begin with the mechanics of the system itself ( &#167;6.1), but the most interesting part is the standardized OpenPBTStats format that PBT frameworks use to send data to Tyche ( &#167;6.2). In &#167;6.3 we explain how the Tyche architecture makes it easy to extend the ecosystem of related tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">UI Implementation</head><p>At the implementation level, Tyche is a web-based interface that is easy to integrate into existing PBT frameworks. The implementation can be found on GitHub.<ref type="foot">7</ref></p><p>6.1.1 React Application. Tyche is a React <ref type="bibr">[85]</ref> web application that consumes raw data about the results of one or more PBT runs and produces interactive visualizations to help users make sense of the underlying data. The primary way to use Tyche is in the context of an extension for VSCode that shows the interface alongside the tests that it pertains to, but it is also available as a standalone webpage to support workflow integration for non-VSCode users. (When running as an extension, Tyche is still fundamentally a web application: VSCode can render web applications in an editor pane.)</p><p>7 <ref type="url">https://github.com/tyche-pbt/tyche-extension</ref></p><p>[Figure 3: The schema of a single OpenPBTStats example line.]
{
  line_type: "example",
  run_start: number,
  property: string,
  status: "passed" | "failed" | "discarded",
  representation: string,
  features: {[key: string]: number | string},
  coverage: ...,
  timing: ...,
  ...
}</p><p>The mosaic chart described in &#167;5.2.1 is implemented with custom HTML and CSS, but all other charts and visualizations are generated with Vega-Lite <ref type="bibr">[75]</ref>. Vega-Lite has good default layout algorithms for most of the types of data we care about, although it could do a better job at making edge cases like NaN obvious; we leave this for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2">Framework Integration.</head><p>As discussed in &#167;5, we worked with the Hypothesis developers to make a few small changes to enable Tyche; other PBT tools require similar changes. The Hypothesis developers added a callback to capture data on each test run, and we implemented a simple data transformer to translate that data for Tyche. This data is printed to a file in the OpenPBTStats format, which we discuss in &#167;6.2.</p><p>In Hypothesis specifically, we also adapted the event function to have a richer API, described in &#167;5.3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">OpenPBTStats Data Format</head><p>We designed an open standard for PBT data that helps PBT frameworks integrate Tyche and related tools.</p><p>OpenPBTStats is based on JSON Lines <ref type="bibr">[88]</ref>: each line in the file is a JSON object that corresponds to one example. An example is the smallest unit of data that a test might emit; each represents a single test case. The JSON schema in Figure <ref type="figure">3</ref> defines the format of a single example line. Each example has a run_start timestamp, used to group examples from the same run of a property and disambiguate between multiple runs of data that are stored in the same file. The property field names the property being tested, and the status field records whether the example "passed" or "failed", or was "discarded" because the value did not satisfy the property's assumptions. The representation is a human-readable string describing the example (e.g., as produced by a class's __repr__ in Python). Finally, the features field contains the data collected for user-defined events.</p><p>The full format includes a few extra optional fields, including some human-readable details (e.g., to explain why a particular value was discarded), optional fields naming the particular generator that was used to produce a value, and a freeform metadata field for any additional information that might be useful in the example view. A guide to using the format can be found online.<ref type="foot">8</ref></p></div>
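Because the format is plain JSON Lines, consuming it requires only the standard library. The two records below are hypothetical data in the shape described above; the grouping key mirrors how run_start disambiguates runs:

```python
import json
from collections import defaultdict

# Two example lines in the shape described above; a real log would be
# written by the PBT framework, one JSON object per line.
log = """\
{"line_type": "example", "run_start": 1700000000, "property": "test_insert_lookup", "status": "passed", "representation": "Tree(B, 5)", "features": {"size": 1}}
{"line_type": "example", "run_start": 1700000000, "property": "test_insert_lookup", "status": "discarded", "representation": "Tree(R, 5)", "features": {"size": 1}}
"""

# Group example statuses by (run_start, property), as a consumer would.
runs = defaultdict(list)
for line in log.splitlines():
    ex = json.loads(line)
    runs[(ex["run_start"], ex["property"])].append(ex["status"])

print(dict(runs))
```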
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Expanding the Ecosystem</head><p>The clean divide between Tyche and OpenPBTStats means that PBT frameworks require only the modest work of implementing OpenPBTStats to get access to the visualizations implemented by Tyche, and conversely that front ends other than Tyche will work with any PBT tool that implements OpenPBTStats.</p><p>6.3.1 Supporting New Frameworks. Supporting a new PBT framework is as simple as extending it with some lightweight logging infrastructure. Framework developers can start small: supporting just five fields (line_type, run_start, property, status, and representation) is enough to enable a substantial portion of Tyche's features. After that, adding the features field will enable user control of visualizations; coverage and timing may be harder to implement in some programming languages, but worthwhile to support the full breadth of Tyche charts.</p><p>So far, support for OpenPBTStats exists in Hypothesis, Haskell QuickCheck, and OCaml's base-quickcheck. Our minimal Haskell QuickCheck implementation is an external library comprising about 100 lines of code and took an afternoon to write.</p></div>
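A minimal emitter covering only those five fields might look like the following; make_logger is our own hypothetical helper, not part of any framework, shown in Python for concreteness even though a new integration would live in the framework's own language:

```python
import io
import json
import time

def make_logger(out):
    """Hypothetical minimal OpenPBTStats emitter covering only the five
    fields named above; `out` is any writable text stream."""
    run_start = int(time.time())
    def log(prop, status, representation):
        out.write(json.dumps({
            "line_type": "example",
            "run_start": run_start,
            "property": prop,
            "status": status,
            "representation": representation,
        }) + "\n")
    return log

# Demonstration against an in-memory buffer instead of a file.
buf = io.StringIO()
log = make_logger(buf)
log("test_sort", "passed", "[1, 2, 3]")
print(buf.getvalue())
```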
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.2">Adding New Analyses.</head><p>Basing OpenPBTStats on JSON Lines and making each line a mostly-flat record means that processing the data is very simple. This simplifies the Tyche codebase, but it also makes it easy to process the data with other tools. For example, getting started visualizing OpenPBTStats data in a Jupyter notebook requires two lines of code:</p><p>import pandas as pd
pd.read_json(&lt;file&gt;, lines=True)</p><p>This means that if a developer starts out using Tyche but finds that they need a visualization that cannot be generated by adding an assumption or event, they can simply load the data into a notebook and start building their own analyses.</p><p>In the open-source community, we also expect that developers may find entirely new use-cases for OpenPBTStats data that are not tied to Tyche. For example, OpenPBTStats data could be used to report testing performance to other developers or managers (a use-case mentioned by participants in our formative research).</p></div>
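Continuing the notebook example, a custom analysis is only a few more lines. The records below are hypothetical OpenPBTStats-style data held inline rather than read from a file:

```python
import io
import pandas as pd

# A few OpenPBTStats-style lines (hypothetical data, inline for brevity).
jsonl = "\n".join([
    '{"run_start": 1, "property": "p", "status": "passed", "features": {"size": 0}}',
    '{"run_start": 1, "property": "p", "status": "passed", "features": {"size": 3}}',
    '{"run_start": 1, "property": "p", "status": "discarded", "features": {"size": 3}}',
])

df = pd.read_json(io.StringIO(jsonl), lines=True)
# For example: status counts, and the mean size of the valid inputs.
print(df["status"].value_counts().to_dict())
valid = df[df["status"] == "passed"]
print(valid["features"].apply(lambda f: f["size"]).mean())
```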
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">EVALUATION</head><p>In this section, we evaluate Tyche. &#167;7.1 presents an online self-guided study to assess Tyche's impact on users' judgments about the quality of test suites. &#167;7.2 describes the concrete impact that Tyche has already had through identifying bugs in the Hypothesis testing framework itself.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Online Study</head><p>We designed this study to validate what we saw as the most critical question about the design: whether the kinds of visual feedback offered by Tyche led to improved understanding of test suites. We regarded this question as most critical because we had less confidence in the effectiveness of visual feedback for helping find bugs than in other aspects of the Tyche design; indeed, it is a tall order for any kind of feedback to provide an effective proxy for the bug-finding power of tests. (By contrast, we felt our choices around customizability, workflow integration, details on demand, and standardization were already on solid ground: these choices were more conservative, and had previously received positive feedback from developers and PBT tool builders.)</p><p>Accordingly, we designed a study to address the following research questions:</p><p>RQ1 Does Tyche help developers to predict the bug-finding power of a given test suite?</p><p>RQ2 Which aspects of Tyche do users think best support sensemaking about test results?</p><p>To go beyond qualitative feedback alone, we designed the study to support statistical inference about whether we had improved judgments about test distributions. This led us to a self-guided, online usability study that centered on focused usage of Tyche's visual displays. The study allowed us to collect sufficiently many responses from diverse and sufficiently-qualified programmers to support the analysis we wanted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1">Study Population.</head><p>We recruited study participants both from social media users on X (formerly Twitter) and Mastodon and from graduate and undergraduate students in the computer science department of a large university, aiming to recruit a diverse set of programmers ranging from relative beginners with no PBT experience to experts who may have some exposure (all participants but one were at least "proficient" in Python programming).</p><p>In all, we recruited 44 participants. 4 responses were discarded because they did not correctly answer our screening questions, leaving 40 valid responses. All but one of these reported that they were at least proficient with Python, with 12 self-reporting as advanced and 9 as expert. Half reported being beginners at PBT, 13 proficient, 6 advanced, and 0 experts. Almost all participants reported being inexperienced with the Python Hypothesis framework; only 7 reported being proficient. To summarize, the average participant had experience with Python but not PBT, and if they did know about PBT it was often not via Hypothesis.</p><p>When reporting education level, 4 participants had a high school diploma, 15 an undergraduate degree, and 20 a graduate degree. The majority of participants (24) described themselves as students; 7 were engineers; 3 were professors; 6 had other occupations. 28 participants self-identified as male, 5 as female, 2 as another gender, and 5 did not specify.</p><p>We discuss the limitations of this sample in &#167;7.1.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2">Study Procedure.</head><p>We hypothesized that Tyche would improve a developer's ability to determine how well a property-based test exercises the code under test, and therefore how likely it is to find bugs. At its core, our study consisted of four tasks, each presenting the participant with a PBT property plus three sets of sampled inputs for testing that property, drawn from three different distributions respectively. The goal of each task was to rank the distributions in order of their bug-finding power, with the help of either Tyche or a control interface that mimicked the existing user experience of Hypothesis. Concretely, the control interface consisted of Hypothesis's "statistics" output and a list of pretty-printed test input examples; the statistics output included Hypothesis's warnings (e.g., when &lt; 10% of the sample inputs were valid). Both interfaces were styled the same way and embedded in HTML iframes, so participants could interact with them as they would if the display were visible in their editor; Tyche was re-labeled "Charts" and the control was labeled "Examples" to reduce demand characteristics.</p><p>[Figure 4: Study flow: Background &amp; Instructions, Sense Check Questions, Evaluation Tasks (x4, each ranking Dist. 1-3 with either the Charts or Examples interface), Closing Questions.]</p><p>The distributions that participants had to rank were chosen carefully: one distribution was the best we could come up with; one was a realistic generator that a developer might write, but with some flaw or inefficiency; and one was a low-quality starter generator that a developer might obtain from an automated tool. To establish a ground truth for bug-finding power, we benchmarked each trio of input distributions using a state-of-the-art tool called Etna <ref type="bibr">[76]</ref>.
Etna greatly simplifies the process of mutation testing as a technique for determining the bug-finding power of a particular generation strategy: the programmer specifies a collection of synthetic bugs to be injected into a particular bug-free program, and Etna does the work of measuring how quickly (on average) a generator is able to trigger a particular bug with a particular property. Prior work has shown that test quality as measured by mutation testing is well correlated with the power of tests to expose real faults <ref type="bibr">[43]</ref>. These ground truth measurements agreed with the original intent of the generators, with the best ones finding the most bugs, followed by the flawed ones, followed by the intentionally bad ones.</p><p>The study as experienced by the user is summarized in Figure <ref type="figure">4</ref>. We started by providing participants some general background on PBT, since we did not require that participants had worked with it before, and instructions for the study. After some "sense-check" questions to ensure that participants had understood the instructions, we presented the main study tasks. In each, the participants ranked three test distributions based on how likely they thought they were to find bugs. Each of the four tasks was focused on a distinct data structure and a corresponding property:</p><p>&#8226; Red-Black Tree The property described in &#167;2.1 about the insert function for a red-black tree implementation.</p><p>&#8226; Topological Sort A property checking that a topological sorting function works properly on directed acyclic graphs. &#8226; Python Interpreter A property checking that a simple Python interpreter behaves the same as the real Python interpreter on straight-line programs. 
&#8226; Name Server A property checking that a realistic name server <ref type="bibr">[89]</ref> behaves the same as a simpler model implementation.</p><p>These tasks were designed to be representative of common PBT scenarios: red-black trees are a standard case study in the literature <ref type="bibr">[74,</ref><ref type="bibr">76]</ref>, topological sort has been called an ideal pedagogical example for PBT <ref type="bibr">[64]</ref>, programming language implementations are a common PBT application domain <ref type="bibr">[69]</ref>, and name servers are a kind of system that is known to be difficult to test with PBT: specifically, systems with significant internal state <ref type="bibr">[40]</ref>.</p><p>To counterbalance potential biases due to the order in which different tasks or conditions were encountered, we randomized the experience in three ways: (1) two tasks were randomly assigned Tyche, while the other two received the control interface, (2) tasks were shown to users in a random order, and (3) the three distributions for each task were arranged in a random order.</p><p>Four participants took over an hour to complete the study; we suspect this is because they started, took a break, and then returned to the study. Of the rest, participants took 32 minutes on average (σ = 12) to complete the study; only one took less than 15 minutes. Participants took about 3 minutes on average (σ = 2.5) to complete each task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.3">Results.</head><p>To answer RQ1, whether or not Tyche helps developers to predict test suite bug-finding power, we analyzed how well participants' rankings of the three distributions for each task agreed with the true rankings as determined by mutation testing. Given a participant's ranking, for example D2 &gt; D1 &gt; D3, we compared it to the true ranking (say, D1 &gt; D2 &gt; D3) by counting the number of correct pairwise comparisons: here, for example, the participant correctly deduced that D1 &gt; D3 and D2 &gt; D3, but they incorrectly concluded that D2 &gt; D1, so this counted as one incorrect comparison.<ref type="foot">9</ref></p><p>Figure <ref type="figure">5</ref> shows the breakdown of incorrect comparisons made with and without Tyche, separated out by task. To assess whether Tyche impacted correctness, we performed a one-tailed Mann-Whitney U test <ref type="bibr">[59]</ref> for each task, with the null hypothesis that Tyche does not lead to fewer incorrect comparisons. The results appear in Table <ref type="table">1</ref>. For three of the four tasks (all but Python Interpreter), participants made significantly fewer incorrect comparisons when using Tyche, with strong common language effect sizes, meaning that participants were better at assessing testing effectiveness with Tyche than without. Furthermore, a majority of participants got a completely correct ranking for all 4 of the tasks with Tyche, while this was only the case for 1 of the tasks without Tyche.</p><p>[Figure 5: Counts of incorrect comparisons (0-3) per task (Red-Black Trees, Topological Sort, Python Interpreter, Name Server), split by condition (Charts vs. Examples).]</p><p>(For Python Interpreter, participants overwhelmingly found the correct answer with both conditions; in other words, the task was simply too easy, but precisely why it was too easy is interesting; see &#167;7.1.4.)
Despite this difference in accuracy, participants took around the same time with both treatments; the mean time to complete a task with Tyche was 183 seconds (σ = 125), versus 203 seconds (σ = 165) for the control. These results support answering RQ1 with "yes": Tyche helps users more accurately predict bug-finding power.</p><p>To answer RQ2 we used a post-study survey, asking participants for feedback on which of Tyche's visualizations they found useful. The vast majority of participants (37/40) stated that Tyche's "bar charts" were helpful. (Unfortunately, we phrased this question poorly: we intended for it to refer only to feature charts, but participants may have interpreted it to include the mosaic chart as well.) Additionally, 20/40 participants found the code coverage visualization useful, 17/40 found the warnings useful, and 14/40 found the listed examples useful. Only 4/40 found the timing breakdown useful; we may need to rethink that chart's design, although it may also simply be that the tasks chosen for the study did not require timing data to complete. These results suggest that the customizable parts of the interface (the feature and/or mosaic charts) were the most useful, followed by other affordances.</p><p>To get a sense of participants' overall impression of Tyche, we also asked "Which view [Tyche or the control] made the difference between test suites clearer?" with five options on a Likert scale. All but one participant said Tyche made the differences clearer, with 35/40 saying Tyche was "much clearer" (the maximum on the scale).</p><p>Confidence. Alongside each ranking, we asked developers how confident they were in it, on a scale from 1-5 ("Not at all" = 1, "A little confident" = 2, "Moderately confident" = 3, "Very confident" = 4, "Certain" = 5).
We found that reported confidence was significantly higher with Tyche than without on two tasks (Red-Black Tree and Topological Sort), as computed via a similar one-sided Mann-Whitney U test to the one before (p &lt; 0.01 and p = 0.03, respectively), with no significant difference for the other tasks. However, confidence ratings should be viewed with some skepticism. When we computed Spearman's ρ <ref type="bibr">[79]</ref> between the confidence scores and incorrect comparison counts, we found no significant relationship; in other words, participants' confidence was not, broadly, a good predictor of their success.</p><p>Non-significant Result for "Python Interpreter" Task. As mentioned above, the Python Interpreter task seems to have been too easy; participants made very few mistakes across the board. We propose that this is, at least in part, because the existing statistics output available in Hypothesis was already good enough. For the worst of the three distributions, Hypothesis clearly displayed a warning that "&lt; 10% of examples satisfied assumptions," an obvious sign of something wrong. Conversely, for the best distribution of the three, Hypothesis showed a wide variety of values for the variable_uses event, which was only ever 0 for the other two distributions. Critically, the list displayed was visually longer, so it was easy to notice a difference at a glance. (We show an example of what the user saw in Appendix B.) This result shows that Hypothesis's existing tools can be quite helpful in some cases: in particular, they seem to be useful when the distributions have big discrepancies that make a visual difference (e.g., adding significant volume) in the statistics output.</p><p>7.1.5 Limitations. We are aware of two significant limitations of the online study: sampling bias and ecological validity. The sample we obtained under-represents important groups with regards to both gender and occupation.
For gender, prior work has shown that user interfaces often demonstrate a bias for cognitive strategies that correlate with gender <ref type="bibr">[9,</ref><ref type="bibr">82]</ref>, so a more gender-diverse sample would have been more informative for the study.</p><p>For occupation, we reached a significant portion of students and proportionally fewer working developers. Many of those students are in computer science programs and therefore will likely be developers someday, but software developers are ultimately the population we would like to impact, so we would like more direct confirmation that Tyche works for them.</p><p>The other significant limitation is ecological validity. Because this study was not run in situ, aspects of the experimental design may have impacted the results. For example, study participants did not write the events and assumptions for the property themselves; this means our outcomes assume that the participants could have come up with those events themselves in practice. Additionally, participants saw snippets of code, but they were not intimately familiar with, nor could they inspect, the code under test. In a real testing scenario, a developer's understanding of their testing success would depend in part on their understanding of the code under test itself. We did control for other ecological issues: for example, we used live instances of Tyche in an iframe to maintain the interactivity of the visual displays, and we developed tasks that spanned a range of testing scenarios. We discuss plans to evaluate Tyche in situ in &#167;8.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Impact on the Testing Ecosystem</head><p>Since Tyche is an open-source project that is beginning to engage with the PBT community, we can also evaluate its design by looking at its impact on practice. The biggest sign of this so far is that Tyche has led to 5 concrete bug-fixes and enhancements in the Hypothesis codebase itself. As of this writing, Hypothesis developers have found and fixed three bugs-one causing test input sizes to be artificially limited, another that badly skewed test input distributions, and a third that impacted performance of stateful generation strategies-and two long-standing issues pertaining to user experience: a nine-year-old issue about surfacing important feedback about the assume function and a seven-year-old issue asking to clarify terminal error messages. All five issues are threats to developers' evaluation of their tests. They were found and fixed when study participants and other Tyche users noticed deficiencies in their test suites that turned out to be library issues.</p><p>The ongoing development of Tyche has the support of the Hypothesis developers, and it has also begun to take root in other parts of the open-source testing ecosystem. One of the authors was contacted by the developers of PyCharm, an IDE focused on Python specifically, to ask about the OpenPBTStats format. They realized that the coverage information therein would provide them a shortcut for code coverage highlighting features that integrate cleanly with Hypothesis and other testing frameworks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSIONS AND FUTURE WORK</head><p>Tyche rethinks the PBT process, making it more interactive and empowering for developers. Rather than hide the results of running properties, which may lead to confusion and false confidence, the OpenPBTStats protocol and interfaces like Tyche give developers rich insight into their testing process. Tyche provides visual feedback, integrates with developer workflows, provides hooks for customization, shows details on demand, and works with other tools in the ecosystem to provide a standardized way to evaluate testing success.</p><p>Our evaluation shows that Tyche helps developers to tell the difference between good and bad test suites; its demonstrated real-world impact on the Hypothesis framework confirms its value.</p><p>Moving forward, we see a number of directions where further research would be valuable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Evaluation in Long-Term Deployments</head><p>Our formative research and online evaluation study have provided evidence that Tyche is usable, but there is more to explore. For one thing, we would like to get in-situ empirical validation for the second half of the loop in Figure <ref type="figure">2</ref>. As Tyche is deployed over longer periods of time in real-world software development settings, we are excited to assess its usability and continued impact.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">Improving Data Presentation for Tyche</head><p>As the Tyche project evolves, we plan to add new visualizations and workflows to support developer exploration and understanding.</p><p>Code Coverage Visualization. Study participants did not consider our visualization of code coverage over time particularly important: it may be useful to explore alternative designs or to cut that feature entirely.</p><p>One path forward is in-situ line-coverage highlighting, like that provided by Tarantula <ref type="bibr">[42]</ref>. Indeed, it would be easy to implement Tarantula's algorithm, which highlights each line based on the proportion of passing versus failing tests that execute it, in Tyche (with support from OpenPBTStats). In cases where no failing examples are found, each line could simply be highlighted with a brightness proportional to the number of times it was covered. <ref type="foot">10</ref> Line highlighting can answer some questions about particular parts of the codebase, but developers may also have questions about how code is exercised for different parts of the input space. To address these questions, we plan to experiment with visualizations that cluster test inputs based on the coverage that they have in common. This would let developers answer questions like "which inputs could be considered redundant in terms of coverage?" and "which inputs cover parts of the space that are rarely reached?"</p><p>Mutation Testing. In cases where developers implement mutation testing for their system under test, we propose incorporating information about failing mutations into Tyche for better interaction support. Recall that in &#167;7.1, we used mutation testing, via the Etna tool, as a ground truth for test suite quality; mutation testing checks that a test suite can find synthetic bugs, or "mutants," that are added to the system under test. 
Etna is powerful, but its output is not interactive: there is no way to explore the charts it generates, nor can its mutation testing results be connected with the other visualizations that Tyche provides. Thus, we hope to add optional visualizations to Tyche, inspired by Etna, that tell developers how well their tests catch mutants.</p><p>Longitudinal Comparisons of Testing Effectiveness. Informal conversations with potential industrial users of Tyche suggest that developers want ways to compare visualizations of test performance for the same system at different points in time: either in the short term, to inspect the results of a change, or in the longer term, to understand how testing effectiveness has evolved. These comparisons would make it clear whether changes over time have improved test quality or introduced significant regressions.</p><p>Interestingly, the design of the online evaluation study accidentally foreshadowed a design that may be effective: allowing two instances of Tyche, connected to different instances of the system under test, to run side-by-side so the user can compare them. Since developers were able to successfully compare two distributions side-by-side with Tyche in the study, we expect they could do the same in practice. This design would be simple to implement and would provide good value for developers.</p></div>
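To make the Tarantula-style line scoring and the coverage-redundancy question above concrete, here is a minimal Python sketch. The function names and data shapes are our own illustration, not part of Tyche or OpenPBTStats; a real implementation would read per-test coverage from OpenPBTStats reports.

```python
def suspiciousness(passed_hits, failed_hits, total_passed, total_failed):
    # Tarantula's heuristic: a line executed mostly by failing tests
    # scores near 1.0; one executed mostly by passing tests scores near 0.0.
    if total_failed == 0:
        return 0.0  # no failures: fall back to brightness by raw hit count
    fail_ratio = failed_hits / total_failed
    pass_ratio = passed_hits / total_passed if total_passed else 0.0
    if fail_ratio + pass_ratio == 0:
        return 0.0  # line never executed by any test
    return fail_ratio / (fail_ratio + pass_ratio)

def redundant_groups(coverage):
    # Cluster test inputs by the exact set of lines they cover; any group
    # with more than one member is redundant with respect to line coverage.
    groups = {}
    for test_input, lines in coverage.items():
        groups.setdefault(frozenset(lines), []).append(test_input)
    return [group for group in groups.values() if len(group) > 1]
```

A line hit only by failing tests scores 1.0 and would be highlighted most brightly; inputs with identical coverage sets surface as candidates for pruning.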
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">Improving Control in Tyche</head><p>Tyche is currently designed to support existing developer workflows and provide insights into test suite shortcomings. But participants in the formative research (P1, P4) speculated about ways that Tyche could help developers adjust their random generation strategies after they notice something is wrong.</p><p>Direct Manipulation of Distributions. When a developer notices, with the help of Tyche, that their test input distribution is subpar, they may immediately know what distribution they would prefer to see. In this case, we would like developers to be able to change the distribution via direct manipulation: clicking and dragging the bars of the distribution to the places they should be, automatically updating the input generation strategy accordingly. One potential way to achieve this would be to borrow techniques from the probabilistic programming community, and in particular languages like Dice <ref type="bibr">[38]</ref>. Probabilistic programming languages and random data generators are quite closely related, but the potential overlap is under-explored. Alternatively, reflective generators <ref type="bibr">[26]</ref> can tune a PBT generator to mimic a provided set of examples. If a developer thinks a particular bar of a chart should be larger, a reflective generator may be able to tune the generator to expand on the examples represented in that bar.</p><p>Manipulating Strategy Parameters in Tyche. Direct manipulation as discussed above will occasionally be computationally infeasible; in those cases, Tyche could still provide tools to help developers easily manipulate the parameters of different generation strategies. 
For example, if a generation strategy takes a max_value as an input, Tyche could render a slider that lets the developer change that value and watch how the visualizations change, an interaction resembling those already found in HCI programming tools (e.g., <ref type="bibr">[30,</ref><ref type="bibr">46]</ref>). Of course, running hundreds of tests on every slider update may be slow; to speed it up, we propose incorporating ideas from the literature on self-adjusting computation <ref type="bibr">[1]</ref>, which offers tools for efficiently re-running computations in response to small changes in their inputs.</p></div>
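The slider interaction above can be sketched with a plain Python stand-in for a generation strategy. This is illustrative only: `integer_strategy` and `preview_distribution` are hypothetical names, and a real implementation would re-sample the actual Hypothesis strategy rather than this toy generator.

```python
import random
from collections import Counter

def integer_strategy(max_value, rng):
    # Stand-in for a PBT generation strategy with one tunable parameter.
    return rng.randint(0, max_value)

def preview_distribution(max_value, trials=1000, seed=0):
    # What a slider callback might recompute on each change: re-sample
    # the strategy and bucket the results into ten bars for the chart.
    rng = random.Random(seed)
    samples = [integer_strategy(max_value, rng) for _ in range(trials)]
    return Counter(x * 10 // (max_value + 1) for x in samples)
```

Naively re-sampling on every slider tick is exactly the cost that self-adjusting computation aims to avoid; caching bucket counts keyed by the parameter value would be a natural first optimization.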
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.4">Tyche Beyond PBT</head><p>The ideas behind Tyche may also have applications beyond the specific domain of PBT. Other automated testing techniques, such as fuzz testing ("fuzzing"), could also benefit from enhanced understandability. Fuzzing is closely related to PBT, <ref type="foot">11</ref> and the fuzzing community has some interesting visual approaches to communicating testing success. One of the most popular fuzzing tools, AFL++<ref type="bibr">[18]</ref>, includes a sophisticated textual user interface giving feedback on code coverage and other fuzzing statistics over the course of (sometimes lengthy) "fuzzing campaigns." But current fuzzers suffer from the same usability limitations as current PBT frameworks, hiding information that could help developers evaluate testing effectiveness. We would like to explore adapting Tyche and expanding OpenPBTStats to work with fuzzers and other automated testing tools, bringing the benefits of our design to an even broader audience.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Some approaches to PBT use exhaustive enumeration<ref type="bibr">[74]</ref> or guided search<ref type="bibr">[51]</ref> instead of hand-written input generators; these lead to different usability trade-offs, but ultimately results should always be reviewed to ensure that testing was successful.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>P5 showed us older code that they no longer had the infrastructure to run, so they only saw Tyche running on our examples.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>Each of these viewpoints seems to be correct in some situations; a recent study<ref type="bibr">[76]</ref> has a nice discussion.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>P3's interview audio was lost due to technical difficulties, so we instead analyzed the notes we took during their interview.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>While Tyche works with many PBT frameworks, we describe these customizations in detail for Python's Hypothesis specifically. Other frameworks may choose to implement user customization in other ways that are more idiomatic for their users.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_5"><p>See<ref type="bibr">[31]</ref>. Some field names have been changed to clarify the explanations in the paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_6"><p>This metric is isomorphic to Spearman's &#961;<ref type="bibr">[79]</ref> in this case: making 0 incorrect comparisons equates to &#961; = 1, making 1 equates to &#961; = 0.5, 2 to &#961; = -0.5, and 3 to &#961; = -1. We found counting incorrect comparisons to be the most intuitive way of conceptualizing the data. This corresponds to the probability that a randomly sampled Tyche participant will make fewer errors than a control participant, computed as U1 / (n1 * n2).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_7"><p>Unit-test frameworks could also report simple OpenPBTStats output (see &#167;6.3.1) with one line per example-based test, enabling per-test coverage visualization for almost any test suite.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_8"><p>Generally speaking, fuzzers operate on whole programs and run for extended periods of time, whereas PBT tools operate on smaller program units and run for shorter times. Instead of testing logical properties, fuzzers generally try to make the program crash.</p></note>
		</body>
		</text>
</TEI>
