<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ICoq: Regression proof selection for large-scale verification projects</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/01/2017</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10055461</idno>
					<idno type="doi">10.1109/ASE.2017.8115630</idno>
					<title level='j'>International Conference on Automated Software Engineering</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Ahmet Celik</author><author>Karl Palmskog</author><author>Milos Gligoric</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Proof assistants such as Coq are used to construct and check formal proofs in many large-scale verification projects. As proofs grow in number and size, the need for tool support to quickly find failing proofs after revising a project increases. We present a technique for large-scale regression proof selection, suitable for use in continuous integration services, e.g., Travis CI. We instantiate the technique in a tool dubbed ICOQ. ICOQ tracks fine-grained dependencies between Coq definitions, propositions, and proofs, and only checks those proofs affected by changes between two revisions. ICOQ additionally saves time by ignoring changes with no impact on semantics. We applied ICOQ to track dependencies across many revisions in several large Coq projects and measured the time savings compared to proof checking from scratch and when using Coq's timestamp-based toolchain for incremental checking. Our results show that proof checking with ICOQ is up to 10 times faster than the former and up to 3 times faster than the latter.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Verification projects based on construction and certification of formal proofs inside proof assistants have reached a hitherto unprecedented scale. Large projects take two main forms: formalizations of mathematical theories and programs with accompanying proofs of correctness at the level of executable code <ref type="bibr">[23]</ref>. The former includes the proofs of the four-color theorem <ref type="bibr">[27]</ref> and the Feit-Thompson odd order theorem in Coq <ref type="bibr">[28]</ref>, and a proof of the Kepler conjecture in HOL Light <ref type="bibr">[29]</ref>; the latter includes the certified seL4 operating system kernel in Isabelle/HOL <ref type="bibr">[35]</ref>, and the CompCert C compiler in Coq <ref type="bibr">[38]</ref>.</p><p>Using proof assistants has advantages with respect to scalability, modularity, and reliability compared to using more automated methods based only on model checking or SMT solving <ref type="bibr">[23]</ref>. On the other hand, proof assistants are more human resource intensive to use than model checkers, and come with less tool support than what is available to programmers using mainstream programming languages. Specifically, Wenzel has recently noted the need for more systematic tool support to maintain repositories of formal proofs <ref type="bibr">[61]</ref>.</p><p>Large verification projects based on proof assistants are similar to regular software projects in that (a) the end goal is a software artifact with certain properties, (b) developers use an integrated development environment (IDE) to write code, which is then checked by a tool and submitted to a version control system shared with others. Evidence from earlier undertakings indicate that such projects require engineering effort similar to, or beyond, some of the most complex software projects; for example, the proof of the odd order theorem in Coq was a six-year effort of a team of 15 people, resulting in 170,000 lines of code <ref type="bibr">[40]</ref>.</p><p>We believe that proper tool support for large-scale proof engineering using proof assistants is an important and growing concern <ref type="bibr">[34]</ref>. In particular, it is important to quickly find and report errors in evolving Coq and Isabelle/HOL projects. However, just as for large projects in, e.g., Java, determining the errors caused by a particular change can be a timeconsuming process. For instance, the Coq correctness proofs of an implementation of the Raft distributed consensus protocol <ref type="bibr">[41]</ref> are around 50k lines in total <ref type="bibr">[63]</ref> and take more than 30 minutes to check from scratch on a computer with an Intel Core i7 4th generation processor. Potentially, a Coq user has to wait all this time to find out whether a change in some definition makes a seemingly unrelated proof fail.</p><p>Until recently, all proof assistants in the LCF family, including Isabelle/HOL and Coq, relied on user interaction through a read-eval-print loop inherited from their predecessor. This interaction model effectively prevents event-based user interaction with proof assistant files inside an IDE, in the style of Eclipse. Initial work in Isabelle/HOL to address this problem <ref type="bibr">[60]</ref> paved the way for recent architectural changes in Coq towards a document-oriented interaction model, where the proof assistant backend asynchronously receives definitions, proof commands, and proof checking tasks from the user, all of which may concern disparate parts of a project <ref type="bibr">[7]</ref>.</p><p>In this paper, we show that potential gains in productivity from Coq's new interaction model go beyond recent application inside IDEs <ref type="bibr">[21]</ref>. We present ICOQ, a tool for regression proof selection for large-scale Coq projects, suitable for use in workflows involving version control and continuous integration services (CISs), e.g., Travis CI <ref type="bibr">[31]</ref>, <ref type="bibr">[53]</ref>. (CISs run tests/proofs of a project whenever code of the project changes. These services have become widely used; Travis CI, one out of more than 20 available CISs, is used by more than 300k projects <ref type="bibr">[33]</ref>.) ICOQ works by tracking dependencies between definitions, propositions, and proofs. When presented with a set of changes to Coq files, ICOQ uses this knowledge of dependencies to only check the proofs affected by the changes, potentially saving significant time in comparison to checking everything from scratch. In addition, ICOQ saves time by ignoring changes with no impact on the semantics of files, e.g., additions of comments or whitespaces.</p><p>Our approach is based on a fundamental analogy between tests and proofs. As Beck has noted in context of extreme programming <ref type="bibr">[9]</ref>, a test can be viewed as a method that checks a partial functional specification of a system. Consequently, a proposition about a (pure) function in Coq's logic, along with its proof, can be viewed as an amalgamation of manypossibly an infinite number of-tests. For example, changing the definition of a function in a Coq file can potentially impact many proofs, analogously to how changes in Java programs affect tests in a test suite. Using this analogy, ICOQ mirrors previous work in regression testing for mainstream programming languages, in particular techniques for lightweight regression test selection, which have been shown to significantly lower the cost of running test suites, and hence find errors more quickly <ref type="bibr">[10]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[37]</ref>, <ref type="bibr">[42]</ref>, <ref type="bibr">[43]</ref>, <ref type="bibr">[46]</ref>, <ref type="bibr">[47]</ref>, <ref type="bibr">[52]</ref>, <ref type="bibr">[64]</ref>. Such tools have recently been adopted by many large open-source Java projects. ICOQ opens the door for similar benefits to accrue to developers of large Coq projects.</p><p>Nevertheless, proofs and tests are also different in several important ways. First, the proof of one claim typically depends on other claims; tests are typically completely independent of other tests. Second, function definitions, claims, and proof scripts are often interspersed in Coq files; test code is seldom interspersed with program code. Third, Coq proof checking is done in the same environment as the processing of definitions and even computation; executing tests is usually done completely separately from code compilation. We overcome these three challenges by leveraging Coq's newly-added toolchain for asynchronous proof processing <ref type="bibr">[7]</ref>.</p><p>To evaluate ICOQ, we applied it on the version control histories of several Coq developments, including three largescale projects, and measured the time savings compared to proof checking from scratch (typical use in continuous integration systems) and incremental proof checking using Coq's timestamp-based toolchain (typical command-line use). Our results show that processing proofs with ICOQ is up to 10&#215; faster than the former, and up to 3&#215; faster than the latter.</p><p>We make the following contributions:</p><p>&#8902; Technique: We propose regression proof selection (inspired by regression test selection), a technique that can substantially reduce proof checking time for evolving verification projects. To the best of our knowledge, this is the first application of research in regression testing to the domain of formal proofs. Our insight is that due to simpler language features in proof assistants than in imperative languages (e.g., Java), regression proof selection can straightforwardly collect fine-grained dependencies, which are used to identify proofs to recheck at each project revision.</p><p>&#8902; Tool: We implemented regression proof selection in a tool, dubbed ICOQ, which supports Coq projects. We provide a version of our tool on the following URL: <ref type="url">http://cozy.ece</ref>. utexas.edu/icoq.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8902; Evaluation:</head><p>We performed an empirical study to measure the effectiveness (in terms of both number of executed proofs and proof checking time) of regression proof selection using ICOQ. We used several open-source Coq projects, including three large-scale projects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. COQ BACKGROUND</head><p>The Coq proof assistant can be viewed as, on the one hand, a small and powerful purely functional programming language, and on the other hand, a system for specifying properties about programs and proving them. Coq is based on a constructive type theory called the Calculus of Inductive Constructions (CIC) <ref type="bibr">[44]</ref>. In CIC, both programs and propositions about programs are types inhabited by terms, in effect putting program construction and proving on the same footing. Via a frontend, e.g., emacs with Proof General <ref type="bibr">[5]</ref>, a user interactively constructs tentative proof terms for propositions (assertions) using operations called tactics, and the final result is only accepted after Coq's type checker was run successfully by the backend on the term. Barring use of inconsistent axioms and frontend issues, the user need only trust that the comparatively small type checking kernel is correctly implemented and compiled to trust the results. The interactive proof development process in Coq is illustrated in Figure <ref type="figure">1</ref>. Definitions of functions and lemmas processed by Coq are written in the Gallina language, and reside in files ending in .v. The standard Coq batch proof processing ("compilation") tool, coqc, takes a .v file as input and produces a .vo file as output that contains full binary representations of processed Gallina constructs, including proofs. If the proofs are large and complex, .vo files can be tens of megabytes large <ref type="bibr">[39]</ref>. Since files may depend on other files, checking all proofs in a Coq project requires some form of dependency analysis. The standard coq_makefile tool generates a Makefile which, by default, calls the coqdep tool for this purpose <ref type="bibr">[16]</ref>. coqdep builds a dependency graph for all input files based on simple syntactic analysis of Require commands (similar to import statements in Java) in files, which indicate direct dependency at the file level. When proof checking is then performed via the Makefile, the generated dependency graph is used to compile .v files in some allowed order, possibly in parallel. The generated Makefile also enables timestamp-based incremental processing of Coq projects, which is known to be limited <ref type="bibr">[19]</ref>, <ref type="bibr">[25]</ref>.</p><p>Figure <ref type="figure">2</ref> shows the content of three example Gallina files, where a simple function on lists of natural numbers is defined, specified, and proved correct. Alternate.v contains definitions used in the two other files, and these </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Fig. 2. Coq Gallina file examples</head><p>dependencies are found by coqdep. The dependency data is used to generate a Makefile that calls coqc to produce .vo files; if Alternate.v is subsequently modified in any way after compilation, the other files will also be automatically recompiled when running make. On the other hand, modification of the other files does not trigger recompilation of Alternate.v.</p><p>In effect, the coqdep tool produces a coarse-grained dependency graph of a Coq development at the level of .v files, as shown in Figure <ref type="figure">3</ref>(a) for the example Gallina files; dashed arrows indicate dependencies on files from Coq's standard library, which are usually disregarded. Internally, Coq maintains a fine-grained dependency graph at the level of constants, reminiscent of the graph shown in Figure <ref type="figure">3(b)</ref>.</p><p>In each Coq file, the commands between Proof. and Qed. are proof scripts comprised of tactic calls along with bullets to indicate goal structure. Proof scripts instruct Coq how to build a proof term. Tactics can be pipelined and may perform sophisticated and time-consuming search operations, splitting of goals, and term rewriting. Ultimately, tactics produce a proof t in Coq's term syntax, of which a fragment is shown in Figure <ref type="figure">4</ref>. For example, the beginning of the proof of alt_alternate can be represented as Const(Lambda(l1, App(list, nat), App(list_ind, . . . )))</p><p>where list and nat are the Ind terms for the algebraic datatypes for polymorphic lists and natural numbers, respectively, and list_ind is the Const term for a list induction principle.</p><p>Coq version 8.5, the first stable release to include architectural changes to support a document-oriented interaction model <ref type="bibr">[7]</ref>, introduced the option to quick-compile .v files to the binary .vio format, a process which avoids checking (and emitting representations of) proofs that have been indicated as opaque by ending with Qed. Only the type (assertion) of an opaque identifier such as alt_alternate, i.e., not the body term, can be referenced in other parts of a Coq development, whence type checking of all such terms can normally be performed in complete isolation. Specifically, .vio files contain proof-checking tasks, which can be performed individually by issuing a coqc command referencing the task identifier. A Coq user can depend on more rapidly produced .vio files in lieu of .vo files in most developments, but must then assume that all proofs are correct.</p><p>For example, the lemma alternate_correct (AlternateLem.v) in the Coq development in Figure <ref type="figure">2</ref> depends on the types (assertions) of alternate_alt and alt_alternate, but not their proofs; consequently, the proof of alternate_correct need not be re-checked if only the proof of alt_alternate is changed. In this case, the sole required action is to re-check the proof of alt_alternate, which can be accomplished by first quick-compiling Alternate.v and then running the single proof-checking task in Alternate.vio. Figure <ref type="figure">5</ref> illustrates the possible workflows for Alternate.v made possible by Coq's document-oriented model.</p><p>Coq uses a notion of sections to organize common assumptions made in a collection of lemmas, say, that equality on type A is decidable (A_eq_dec). A lemma may reference one or more such assumptions, which then become quantified variables that must be instantiated when the lemma is referenced outside of the section. However, by default, Coq only determines the used section variables of a lemma when the end of the section is reached. This means that the final type (assertion) of the section lemma is not known when considered in isolation, whence its proof cannot be immediately checked as an asynchronous task. To get around this problem, Coq allows section lemmas to be annotated with the assumptions they use (e.g., Proof using A_eq_dec). The required annotations can be derived from metadata produced by Coq during compilation of source files to .vo files <ref type="bibr">[51]</ref>, and then inserted back into the source files. In the evaluation of our technique, we used this approach to add annotations to all revisions of the projects under study as a separate initial step.  III. TECHNIQUE This section describes our proof selection technique. We first describe its phases at a high level, then details on the lower-level steps, and finally our implementation in the ICOQ tool. The key idea is to incrementally build and analyze both coarse-grained and fine-grained dependency graphs to produce the minimal set of proofs that need to be checked after a change has been made to a project. The advantage of our technique compared to the timestamp-based incremental processing of files stems from that, generally, checking a few proofs in isolation spread out across a development takes much less time and effort than checking all proofs in all affected source files.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Phases</head><p>Roughly, our technique follows the three phases of a typical regression test selection technique <ref type="bibr">[64]</ref>: an initial analysis phase that locates proofs affected by recent changes, followed by an execution phase that checks the selected proofs, followed by a final collection phase which produces dependencies for the next revision. We assume that both the file-level and lemma-level dependencies and checksums of the last revision of the project are available at the start of the initial phase. Analysis phase: First, for each source file in the project, we check whether its checksum is still the same since the last revision. Then, we perform file-level dependency analysis and build an up-to-date coarse-grained dependency graph that includes checksums, with changed files marked. This graph is then used to quick-compile the changed source files, allowing us to compute checksums of the term representations of individual definitions and lemma statements that may have changed. At the same time, we also determine the proof tasks available in each changed source file, and compute the checksum of each proof script associated with a proof task. Using our knowledge of proof tasks and checksums for finegrained entities, we obtain a fine-grained dependency graph where each modified entity is marked, and from which recently removed entities are purged.</p><p>By going through all modified entities in the fine-grained dependency graph, we then calculate the transitively impacted entities, and mark them in the graph. The set of proof tasks to execute is then precisely the tasks associated with the set of modified and impacted entities. Note that this process of discovering impacted proofs is similar to the process of "invalidating the upward transitive closure" in some build systems, e.g., Bazel <ref type="bibr">[8]</ref>. Execution phase: Given the list of proof tasks and their associated source files and binary quick-compiled files from the previous phase, we emit the commands for checking those tasks. After each command is executed, we note the dependencies of the proof on other lemmas and definitions; this information is only available when the proof term has actually been constructed and stored in memory. Collection phase: This phase finds the dependencies of all modified definitions and lemmas by extracting them from the quick-compiled files and combining the results with the proof dependencies obtained in the previous phase. We use these dependencies to build a complete up-to-date fine-grained dependency graph that includes checksums. We then store this graph as a file, to be used in the analysis phase of the next project revision.</p><p>Running example: We exemplify our technique for Coq using the code in Figure <ref type="figure">2</ref>. Assume that we integrated ICOQ in the project at revision v1. At that revision, we compute the checksums of all .v files, run coqdep on them, and build the graph shown in Figure <ref type="figure">3(a)</ref>; no checksums existed in revisions prior to v1 and therefore the current values are considered different by definition. Since all file checksums are different, we quick-compile all files into .vio files and compute all the checksums for all definitions and lemma statement terms. Then, we note the proof tasks in each file and compute checksums for the associated proof scripts. Again, all checksums are different by definition, so we check the proofs of all lemmas (alt_alternate, alt_exists, alternate_alt, and alternate_correct). From the corresponding proof terms, and the terms for alternate and alt, we construct the graph in Figure <ref type="figure">3</ref>(b) and add checksums for all nodes. The graphs and checksums are then stored for future use. Suppose that the developer of the example Coq project rewrites the definition of the function alternate to the one in Figure <ref type="figure">6</ref>; this change leads to a new revision v2 of the project. At the file level, the checksum of Alternate.v becomes different from before. However, coqdep reveals that the file dependency graph is still the same as in Figure <ref type="figure">3(a)</ref>. Since the other .v files depend on Alternate.v, we compile all .v files into .vio files in some order allowed by the graph. After then computing checksums of terms (using Alternate.vio) and proof scripts (using Alternate.v), we conclude that only (the body of) alternate has been modified. Using this information and the graph in Figure <ref type="figure">3(b)</ref>, we determine that the proofs of alt_alternate, alternate_alt, and alternate_correct are impacted and must be checked. Consequently, we run the commands to check these proofs (while alt_exists is not checked, because it was not impacted).</p><p>After each proof checking task has completed, we note that no dependencies in the proofs have changed. Finally, we extract and analyze dependencies from the only modified nonproof term (alternate), confirming that the graph in </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. ICOQ Components and Workflow</head><p>Our current implementation of the technique is written in OCaml, Java, and bash. We developed a number of separate Coq tools and plugins. Since Coq developments are not upwards or downwards compatible in general, we target Coq version 8.5 to support the largest range of project revision histories susceptible to asynchronous proof checking; we expect no fundamental issues with supporting 8.6 and future Coq versions. Our tools and plugins can also be used (and be useful) outside the context of ICOQ.</p><p>coq-depends plugin: To extract dependencies from compiled Coq files (.vo and .vio), we adapted and extended previous work on the coq-dpdgraph Coq plugin [2], which builds dependency graphs for given identifiers or modules (files). In essence, the derived plugin, called coq-depends, traverses a Coq term abstract syntax tree (AST), and records the globally unique ("kernel") name of all referenced identifiers it encounters, such as those of inductive types, lemmas, and functions. By performing the dependency extraction at the level of ASTs in the Coq backend, our tool is isolated from complexities at the Gallina level, such as custom notations and implicit arguments. In contrast to coq-dpdgraph, coq-depends does not perform recursive dependency extraction, and supports .vio files, which do not contain the proofs of opaque identifiers that coq-dpdgraph expects to be present. The plugin makes no distinction between depending on an identifier of a lemma or function that is inside the scope of a project or outside it. In particular, if there is a dependency on a lemma in the Coq standard library, which is normally assumed to be stable across revisions, it must be filtered out from the plugin output to be excluded from analysis. For example, from the proof term for the lemma alt_alternate described in section II, coq-depends extracts the set of kernel names {Alternate.alt, Alternate.alternate, Coq.Init.Datatypes.list, Coq.Init.Datatypes.list_ind, . . . }.</p><p>Here, to filter out unnecessary dependencies, it suffices to exclude names with the prefix "Coq.".</p><p>coq-ast plugin: To compare Coq identifiers across project revisions, we developed a plugin for computing short summaries (digests) of Coq term ASTs that capture the structure of the trees. We use a technique for computing summaries based on cryptographic hashes that was shown to be effective at programming language syntax fingerprinting by Chilowicz et al. <ref type="bibr">[14]</ref>. More specifically, letting C be a hashing function, &#8226; the string concatenation operation, t a term AST with root node r and child trees t 1 , . . . ,t n , and V a function from AST nodes to strings, Chilowicz et al. define a hash function</p><p>Note that this function, which we implemented in OCaml with C = MD5, is linear in the number of nodes in the tree.</p><p>The function V is defined in an obvious way based on the syntax in Figure <ref type="figure">4</ref>; as an example, Figure <ref type="figure">7</ref> shows a fragment of the AST of the proof of alt_alternate in Alternate.v where V has been applied to each node. To keep ASTs as shallow as possible, we do not unfold bodies of referenced inductive types or constants, and simply use their (unique) kernel names.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Const</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lambda</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Name l1</head><p>App App</p><p>Ind list Ind nat Const list_ind . . . coqdigest tool: Since we cannot compute digests of ASTs of opaque identifiers without actually performing all the proof-checking work (that we are trying to skip), we use digests of the actual proof scripts ("tactic soups") in the .v files. From the standard coqdoc tool which translates .v files into documentation, we derived a tool dubbed coqdigest that extracts the proof scripts of opaque lemmas while ignoring sequences of characters that do not affect semantics, and returns the MD5 hash of the results. The tool also notes whether a lemma is admitted, i.e., whether an identifier with an unfinished proof is assumed as complete for the rest of the development; this is a common device used in early phases of verification projects.</p><p>For example, when parsing AlternateLem.  coqc dependency extraction extension: A proof term for a proof task in a .vio file is only available when the proof task completes. Yet, to properly update the identifier dependency graph for the next revision, all dependencies must be extracted from such terms. Consequently, we extended the coqc tool with an additional command that, when given a .vio file, its associated .v file, and a proof task, checks the task and then outputs all the dependencies in the proof term using the technique from coq-depends. Due to how the proof checking interface works in Coq 8.5, accessing the proof term is only possible when the proof is complete, i.e., has not been admitted. For this reason, ICOQ ignores checking proofs of admitted lemmas, although changes in their statement (type) can lead to checking of other proofs that depend on them. Since our coqc extension only uses the existing proof checking facilities, it does not affect the soundness of Coq.</p><p>Dependency graph builder and analyzer: We implemented our own dependency graph builder and dependency analysis in Java. The resulting program reads files (mostly in JSON format) output by the Coq tools and plugins, as well as JSON representations of dependency graphs from previous revisions, and finally writes the updated dependency graphs to disk.</p><p>Toolchain workflow: If all proofs in a .v file need to be checked, compiling a .vo file is usually significantly faster than first producing a .vio file and then executing all proof tasks. Consequently, we compile all .v files in the initial revision of a project into .vo files, and via those, extract dependencies directly from both proofs and definitions.</p><p>For subsequent revisions, the toolchain workflow (illustrated in Figure <ref type="figure">8</ref>) follows the general steps of the technique outlined in section III-A. First, the Java program reads the JSON representations of the file-level and identifier-level dependency graphs from the last revision. Then, it computes checksums of all .v files in the revision, runs coqdep on changed files, parses the output, and updates the file-level dependency graph. Using the graph, the program calls coqc to quick-compile all impacted files into .vio files. Then, it runs coqdigest on all new and changed .v files, and coq-ast on their .vio counterparts, obtaining (via parsing of JSON files) checksums for all identifiers and a list of proof tasks. This is sufficient to enable marking all impacted identifiers in the dependency graph. From the updated graph, the program obtains and runs all proof tasks associated with impacted identifiers using the extended coqc command, and then parses and incorporates the JSON output into the fine-grained dependency graph. Finally, it uses coq-depends to obtain the dependencies of all impacted non-proof identifiers, writing the up-to-date graph to disk along with the file-level graph.</p><p>IV. EVALUATION To assess the usability of ICOQ on large verification projects, we answer the following research questions: RQ1: How effective is ICOQ (compared to the state-of-theart techniques), i.e., what is the reduction in the number of checked proofs? RQ2: How effective is ICOQ in terms of the proof checking time in a continuous integration environment? RQ3: How effective is ICOQ in terms of the proof checking time outside a continuous integration environment (i.e., for verification on a user's machine)?</p><p>We run all experiments on a 4-core Intel Core i7-6700 CPU @ 3.40GHz with 16GB of RAM, running Ubuntu 14.04 LTS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Verification Projects Under Study</head><p>Table <ref type="table">I</ref> shows the list of projects used in our study; all projects are publicly available, all but one on GitHub <ref type="bibr">[3]</ref>. We selected projects based on (a) public availability of their revision history during principal development, (b) compatibility of their revision history with Coq 8.5, (c) their size and popularity, and (d) our familiarity with their codebases; the latter was necessary for a successful experimental setup. For each project, we list the name, reference the repository location, and show the last revision/SHA we used for our experiments, the number of lines of Coq code (LOC) for the last revision (as reported by cloc <ref type="bibr">[1]</ref>), and the number of revisions for the experiments. Based on projects' characteristics, we say that the first four projects are micro-benchmarks, and the other three projects are large-scale proof developments.</p><p>Verdi and Verdi Raft: Verdi is a framework for verification of implementations of distributed systems <ref type="bibr">[62]</ref>. While the framework is not currently tied to any one particular verification project, it was initially bundled with a verified implementation of the Raft distributed consensus protocol <ref type="bibr">[63]</ref>. We consider revisions from Mar to Jun 2016, before Verdi and the Raft implementation were separated. Each revision comprises over 50k LOC, making Verdi one of the largest publicly available software verification projects. Many Verdi proofs use extensive custom tactic-based automation; the resultant long proof-checking time was one of the initial motivations for developing ICOQ.</p><p>UniMath: UniMath is a comprehensive library of formalized mathematics based on the univalent interpretation, suggested by Voevodsky, of the types in Coq as so-called homotopy types rather than mathematical sets <ref type="bibr">[56]</ref>. The revisions of UniMath under study are from Jan to Mar 2016, and each consist of more than 43k LOC.</p><p>Flocq: Flocq is a Coq library that formalizes floating-point arithmetic in several representations <ref type="bibr">[13]</ref>, e.g., as described in the IEEE-754 standard. Flocq is used in the CompCert verified C compiler to reason about programs which use floating-point operations <ref type="bibr">[12]</ref>. We considered revisions of Flocq from Jan to Mar 2016, each consisting of more than 22k library LOC.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Variables</head><p>Independent variables: We manipulate two independent variables: proof checking techniques and the development environment. Regarding the proof checking techniques, we use (a) Coq's timestamp-based toolchain that we described in Section II (we refer to this technique as coq_makefile), and (b) ICOQ that implements regression proof selection.</p><p>Our development environments include CI-Env and LO-Env. CI-Env describes an environment that uses a Continuous Integration Service (CIS) to check proofs. Note that a CIS checks proofs in a clean environment for each revision. LO-Env describes an environment where developers use their local machines to check proofs. Note that file timestamps are preserved in the latter case, but not in the former.</p><p>Dependent variables: Our dependent variables measure the effectiveness of proof selection techniques at reducing the amount of effort required to reproof modified programs. To do this we compute the proof selection percentage and measure the proof checking time. The proof selection percentage is derived from the ratio of selected proofs to the total number of available proofs executed by coq_makefile in the CI-Env environment. We use P sel to denote this variable. Proof checking time is measured as the end-to-end time that includes all phases (described in detail in Section III) of our proof selection technique.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Experiment Procedure</head><p>Figure <ref type="figure">9</ref> illustrates our experiment procedure that collects the data necessary to answer our research questions. As input, the procedure accepts one of the projects under study (Table <ref type="table">I</ref>), a number of revisions to be used in the experiment, and a development environment (either CI-Env or LO-Env). In the initial step (line 2), the procedure clones the project repository from the URL in Table <ref type="table">I</ref>. Next, the procedure iterates over the latest &#954; revisions, from the oldest to the newest revision. In each iteration of the loop, the procedure (a) obtains a copy of the project for the current revision (line 4), (b) configures the project (as the preparation for the proof checking), and (c) selects proofs that are affected by changes and checks those end for 11: end procedure proofs. Finally, if the procedure is simulating the CI-Env, the timestamps of all files have to be updated.</p><p>It is important to observe that we need to save dependency files for ICOQ between two revisions. Recently, many CISs have started supporting caching <ref type="bibr">[4]</ref>, <ref type="bibr">[49]</ref>, which we can utilize to store the dependencies. Considering that caching is fast and ICOQ's dependency files are small, we do not associate any overhead with keeping dependencies in the CI environment.</p><p>One of the key steps in the experiment procedure is to select and check proofs (line 6). During this step, our procedure stores the execution logs, which include the list of selected proofs and the proof checking time. We analyze these logs in the following subsection to answer our research questions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Results</head><p>We obtained all necessary data by invoking the procedure in Figure <ref type="figure">9</ref> twenty-eight times: one invocation for each project in Table <ref type="table">I</ref>, two proof checking techniques (coq_makefile and ICOQ), and two environments (CI-Env and LO-Env). In total, we selected and checked proofs on 112 revisions.</p><p>RQ1: How effective is ICOQ (compared to the state-ofthe-art techniques), i.e., what is the reduction in the number of checked proofs?: Figure <ref type="figure">10</ref> shows the proof selection percentage for two (out of four) micro-benchmarks. We can observe substantial reduction in the number of executed proofs at many revisions. Overall, across all revisions, we find that ICOQ executes 226 (on average 22.60) and 398 (on average 39.80) proofs for InfSeqExt and StructTact, respectively. On the other hand, we find that the coq_makefile technique executes 1,240 (on average 124.00) and 1,635 (on average 163.50) proofs for InfSeqExt and StructTact, respectively. In other words, ICOQ reduces the number of checked proofs by 81.78% and 75.66% for InfSeqExt and StructTact, respectively. Although we obtained proof selection percentages for the other two micro-benchmarks (WeakUpTo and CTLTCTL), we do not show these numbers because the developers of the projects have not changed any code in the last 10 revisions. As expected, ICOQ has not selected any proofs for execution. Note that open-source projects have frequent non-code changes that have no impact on tests/proofs <ref type="bibr">[24]</ref>; these changes can include changes in documentation and metadata files.</p><p>Finally, we show the results for the three largest projects used in our study. We format the results slightly differently for several reasons, including a large number of revisions and a low proof selection percentage that is not appropriate to be visualized with a bar chart. Table <ref type="table">II</ref> shows the results; the table contains two parts, and we discuss each part in turn.</p><p>The top part of the table shows result summaries for each project; the sum and the average values are computed across 24 revisions. The third column shows the number of proofs selected by ICOQ and the fourth column shows the total number of proofs at each revision; the fifth column shows the proof selection percentage. For example, for Verdi, we find that ICOQ executes a total of 4,458 proofs, while the existing technique executes 65,413 proofs across the same set of revisions. In other words, across all revisions, the proof selection percentage for ICOQ is 7%. Note that the proof selection percentage is the same regardless of the execution environment (CI-Env vs. LO-Env).</p><p>The bottom part of the table shows detailed results for Verdi. We show the values for each revision; the revision SHA is shown in Column 2.  RQ2: How effective is ICOQ in terms of the proof checking time in a continuous integration environment?: We used the three large verification projects not only to obtain a proof selection percentage but also to obtain the proof checking time. First, we consider the CI-Env development environment. Recall that in CI-Env, coq_makefile will always execute all proofs and thus be costly. On the other hand, ICOQ saves time by only running a subset of all proofs. Table <ref type="table">II</ref> shows the proof checking time. Columns 6 and 7 show the proof checking time for CI-Env when using coq_makefile and ICOQ, respectively. Table <ref type="table">III</ref> shows the summaries. In summary, ICOQ reduces the proof checking time 2.92&#215;, 3.44&#215;, and 9.62&#215; for Flocq, UniMath, and Verdi, respectively. Note that CI-Env is of the highest importance due to the proliferation of CISs.</p><p>Although we also measured proof checking time for microbenchmarks, we find that the time savings are insignificant in those cases due to very fast proof checking. Similar to regression test selection tools, which inspired our work, we believe that ICOQ will be most beneficial to large verification projects with many dependencies and elaborate proofs.</p><p>RQ3: How effective is ICOQ in terms of the proof checking time outside a continuous integration environment (i.e., for verification on a user's machine)?: We were curious what savings could be obtained with ICOQ in the LO-Env development environment. As when obtaining our answer to the previous question, we measured the proof checking time for large verification projects. Columns 9 and 10 in Table <ref type="table">II</ref> show time for coq_makefile and ICOQ, respectively. We can see that coq_makefile can save some proof checking time in LO-Env, i.e., whenever changes do not affect code. However, even if a change has minimal effect on code (e.g., in revision 9403f6f5 for Verdi), coq_makefile runs (almost) all proofs. We find (Table <ref type="table">III</ref>) that ICOQ reduces the proof checking time 1.13&#215;, 2&#215;, and 3&#215; on average, for Flocq, UniMath, and Verdi, respectively.</p><p>We believe the greater reduction in proof checking time for Verdi is primarily due to its many long-running proofs and opaque constants (that end in Qed.). In contrast, UniMath contains many non-opaque constants whose processing cannot be deferred during quick compilation, and nearly all proofs in Flocq have a relatively short running time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. DISCUSSION</head><p>Safety: In a regression testing context, a test selection technique is safe when, for every possible change to a project, the technique never omits to run a test affected by the change <ref type="bibr">[47]</ref>. Analogously, a proof selection technique is safe whenever no necessary proof checking task is ever omitted. ICOQ currently gives no formal guarantee of safety in this sense; a proof of safety would have to reason about Coq's toolchain, which is certainly possible at an abstract level, but difficult to do at the level of code. Nevertheless, verifying safety for a proof selection algorithm for Coq and Gallina is arguably more straightforward than doing so for a test selection algorithm for an object-oriented language with elaborate semantics (e.g., Java), which may include complicated features such as dynamic class loading.</p><p>Tactic language dependencies: ICOQ currently does not perform parsing and dependency analysis of custom tactics defined in the Ltac language that occur in source files. This means that an isolated change in the definition of a tactic never results in lemmas whose (unedited) proof scripts contain calls to that tactic being marked as "changed", even though the semantics of such a proof script may have changed. Analysis of Ltac definitions is a planned future extension of ICOQ. Similar concerns as for Ltac hold for custom Coq language extensions written in OCaml that are used in some projects.</p><p>Universe constraints: Sozeau and Tabareau recently introduced support for generic Coq definitions that can be used across universes of types <ref type="bibr">[48]</ref>. However, Coq's toolchain for asynchronous proof processing ignores universe constraints, since such constraints must be checked for consistency at the global level <ref type="bibr">[51]</ref>. Consequently, Coq projects that make heavy use of universe polymorphism are not good targets for ICOQ.</p><p>Parameterized modules: A Coq module encapsulates a collection of definitions and lemmas in a namespace. A parameterized module, or functor, takes modules with a certain signature as input, and can contain lemmas involving types in its parameters. Consequently, the file that contains the functor has corresponding proof tasks for those lemmas. However, no identifiers are exposed at the global level until the functor is fully instantiated with argument modules, eluding coq-ast. This problem can be solved, e.g., by conservatively compiling the file to a .vo file, checking all proofs. However, functors appear to be used rarely outside of the standard library; of the projects under study only Verdi uses them, and in a minimalistic way. Hence, we omitted support for functors in the initial version of ICOQ.</p><p>Overhead: ICOQ introduces several sources of overhead compared to LCF-style top-down processing of .v files into .vo files. One source is quick compilation and task-based proof checking itself, which is performed in independent phases and requires book-keeping for lemmas and proofs. Additionally, ICOQ requires computing a fine-grained dependency graph and checksums to discover the impact of changes to a development. Consequently, ICOQ may not be suitable to use in small-scale projects, since the overhead can make regression proof selection as a whole take longer to complete than straightforward compilation to .vo files; similar conclusions were drawn for regression test selection <ref type="bibr">[24]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. THREATS TO VALIDITY</head><p>External: Our results may not generalize to all Coq projects. To mitigate this threat, we used several micro-benchmarks and three large projects. The large projects use different feature sets of Coq and target verification of disparate application domains.</p><p>We used 24 revisions per project (for large projects), from segments in the version histories with active development that were straightforward to compile with Coq version 8.5, the first version with asynchronous proof-checking support and the stable version available when we started development of ICOQ. Our findings could differ for longer sequences of revisions and different segments in software histories. The number of revisions was determined by the setup cost and recent studies of regression testing techniques <ref type="bibr">[24]</ref>.</p><p>Internal: Our implementation of ICOQ, as well as our evaluation infrastructure, may contain bugs. To mitigate this threat, we did extensive testing of our code and code reviews. In particular, we tested ICOQ on a benchmark set of pairs of revisions of small Coq developments representing typical changes to proofs and definitions.</p><p>Construct: We implemented proof selection only for a single proof assistant (Coq). Although our technique should be applicable to other proof assistants (e.g., Isabelle/HOL), further work is needed to confirm the applicability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VII. RELATED WORK</head><p>Incremental verification: Kurshan et al. <ref type="bibr">[36]</ref> consider the problem of incremental verification of models of systems, assuming full verification is expensive. They suggest techniques based on hashes of reduced models to avoid performing re-verification when the required properties still hold in a changed model. This is similar to smart hashing in regression testing <ref type="bibr">[24]</ref>. Henzinger et al. <ref type="bibr">[30]</ref> consider incremental verification of safety properties of programs using model checking. In contrast to regression proving, whose aim is to find failing proofs quickly, their approach uses previous results to attempt to automatically overcome instances where a program change makes verification fail. Bohme et al. <ref type="bibr">[11]</ref> introduced partitionbased regression verification that partitions the input space and gradually performs verification. Godlin and Strichman <ref type="bibr">[26]</ref> define regression verification as establishing the equivalence of successive, related versions of programs. In effect, regression verification is a strengthening of regression testing, which can only provide limited evidence of preserved functionality.</p><p>Parallel and asynchronous proof checking: Coq's 1970s precursor LCF was based on synchronous, sequential interaction between a human prover and the proof tool <ref type="bibr">[60]</ref>. This legacy is reflected in Coq's read-eval-print loop, and by extension, in the top-down interaction with Coq files in classic interfaces such as Proof General. Over time, both the assumption on synchrony and on sequential interaction have been reconsidered, which enabled us to develop ICOQ.</p><p>Support for parallelism in construction and checking of proofs to exploit multi-core hardware has been addressed previously in several proof assistants, notably Isabelle <ref type="bibr">[59]</ref> and ACL2 <ref type="bibr">[23]</ref>, <ref type="bibr">[45]</ref>. Isabelle leverages the support for threads in its "host" compiler, Poly/ML, to spawn proof checking tasks processed by parallel workers. Using a notion of proof promises, proofs that require some previous unfinished result can proceed normally and become finalized when extant tasks terminate. Isabelle also includes a build system with integrated support for checking of proofs and management of parallel workers. ACL2 uses the thread-based parallelism in LISP systems to, e.g., perform parallel proof discovery and fine-grained proof case checking. The lack of native threads in Coq's host language, OCaml, prevents similar low-cost fine-grained parallelism <ref type="bibr">[59]</ref>. However, more coarse-grained parallelism is possible at the level of processes.</p><p>Parallelism at the task level usually necessitates support for some form of asynchrony, which can then also be exploited at the user interface level to provide greater interactivity. Architectural changes in Isabelle towards a document-oriented asynchronous interaction model were pioneered by Wenzel <ref type="bibr">[60]</ref>, resulting in the Prover IDE (PIDE) framework. PIDE defines an XML-based protocol between a proof assistant backend and clients such as IDEs. Efforts to bring asynchronous interaction to Coq were initiated by Wenzel <ref type="bibr">[58]</ref> and Barras et al. <ref type="bibr">[6]</ref>, resulting in a new Isabelle-inspired document-oriented interaction model and support for asynchronous proof processing in Coq 8.5 <ref type="bibr">[7]</ref>. The potential of Coq's new document model to improve user productivity was highlighted in an extension to the Eclipse IDE called Coqoon by Faithfull et al. <ref type="bibr">[21]</ref>, which performs fine-grained monitoring of changes to Coq files and reactively processes modified definitions and proofs.</p><p>Regression testing: There has been more than three decades of work on regression testing techniques <ref type="bibr">[42]</ref>, <ref type="bibr">[64]</ref>. These techniques were the key inspiration for the work presented in this paper. Specifically, our work is closely related to regression test selection (RTS) <ref type="bibr">[10]</ref>, <ref type="bibr">[18]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[37]</ref>, <ref type="bibr">[42]</ref>, <ref type="bibr">[43]</ref>, <ref type="bibr">[46]</ref>, <ref type="bibr">[47]</ref>, <ref type="bibr">[52]</ref>, <ref type="bibr">[64]</ref>. Most of the pioneering work on RTS has studied techniques that collect, for each test, fine-grained dependencies, e.g,. statements and methods. These techniques are frequently unsafe (i.e., they may miss to select some affected tests) for modern programming languages. Recently, Gligoric et al. <ref type="bibr">[24]</ref> introduced Ekstazi, an RTS technique that collects dynamic file dependencies; Ekstazi is more inclusive than prior techniques. Interestingly, we have decided to use finegrained dependencies for proof selection in ICOQ. Our insight is that Gallina does not include the language features that make many RTS techniques unsafe for imperative languages, e.g., dependency injection, class inheritance, and macros. To the best of our knowledge, ICOQ is the first proof selection tool.</p><p>Build systems: Our dependency graph is similar to dependency graphs seen in build systems like Google's Bazel <ref type="bibr">[8]</ref> and Microsoft's CloudMake <ref type="bibr">[15]</ref>, <ref type="bibr">[20]</ref>. Bazel keeps track of dependencies on a level of targets. Similarly to how ICOQ discovers changed proofs and definitions, these modern build systems discover affected targets by computing checksums of the files used by the target and then marking all nodes/targets that depend on the modified node/target.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VIII. CONCLUSIONS</head><p>We presented a technique for regression proof selection in large-scale verification projects, and its implementation for the Coq proof assistant in the tool ICOQ. In particular, ICOQ is suitable for use in continuous integration systems to quickly find failing proofs in rapidly evolving projects. By tracking fine-grained dependencies, ICOQ avoids checking unaffected proofs as changes are made to files. Our evaluation shows that using ICOQ is up to 10&#215; faster than checking all proofs from scratch (which is typical in a CI setting). ICOQ can also be used from the command line, as an alternative to the default Makefile-based toolchain; our evaluation shows that ICOQ is up to 3&#215; faster in this case. While our implementation is Coqspecific, our technique works in any setting where it is possible to separate the processing of source files with proofs scripts into a fast pre-processing phase and a mostly independent, potentially time-consuming proof-checking phase.</p></div></body>
		</text>
</TEI>
