<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>UniLoc: Unified Fault Localization of Continuous Integration Failures</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>May 2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10442436</idno>
					<idno type="doi">10.1145/3593799</idno>
					<title level='j'>ACM Transactions on Software Engineering and Methodology</title>
<idno type="ISSN">1049-331X</idno>
<biblScope unit="volume">Early access</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Foyzul Hassan</author><author>Na Meng</author><author>Xiaoyin Wang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Continuous Integration (CI) practices encourage developers to frequently integrate code into a shared repository. Each integration is validated by automatic build and testing such that errors are revealed as early as possible. When CI failures or integration errors are reported, existing techniques are insufficient to automatically locate the root causes for two reasons. First, a CI failure may be triggered by faults in source code and/or build scripts, while current approaches consider only source code. Second, a tentative integration can fail because of build failures and/or test failures, while existing tools focus on test failures only. This paper presents UniLoc, the first unified technique to localize faults in both source code and build scripts given a CI failure log, without assuming the failure’s location (source code or build scripts) and nature (a test failure or not). Adopting the information retrieval (IR) strategy, UniLoc locates buggy files by treating source code and build scripts as documents to search and by considering build logs as search queries. However, instead of naïvely applying an off-the-shelf IR technique to these software artifacts, for more accurate fault localization, UniLoc applies various domain-specific heuristics to optimize the search queries, search space, and ranking formulas. To evaluate UniLoc, we gathered 700 CI failure fixes in 72 open-source projects that are built with Gradle. UniLoc could effectively locate bugs with an average MRR (Mean Reciprocal Rank) value of 0.49, MAP (Mean Average Precision) value of 0.36, and NDCG (Normalized Discounted Cumulative Gain) value of 0.54. UniLoc outperformed the state-of-the-art IR-based tools BLUiR and Locus. UniLoc has the potential to help developers diagnose root causes for CI failures more accurately and efficiently.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>As an emerging software engineering practice <ref type="bibr">[18]</ref>, Continuous Integration (CI) <ref type="bibr">[24]</ref> enables developers to identify integration errors in earlier phases of the software process, significantly reducing project risk and development cost. Meanwhile, CI poses higher demands for efficient fault localization and program repair techniques to improve the continuous success of the practice. Specifically, prior work reports that, on average, the Google code repository receives over 5,500 code commits per day, which makes the Google CI system run over 100 million test cases <ref type="bibr">[65]</ref>. When any commit is buggy, the corresponding and follow-up integration trials ("CI trials" for short) will keep failing until the bug is fixed by another commit. A long-standing CI failure can stop developers from testing commits effectively <ref type="bibr">[6]</ref> and diminish people's confidence in adopting CI <ref type="bibr">[7]</ref>. Existing fault localization (FL) techniques rely on either bug reports or test failures to locate bugs in source code <ref type="bibr">[21,</ref><ref type="bibr">33,</ref><ref type="bibr">79,</ref><ref type="bibr">84]</ref>. However, CI failures bring new challenges to these techniques.</p><p>Challenge 1: Faulty build scripts. Unlike traditional fault localization scenarios where only source code is assumed to be buggy, CI failures can also be triggered by build configuration errors and environment changes. In other words, build scripts can be faulty, but current techniques do not examine these scripts. A recent study <ref type="bibr">[56]</ref> on 1,187 CI failures shows that 11% of failure fixes contain build script revisions only, and 26% of failure fixes involve revisions to both build scripts and source files. The study indicates that without considering build scripts, existing techniques are incapable of handling a large portion of CI failures (i.e., 37%).</p><p>Challenge 2: Non-Test Failure. A tentative integration may not proceed smoothly due to failures other than test failures, while existing fault localization techniques mainly consider test failures to locate bugs. Specifically, Rausch et al. <ref type="bibr">[56]</ref> found that among the five major reasons for CI failures, dependency resolution, compilation, and configuration errors account for 22% of the scenarios, while quality checking errors and test failures take up 13% and 65%, respectively. This finding implies that current tools are applicable to at most 65% of CI failures. To facilitate discussion, we name all failures other than test failures Non-Test Failures. Existing approaches are unable to localize bugs for non-test failures.</p><p>To overcome the above challenges, we developed a novel approach, UniLoc, to suggest a ranked list of candidate buggy files given a CI failure log. Unlike existing fault localization techniques, UniLoc takes both source files and build scripts into consideration, and conducts unified fault localization to diagnose both test failures and non-test failures. The key insight behind UniLoc is that the CI failure log, the source files and build scripts, and the file changes in commit history can complement and cross-validate each other, reducing noise in this heterogeneous environment. 
In particular, we adopted the information retrieval (IR) strategy by treating files as documents and by considering failed build logs as search queries to retrieve documents. Similar to prior work <ref type="bibr">[58,</ref><ref type="bibr">73]</ref>, UniLoc also extracts Abstract Syntax Trees (ASTs) from source files and build files to divide large documents into smaller ones. However, existing IR-based tools cannot perform unified fault localization for two reasons:</p><p>1) Noisy data in queries. Prior IR-based fault localization (IRFL) work uses a given bug report as a whole to retrieve related source code, assuming that everything mentioned in the report is relevant. However, CI logs are usually lengthy and contain a lot of information irrelevant to any failure. Existing approaches do not refine the query to reduce this noise. 2) Noisy data in documents. Prior IRFL work relies on textual relevance to locate bugs given a bug report. However, textually relevant files may not be involved in a failed CI trial, depending on the build configuration. Existing tools do not refine the document set based on the build-target dependencies between modules.</p><p>UniLoc solves the above-mentioned issues by (1) optimizing queries to remove noisy information, (2) optimizing documents to remove files unrelated to a failing build, and (3) tuning candidate ranking to prioritize the most recently changed files. To optimize queries, UniLoc applies a text diff algorithm between the passed build log and the failed build log to extract failure-related text. Even after acquiring the failure-related text, the query can still contain noise such as timestamps and build-process-related information. Such repeated noise is removed through a similarity-based approach. At the same time, the search space is optimized by extracting important source code and build script content through AST analysis. This AST analysis includes only important terms of software entities (e.g., class names, method names, build dependency names) instead of full source code and build scripts that may contain noisy terms. Details of AST-based entity extraction are discussed in Section 3.2.2. Moreover, static build dependency analysis is applied to rule out files and project modules that are not associated with the CI failure. Finally, candidate ranking is optimized based on the heuristic that recent changes are more likely to be the root cause of the CI failure.</p><p>To evaluate UniLoc, we collected 700 real CI failure fixes in 72 GitHub projects from the TravisTorrent dataset <ref type="bibr">[19]</ref>. We used the chronologically earliest 100 fixes for parameter tuning and the remaining 600 fixes for evaluation.</p><p>As with prior work <ref type="bibr">[21,</ref><ref type="bibr">51,</ref><ref type="bibr">58,</ref><ref type="bibr">84]</ref>, we evaluated UniLoc's effectiveness by measuring Top-N (Recall at Top N), MRR (Mean Reciprocal Rank), MAP (Mean Average Precision), and NDCG (Normalized Discounted Cumulative Gain). Our evaluation shows that UniLoc located buggy files with 65% Top-10, which means that in 65% of the scenarios, UniLoc successfully included buggy files in the Top-10 recommendations. On average, the MRR, MAP and NDCG values of UniLoc were 0.49, 0.36 and 0.54, respectively.</p><p>This paper is the first work on unified fault localization of both test failures and non-test failures. 
To compare UniLoc with existing code-oriented IRFL, we applied the widely used IRFL approaches BLUiR <ref type="bibr">[58]</ref> and Locus <ref type="bibr">[73]</ref>, studied in recent work <ref type="bibr">[22,</ref><ref type="bibr">38]</ref>, to the 600 bug fixes. The MRR, MAP and NDCG values of BLUiR were 0.29, 0.19 and 0.39, respectively, much lower than those of UniLoc. For Locus, the MRR, MAP and NDCG values are 0.12, 0.09 and 0.22, respectively, also much lower than those of UniLoc. Such results suggest that existing IR-based FL techniques can localize CI failures to some extent, but they may not effectively localize CI failures in many cases, and their overall performance is lower than UniLoc's. Furthermore, UniLoc optimizes (a) queries, (b) the search space, and (c) candidate ranking to improve fault localization. To learn how sensitive UniLoc is to each applied optimization strategy, we evaluated three variants of UniLoc with one strategy removed per variant. Our experiment shows that all three strategies are useful, and the optimization of the search space boosts the effectiveness most significantly.</p><p>We summarize the contributions of this paper as follows:</p><p>&#8226; We developed a unified fault localization approach, UniLoc, that considers both source code and build scripts to diagnose CI failures. UniLoc includes novel techniques to extract optimized queries from failed build logs, to generate optimized document sets from source files and build scripts, and to rank suspicious files with IR scores and commit history data. &#8226; We constructed a data set of 700 CI failure fixes together with the related failure-inducing commits from real-world projects on GitHub. We open-sourced the data and implementation to facilitate future research in CI failure repair. Our data and program are separately available at <ref type="url">https://sites.google.com/view/uniloc</ref> and <ref type="url">https://github.com/foyzulhassan/UniLoc</ref>. &#8226; We conducted a comprehensive evaluation of the effectiveness of UniLoc. We explored how UniLoc works differently from source-code-oriented IR-based fault localization techniques. We also investigated how different optimization strategies affect the effectiveness of UniLoc.</p><p>The organization of the paper is as follows. After describing the background of this work in Section 2, we introduce UniLoc in Section 3. Section 4 explains the evaluation, Section 5 presents threats to validity, and Section 6 discusses the generalization of our approach. We present the related work and the conclusion in Section 7 and Section 8, respectively.</p><p>[Figure 1 (a): settings.gradle of spockframework/spock, listing its sub-projects: include "spock-bom"; include "spock-core"; include "spock-specs"; include "spock-spring"; include "spock-spring:spring2test"; ...]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Information Retrieval (IR)</head><p>Given a text query Q, an Information Retrieval system searches a document corpus D for relevant documents. To retrieve documents relevant to Q, the IR system computes a similarity score between each document d &#8712; D and Q, and ranks the documents in descending order of the scores. Below is a frequently used formula for similarity calculation:</p><p>$sim(Q, d) = \cos(\vec{q}, \vec{d}) = \frac{\vec{q} \cdot \vec{d}}{\lVert \vec{q} \rVert \, \lVert \vec{d} \rVert}$ (1)</p><p>$\vec{q}$ and $\vec{d}$ are the term weight vectors of Q and d, while $\cos(\vec{q}, \vec{d})$ is the cosine similarity of the two vectors. In particular, the term weight values in each vector are determined by term frequency (TF) and inverse document frequency (IDF). In a typical IRFL approach, source files are treated as documents, while bug reports are used as queries. Similarity scores are computed to assess how likely a file is to be buggy.</p></div>
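<p>To make Formula (1) concrete, here is a minimal Python sketch of TF-IDF term weighting with cosine similarity (our illustration, not UniLoc's implementation, which reuses Lucene; whitespace tokenization and smoothed IDF are simplifying assumptions):</p><p><code lang="python">
import math
from collections import Counter

def cosine(u, v):
    # Cosine similarity of two sparse term-weight vectors (dicts).
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def tfidf_vectors(texts):
    # Raw term counts serve as TF; smoothed IDF keeps weights positive.
    bags = [Counter(t.lower().split()) for t in texts]
    df = Counter()
    for bag in bags:
        df.update(bag.keys())
    n = len(bags)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in df}
    return [{t: c * idf[t] for t, c in bag.items()} for bag in bags]

# Rank two "documents" against a failure-log "query", as in Section 2.2.
corpus = ["class ReportRunner extends Runner", "task compileJava dependsOn spock-core"]
query = "compileJava failed in spock-core"
vecs = tfidf_vectors(corpus + [query])
q = vecs[-1]
ranking = sorted(range(len(corpus)), key=lambda i: cosine(q, vecs[i]), reverse=True)
print(ranking)  # document 1 (the build task) ranks first
</code></p>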
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Project Dependencies</head><p>A build system is an infrastructure that converts source code into artifacts such as modules, libraries, and executable binaries <ref type="bibr">[47]</ref>. A build script specifies how to generate and test the artifacts of a software project. In Gradle, a big project can be divided into several sub-projects. The dependencies between sub-projects are defined in build scripts, and the overall project is referred to as root. When developers commit program changes, not every sub-project needs to be rebuilt or retested. Instead, the build system compiles and tests only the sub-projects being changed and the sub-projects depending on the changed ones.</p><p>Figure <ref type="figure">1</ref> presents three Gradle scripts defined in the project spockframework/spock. In Figure <ref type="figure">1</ref> (a), the script settings.gradle shows that the project includes multiple sub-projects, such as spock-bom and spock-core. In Figure <ref type="figure">1</ref> (b), compile/testcompile project(":spock-core") means that spock-core is needed for Gradle to compile the owning sub-project. Figure <ref type="figure">1</ref> (c) indicates that spock-core is needed to compile and test spock-specs.</p><p>The dependencies between sub-projects can be utilized by UniLoc. For instance, when spock-report does not compile, only its source code and the sub-projects or libraries on which spock-report transitively depends should be examined.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">APPROACH</head><p>Figure <ref type="figure">2</ref> overviews the design of UniLoc. We envision UniLoc being used by developers when they notice a failed build log following the most recent passed log.</p><p>Specifically, UniLoc takes three inputs: the most recent passed build log L_p, the failed build log L_f immediately after it, and the commit history, which includes the most recent passed version V_p, the failed version V_f immediately after it, and the single failure-inducing commit C (file changes) between the two versions V_p and V_f. Prior studies on quality issues of the CI process <ref type="bibr">[67,</ref><ref type="bibr">82]</ref> suggested that developers should follow the practice of immediate, small integrations rather than waiting for large or merged integrations, to avoid tangled integration errors. Therefore, we assume that developers integrate code immediately and that C is the commit that introduced the build failure. To suggest a ranked list of potentially buggy files, UniLoc consists of the following four phases:</p><p>&#8226; Phase 1 compares L_f with L_p to locate the failure-relevant description and to compose a query Q (Section 3.1). &#8226; Phase 2 retrieves documents from V_f and creates project dependency graphs based on build scripts. With the query Q from Phase 1 and the recognized dependencies, this phase refines the search scope (Section 3.2). &#8226; Phase 3 compares each document within the scope against Q to calculate a similarity score and ranks documents accordingly (Section 3.3). &#8226; Phase 4 retrieves C, the failure-inducing commit, and extracts file-level change information to improve the ranking formula (Section 3.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Query Optimization</head><p>To query for buggy documents with L_f, we decided not to use every word in the log. Although there can be thousands of lines of build information in a failed log, only a very small portion of those lines is failure-related. Including unrelated information in a query introduces severe noise when we match Q against documents. We extracted the failure-related part from L_f in two steps: (i) query optimization with text diff (Section 3.1.1), and (ii) noise removal with text similarity (Section 3.1.2).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Query</head><p>Optimization with Text Diff. We observed that a failed log can duplicate descriptions found in a passed log. Such duplicated fragments are usually less informative than the fragments unique to the failed log. Inspired by prior work that uses binary file differencing to locate unreproducible builds <ref type="bibr">[57]</ref>, we applied a textual differencing algorithm, Myers <ref type="bibr">[52]</ref>, to L_p and L_f to identify the failure-related part of L_f. Myers finds the longest common sub-sequence of two given strings. The algorithm is based on finding the shortest edit script, which can be modeled as a graph search. The Myers algorithm underlies popular diff utilities that display the line-by-line deletions and insertions needed to transform one file into another. Example 1 shows a passed log (commit 0545247) and a failed log (commit bf25fdf) of the project BuildCraft/BuildCraft. L_f and L_p denote the CI failed log at version V_f and the latest successful CI log at version V_p before V_f. The delta between V_p and V_f can include source files, build files, or both. We considered only the latest successful version before V_f, as other successful versions may include logs generated by successfully integrating code segments or build logic that is not part of the failure description and may introduce additional noise. In Example 1, after applying the Myers algorithm, the fragments unique to the passed log are highlighted in gray, while the fragments unique to the failed log are highlighted in red and yellow. UniLoc extracts these fragments using Myers. The diff thus splits L_f into two parts: the part that successfully matches certain segment(s) in L_p, and the part unique to L_f. The unique part may still contain segments unrelated to the failure, because some program logic changes (e.g., adding new tests), environment changes (e.g., removing dependencies on libraries), and random issues (e.g., multithreading) can also make L_f look different from previous logs. In Example 1, the yellow text block is such an example of noise, caused by a change in the download process.</p><p>Failure-irrelevant lines in L_f are not responsible for CI failures, and they may be similar to fragments in L_p. To further remove such failure-irrelevant noise, we conducted a line-to-line comparison between L_p and L_f using the Myers algorithm <ref type="bibr">[17]</ref>. If the similarity between any two lines (l_p, l_f) is above a threshold t, we consider the lines to match, so l_f should be removed from the failure-related part. Since error-related segments of L_f are usually very different from the normal output of L_p, this noise removal is unlikely to discard error-related segments. With this step, we can remove the yellow segment in Example 1 while retaining the two checkstyle errors. To decide the optimal value of t, we used 100 CI failure fixes in our dataset (see Section 4.1) as the tuning set. Note that these 100 CI failures are not part of the evaluation set and are all chronologically earlier than the CI failures in the evaluation set. We found 0.9 to be optimal and thus set t = 0.9 by default. Finally, we denote the refined failure-relevant part by L'_f, which UniLoc uses to compose a query Q.</p></div>
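<p>The following sketch illustrates the two steps under simplifying assumptions: Python's difflib (a Ratcliff/Obershelp-style matcher) stands in for the Myers algorithm, and SequenceMatcher.ratio() stands in for the line-similarity measure; the function names are ours, not UniLoc's API:</p><p><code lang="python">
import difflib

def failure_unique_lines(passed_log, failed_log):
    # Step 1 (Section 3.1.1): keep lines the diff marks as unique to the failed log.
    diff = difflib.ndiff(passed_log.splitlines(), failed_log.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]

def remove_noise(unique_lines, passed_log, t=0.9):
    # Step 2 (Section 3.1.2): drop lines still too similar to any passed-log line.
    passed = passed_log.splitlines()
    kept = []
    for lf in unique_lines:
        if not any(difflib.SequenceMatcher(None, lp, lf).ratio() >= t for lp in passed):
            kept.append(lf)
    return kept  # the refined failure-relevant part L'_f

passed = "Download lib-1.0.jar\nBUILD SUCCESSFUL"
failed = "Download lib-1.1.jar\nCheckstyle rule violated at Foo.java:12\nBUILD FAILED"
# The near-duplicate download line is filtered out; the error lines survive.
print(remove_noise(failure_unique_lines(passed, failed), passed, t=0.9))
</code></p>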
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Search Space Optimization</head><p>In addition to optimizing queries, we also optimized the search space for better fault localization. This phase consists of two steps: (i) dependency-based sub-project selection, and (ii) AST-based entity name extraction.</p><p>3.2.1 Dependency-Based Sub-Project Selection. As described in Section 2.3, Gradle build scripts specify dependencies between sub-projects. We developed a parser to analyze those build scripts and extract the dependencies. With the extracted dependencies for each software project, we constructed a build graph G = (P, E), where P = {p_1, . . . , p_n} is the set of sub-projects and E is the set of dependency edges. There is a directed edge from p_i to p_j if and only if p_i depends on p_j. Gradle builds each sub-project p_i only after building all the projects on which p_i depends.</p><p>To reduce the search space, UniLoc finds the mentioned sub-project p_e that is closest to the CI failure in L'_f. Starting from p_e, UniLoc traverses the dependency graph to find all sub-projects on which p_e depends. UniLoc then includes the source files and build scripts of these sub-projects (and of p_e itself) in the search scope, because only these documents are involved in the CI trial for p_e and may be responsible for the failure.</p><p>Figure <ref type="figure">3</ref> shows part of the dependency graph of spockframework/spock. According to the graph, spock-core depends on root, and is depended on by spock-specs and spock-report. In this figure, the CI Error Log Part shows that the failure occurs in spock-specs. With the project dependencies, UniLoc can skip spock-report when searching for buggy files, because that project has no dependency relationship with spock-specs.</p></div>
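<p>Sub-project selection reduces to a reachability query on the build graph G. A minimal sketch follows (the graph literal mirrors the Figure 3 example; the names are illustrative):</p><p><code lang="python">
from collections import deque

# Directed edges of the build graph G: deps[p] = sub-projects that p depends on.
deps = {
    "spock-specs": ["spock-core"],
    "spock-report": ["spock-core"],
    "spock-core": ["root"],
    "root": [],
}

def search_scope(failing, deps):
    # BFS: the failing sub-project plus everything it transitively depends on.
    scope, queue = {failing}, deque([failing])
    while queue:
        for dep in deps.get(queue.popleft(), []):
            if dep not in scope:
                scope.add(dep)
                queue.append(dep)
    return scope

print(search_scope("spock-specs", deps))  # spock-report is excluded from the scope
</code></p>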
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">AST-Based Entity Extraction.</head><p>Prior work <ref type="bibr">[58]</ref> shows that IRFL techniques work better if the search space includes only the names of software entities (e.g., class names, method names) instead of all source code, which contains noisy details. Although we could adopt this technique <ref type="bibr">[58]</ref> for source code, the proper software entities to use may differ in the FL scenarios of CI failures, and there is no counterpart for build scripts. Therefore, we developed a unified mechanism over source files and build scripts to identify the top software entities to include in the search space.</p><p>First, we used UniLoc to generate ASTs for source files and Gradle build scripts, and to extract AST nodes and their textual values. Then, for each subject project in our tuning set, we searched for the textual values in its build scripts and build failure logs. Finally, we counted how frequently the values of each AST node type appear in build failure logs. We consider the four (we use four to be consistent with <ref type="bibr">[58]</ref>) AST node types with the highest frequency as the top software entities. Note that this is the same tuning set we used for query noise removal (see Section 3.1.2). Tables <ref type="table">1</ref> and <ref type="table">2</ref> show the statistical results for Java source code and Gradle scripts. The second column shows the number of build failures where at least one AST node of the type has its textual value appearing in the failed build log. The bold part of the third column shows an exemplar textual value of the AST node type (the rest of the column gives some context). As shown in Tables <ref type="table">1</ref> and <ref type="table">2</ref>, field names, method names, class names, and import items are the top four software entities in Java source code; dependency items, property definitions, module names, and task definitions are the top four software entities in Gradle build scripts. Therefore, we included only the textual values of these entities in the search space.</p></div>
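<p>The frequency-based selection of entity types can be sketched as a counting loop over the tuning set; the AST parsing itself is assumed to happen elsewhere, so this sketch starts from pre-extracted maps of node type to textual values, with hypothetical data:</p><p><code lang="python">
from collections import Counter

def top_entity_types(tuning_failures, k=4):
    # tuning_failures: (entities, failed_log) pairs, where entities maps an AST
    # node type to the textual values extracted for that type in the project.
    # Count, per node type, the failures whose log mentions at least one value.
    freq = Counter()
    for entities, log in tuning_failures:
        for node_type, values in entities.items():
            if any(v in log for v in values):
                freq[node_type] += 1
    return [node_type for node_type, _ in freq.most_common(k)]

# Hypothetical tuning data for illustration.
tuning_failures = [
    ({"MethodDeclaration": {"applyReport"}, "FieldDeclaration": {"cfgPath"}},
     "error: cannot find symbol applyReport in ReportRunner.java"),
]
print(top_entity_types(tuning_failures))  # ['MethodDeclaration']
</code></p>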
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Similarity Score-Based File Ranking</head><p>Traditionally, there are two main types of automatic FL techniques: spectrum-based and IR-based. Spectrum-based techniques exploit the execution coverage information of passed and failed tests to rank suspicious files. The reason why we chose an IR-based approach is two-fold. First, faults in a project may exist in source files of various programming languages and in build scripts. Spectrum-based FL techniques must instrument different language implementations simultaneously to profile executions and analyze test failures; moreover, we intended to locate faults in build scripts even when no test failure exists. Second, the instrumentation used by spectrum-based techniques can modify program behaviors, introduce runtime overhead to program executions, and make some failures impossible to reproduce (e.g., flaky tests). Without reproducing those failures, spectrum-based techniques cannot locate any bug.</p><p>We reused the implementation in Lucene <ref type="bibr">[3]</ref> of the IR technique described in Section 2.2. Given L'_f and the refined search scope, this implementation creates a query vector and a set of document vectors. For each document d &#8712; D (i.e., a Java file or build script), we conducted a separate search for each type of software entity. For Java files, we calculated the similarity scores between Q and d for the entity types field names, method names, class names, and import items. We then took the average of the similarity scores to compute the overall similarity between Q and d:</p><p>$score(Q, d) = \frac{1}{|T|} \sum_{t \in T} sim(Q, d_t)$ (2)</p><p>where T is the set of entity types for d and d_t denotes the sub-document holding d's entities of type t. In Formula (2), if d is a build script, the set of entity types T includes dependency items, property definitions, module names, and task definitions. The score is within [0, 1]. The higher the score of a document, the higher it is ranked.</p></div>
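<p>A sketch of Formula (2): the overall score of a file is the mean of its per-entity-type similarities. Any similarity function in [0, 1] can be plugged in for sim (UniLoc reuses Lucene; the token-overlap stand-in below is ours, chosen only to keep the sketch self-contained):</p><p><code lang="python">
JAVA_TYPES = ["field_names", "method_names", "class_names", "import_items"]
BUILD_TYPES = ["dependency_items", "property_definitions", "module_names", "task_definitions"]

def overall_score(query, entity_docs, sim, is_build_script=False):
    # entity_docs maps an entity type to the text of that sub-document
    # (possibly empty). Formula (2): mean similarity over the entity types.
    types = BUILD_TYPES if is_build_script else JAVA_TYPES
    return sum(sim(query, entity_docs.get(t, "")) for t in types) / len(types)

# Crude token-overlap similarity, a placeholder for the Lucene scorer.
def sim(q, d):
    qs, ds = set(q.split()), set(d.split())
    union = qs.union(ds)
    return len(qs.intersection(ds)) / len(union) if union else 0.0

doc = {"method_names": "applyReport runSpec", "class_names": "ReportRunner"}
print(overall_score("applyReport failed", doc, sim))
</code></p>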
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Ranking Adjustment</head><p>To further improve file ranking, we leveraged the intuition that the files changed in a failure-inducing commit and the file names mentioned in L'_f are more suspicious than other files. If developers noticed a failed version V_f after the most recent passed version V_p, we considered all files changed between V_p and V_f, together with the files mentioned in L'_f, to be suspicious files. We considered only changes between V_p and V_f because a prior study <ref type="bibr">[29]</ref> found that once a CI failure occurs, the CI outcome in most cases remains failing, with a median of four and a maximum of 760 consecutive CI failures. Among such consecutive CI failures, identifying the culprit commit is challenging, because later failures may merely depend on the first failure, or they may be separate, independent failures due to the concurrent nature of code commits. Additionally, a binary-search-based approach to detecting the culprit commit may not work when commits depend on each other, and may take a long time when there are many consecutive CI failures. For simplicity, we assume that after each CI failure, developers will analyze and debug the failure, so we only considered V_f, the first failed version after the most recently passed version V_p. Furthermore, we defined the following formula to adjust the similarity scores of documents:</p><p>$score'(Q, d) = score(Q, d)^{c}$ if $d$ is suspicious, and $score'(Q, d) = score(Q, d)$ otherwise. (3)</p><p>In Equation (3), if a file is suspicious, we raise its score to score(Q, d)^c with 0 &#8804; c &#8804; 1; otherwise, the score score(Q, d) remains the same. Because score(Q, d) lies in [0, 1], exponentiation with c &#8804; 1 can only increase the score. Note that in Xuan et al.'s prior work <ref type="bibr">[80]</ref>, a formula was defined to boost suspiciousness scores when certain files were recently changed; we were inspired by their formula. We opted for file-level fault localization because most CI failures require fixes in multiple files and multiple lines, which is ill-suited to line-level fault localization <ref type="bibr">[44]</ref>.</p><p>To find the optimal value of c, we varied c from 0.0 to 1.0 in 0.1 increments and conducted experiments with the parameter-tuning dataset mentioned in Section 3.1. The experiments showed that 0.1 is the best setting, so we set c = 0.1 by default.</p></div>
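<p>Equation (3) in code, assuming score(Q, d) has already been computed (the function name adjust is ours):</p><p><code lang="python">
def adjust(score, is_suspicious, c=0.1):
    # Equation (3): for score in [0, 1] and 0 ≤ c ≤ 1, score**c ≥ score,
    # so suspicious files can only move up in the ranking.
    return score ** c if is_suspicious else score

print(adjust(0.5, False))  # 0.50
print(adjust(0.5, True))   # 0.5**0.1 ≈ 0.933
</code></p>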
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS AND ANALYSIS</head><p>In this section, we first introduce our dataset (Section 4.1) and evaluation metrics (Section 4.2). We then describe the research questions we explored (Section 4.3) and finally discuss the evaluation results (Section 4.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dataset</head><p>We constructed our evaluation dataset based on TravisTorrent <ref type="bibr">[19]</ref>, a dataset of CI builds. We used the SQL dump of the dataset dated February 8, 2017, which contains build data collected from 2011-08-29 to 2016-08-30. For Java projects, TravisTorrent provides CI log data for three build systems: Ant, Maven, and Gradle; each CI log is a plain text file. A recent study <ref type="bibr">[62]</ref> shows that more than 50% of top GitHub projects use Gradle as their build configuration tool, so we focused our implementation and evaluation on Gradle-based projects.</p><p>From TravisTorrent, we first extracted all CI failures for any Java project built with Gradle. For each CI failure, we extracted the most recently passed version as V_p and its corresponding log as L_p. We also extracted the failed version V_f immediately following V_p and its corresponding log as L_f. This is the information available in the FL scenario of each build failure, so we used it as the input of our tool. We further extracted the following failure-to-pass transition to find out how the failure was fixed. In particular, all files changed in the commit that leads to the first following passed build are considered the ground truth of fault localization for this build failure.</p><p>Here we follow prior FL works <ref type="bibr">[37,</ref><ref type="bibr">55,</ref><ref type="bibr">69]</ref> in using all changes in the fixing commit as the ground truth. In CI, developers only change code when the build failure is confirmed to be related to a defect. With the restriction of requiring a failure-to-pass transition commit, we effectively ruled out the flakiness-related failures reported in recent CI research work <ref type="bibr">[25,</ref><ref type="bibr">26]</ref>.</p><p>With our data collection method, we identified 700 CI failure fixes from 72 Gradle-based projects. As shown in Table <ref type="table">3</ref>, we used the chronologically earliest 100 fixes for parameter tuning (see Section 3.1 and Section 3.4) and for software entity selection in Java source code and Gradle build scripts (see Section 3.2.2). We used the remaining 600 fixes to evaluate the effectiveness of our fault localization techniques. In the dataset, the average numbers of source files and build script files per project are 444.32 and 9.24, respectively. Even though the number of build scripts per project is much lower than that of source files, localizing build-related CI failures is challenging due to the abstraction of build logic and developers' limited domain knowledge about the build process. In many cases, the CI log even suggests that the failure is in a source file when the fault is actually in a build script (see Example 4). Furthermore, there is little or no support for debugging build scripts, which further complicates build script fault localization. For example, Example 2 shows a CI failure where the log suggests that the failure is related to a plugin and lies in the Hystrix module. However, the fix shows that the failure is due to the io.reactivex:rxjava dependency and lies in the hystrix-core module's build script. Such an example shows that fixing CI failures is challenging due to the high-level abstraction of build scripts, and that it requires deep domain knowledge of build logic and project structure. 
Moreover, existing debugging tools cannot be utilized to analyze such CI failures. So having a tool even for file-level FL can substantially reduce developer effort in localizing build-script-related CI failures. More importantly, we manually inspected the 700 failed logs and their corresponding fixes. For the manual inspection, we first classified failures into test failures and non-test failures using the build log analysis mentioned in prior work <ref type="bibr">[30]</ref>. Then the first author manually confirmed the classification by inspecting the logs of each failure. We clustered the data based on (1) the failure type and (2) the bug locations. As shown in Table <ref type="table">4</ref>, 316 (45%) CI failures are test failures and 384 (55%) are non-test failures. This observation implies that existing FL techniques cannot handle most CI failures in our data set, because they mainly rely on the existence of test failures. Moreover, among the 316 test failures, 54 (17%) require fixes in build scripts. Although spectrum-based fault localization (SBFL) works more precisely for test failures, current SBFL techniques focus only on source files without handling build scripts. Furthermore, since SBFL techniques rely on instrumentation, it would be difficult to adapt them to build scripts due to the variety of build tools and plug-ins. In comparison, our new approach is more generic and more applicable because it can locate bugs in both code and scripts, whether or not the failures are related to tests. Additionally, 95 CI failure fixes (14%) changed both source files and build scripts, while another 99 (14%) changed build scripts only. This implies that FL tools that do not analyze build scripts can miss the bug locations of many CI failures. In particular, Example 3 shows a CI failure related to the usage of the tool Crashlytics. To fix the failure, both a build script and a Java file were modified: in the build script, enableCrashlytics was set to false; in the source code, the import declaration of class com.crashlytics.android.Crashlytics was removed. Example 4 shows another CI failure, which is triggered by a test failure. In the example, the unit test throws a ClassNotFoundException because of a missing dependency. Consequently, the related fix added the project dependency to a build script. To fix build-script-related failures like the enableCrashlytics one, developers need specialized knowledge and may spend a long time. In our dataset, source-code-only fixes account for 506 of the 700 failures; the fixes of the remaining 194 failures involve at least one build script. Since in a CI pipeline developers need to fix CI failures as soon as possible to allow further integration, the time between the failed build's commit and the successful build's commit is a good indication of the time spent on a CI failure. According to our analysis, the median time spent is 43.5 minutes for source-only fixes and 73 minutes for the 194 build-script-related failures. This also indicates the complexity of build-script-related failures.</p><p>Our findings above indicate that a considerable portion of CI failures are not triggered by test failures or fixed by modifications to source code. Our dataset thus demonstrates the need for a general fault localization technique that (1) analyzes both source files and build scripts, and (2) handles non-test failures in addition to test failures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation Metrics</head><p>We used the following four widely used metrics <ref type="bibr">[21,</ref><ref type="bibr">51,</ref><ref type="bibr">55,</ref><ref type="bibr">77,</ref><ref type="bibr">81,</ref><ref type="bibr">84]</ref> to measure the effectiveness of FL techniques. &#8226; Recall at Top N (Top-N) calculates the percentage of CI failures that have at least one buggy file reported in the top N (N = 1, 5, 10, . . . ) ranked results. Intuitively, the more failures have their buggy files recalled in the Top-N results, the better an FL technique works. &#8226; Mean Reciprocal Rank (MRR) measures the precision of FL techniques. Given a set of queries, MRR calculates the mean of the Reciprocal Rank values of all queries. The higher the value, the better. The Reciprocal Rank (RR) value of a single query is defined as:</p><p>$RR = \frac{1}{rank_{first}}$</p><p>Specifically, $rank_{first}$ is the rank of the first correctly reported buggy file. For example, for a given query, if 5 documents are retrieved and the 3rd and 5th are relevant, then RR is 1/3 &#8776; 0.33.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8226; Mean Average Precision (MAP) measures precision in a different way. It computes the mean of Average</head><p>Precision values among a set of queries. The higher the value, the better. The Average Precision (AP) value of a single query is:</p><p>$AP = \frac{\sum_{k=1}^{M} P(k) \times rel(k)}{\#\text{buggy files}}$</p><p>Here, k is the rank and M is the number of ranked files; P(k) is the percentage of correctly reported buggy files among the top k results, and rel(k) is a binary indicator of whether or not the k-th file is buggy. For example, if 5 documents are retrieved and the 3rd and 5th are buggy, then AP is (1/3 + 2/5)/2 &#8776; 0.37. &#8226; Normalized Discounted Cumulative Gain (NDCG) is a widely used metric in recommendation systems <ref type="bibr">[51,</ref><ref type="bibr">64]</ref>. The basic concept of this metric is to calculate the relative difference between the recommended ranking and the ideal ranking. NDCG is defined as follows:</p><p>$DCG = \sum_{i=1}^{N} \frac{rel_i}{\log_2(i+1)}, \qquad NDCG = \frac{DCG}{IDCG}$</p><p>where $rel_i = 1$ if the i-th file is related to the fault location, and $rel_i = 0$ otherwise. IDCG is the DCG of the ideal ordering, in which all faulty files are ranked above the unrelated files. For example, if an approach recommends three files of which the 3rd is error-related, the results are represented as {0, 0, 1}, whereas the ideal recommendation is represented as {1, 0, 0}. For this example, the DCG value is 0.5 and the IDCG value is 1.0, so the NDCG value is 0.5/1.0 = 0.5. NDCG ranges from 0.0 to 1.0 and is correlated with the MAP metric, as it also evaluates the positions of ranked items.</p></div>
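<p>The four metrics can be computed in a few lines; the sketch below reproduces the running examples (our illustration; note the AP denominator follows the paper's worked example, i.e., the number of buggy files found in the ranking):</p><p><code lang="python">
import math

def reciprocal_rank(ranking, buggy):
    # Rank of the first buggy file, inverted; 0.0 if none is retrieved.
    return next((1.0 / (i + 1) for i, f in enumerate(ranking) if f in buggy), 0.0)

def average_precision(ranking, buggy):
    hits, acc = 0, 0.0
    for i, f in enumerate(ranking):
        if f in buggy:
            hits += 1
            acc += hits / (i + 1)  # P(k) at each buggy rank k
    return acc / hits if hits else 0.0

def ndcg(ranking, buggy):
    rel = [1 if f in buggy else 0 for f in ranking]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(sorted(rel, reverse=True)))
    return dcg / idcg if idcg else 0.0

ranking, buggy = ["a", "b", "c", "d", "e"], {"c", "e"}
print(round(reciprocal_rank(ranking, buggy), 2))   # 0.33
print(round(average_precision(ranking, buggy), 2)) # 0.37
print(ndcg(["x", "y", "z"], {"z"}))                # 0.5
</code></p>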
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Research Questions</head><p>In our experiment, we investigated the following five research questions:</p><p>&#8226; RQ1 How effective is UniLoc at locating bugs for CI failures? Although fault localization has been an active research area for over a decade, prior works <ref type="bibr">[11,</ref><ref type="bibr">16,</ref><ref type="bibr">58,</ref><ref type="bibr">79]</ref> mostly focus on source code fault localization. In our work, we considered both source code and build scripts for fault localization. Therefore, it is important to measure the effectiveness of UniLoc in comparison with existing approaches. To understand whether UniLoc works better than the naive approach of ranking file names mentioned in the build error log and better than existing tools, we applied UniLoc, the file-name baseline, and the baseline IR-based techniques BLUiR and Locus to the same data set.</p><p>Analysis Result: Our findings show that UniLoc outperformed Baseline1 (file names mentioned in the log), Baseline2 (BLUiR) and Baseline3 (Locus) on all metrics; in particular, the higher MAP and NDCG values indicate that UniLoc works better across different types of CI failures. &#8226; RQ2 What is the impact of recent-change-based file ranking on build fault localization? As build failures are usually caused by recent commits, it might be natural to assume that faults are in the recently changed files. But our analysis finds that in many cases, fixes are in files other than the recently changed ones. To make a quantitative comparison, we compared UniLoc with an approach based on the files changed in the failure-inducing commit.</p><p>Analysis Result: Our findings show that a change-history (reverting-based) approach has limited FL capability across the wide range of CI failure types. Overall, its performance is lower than UniLoc's. &#8226; RQ3 How sensitive is UniLoc to different parameter settings and the strategies applied? To improve the performance of UniLoc, we developed several techniques, such as query optimization and search space optimization, and we needed to evaluate their usefulness. To understand how UniLoc works under different configurations, we changed the parameter values and also created variant approaches by disabling one technique at a time.</p><p>Analysis Result: Our findings show that UniLoc is sensitive to both parameters: t, the similarity threshold between two lines of build information, and c, the exponent used to improve file ranking. Our approach worked best with t = 0.9 and c = 0.1. Among the applied techniques (query optimization, search space optimization, and file ranking optimization), search space optimization contributed most to UniLoc's performance. &#8226; RQ4 How effective is our approach for failures to be repaired in source code only, in build scripts only, and in both? As shown in Table <ref type="table">4</ref>, CI failure fixes can be in source code, build scripts, or both. Prior research works <ref type="bibr">[58,</ref><ref type="bibr">73]</ref> consider only source code. Since UniLoc targets both source code and build scripts, we further measured the effectiveness of UniLoc on failures fixed at different locations.</p><p>Analysis Result: The baselines' performance varies a lot across different types of fixes, but UniLoc has a more robust and balanced performance among all three types of fixes and shows the best overall performance. 
&#8226; RQ5 How effective is our approach for different types of CI failures? A CI failure can be a test failure or a non-test failure. As the characteristics of test failures and non-test failures may differ, measuring the performance of UniLoc on each kind is useful to showcase its effectiveness across failure types.</p><p>Analysis Result: For both test failures and non-test failures, UniLoc performs better than all three baselines. However, UniLoc works more effectively on non-test failures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Results</head><p>RQ1: Effectiveness of UniLoc. To compare UniLoc with existing approaches, we applied UniLoc, the file names mentioned in L'_f (Baseline1), and the state-of-the-art IR-based FL techniques BLUiR <ref type="bibr">[58]</ref> (Baseline2) and Locus <ref type="bibr">[73]</ref> (Baseline3) to our dataset. Since L'_f mentions error file names, during fault localization each file name mentioned in L'_f receives a score of one and all other files receive zero. This comparison helps us learn whether our proposed approach works better than a keyword-based approach that locates failures based on the file names mentioned in L'_f. In addition, we compared UniLoc with BLUiR <ref type="bibr">[58]</ref> and Locus <ref type="bibr">[73]</ref>, state-of-the-art source-code-oriented IR-based FL techniques examined in recent studies <ref type="bibr">[22,</ref><ref type="bibr">39]</ref>. As BLUiR is not publicly available, we reimplemented it with the default configuration parameters and structural code entities mentioned in the paper. Because BLUiR uses bug reports as queries and searches source code for bugs, we extended the tool in two ways to facilitate comparison. First, instead of feeding in a bug report, we used the refined failure-relevant part L'_f as input. We observed that CI logs are very large (typically thousands of lines) compared to bug reports and contain unrelated information such as dependency downloads and CI-server-specific information; applying the full log therefore hurts the baseline approaches' performance. In fact, we fed both the full log and L'_f to BLUiR and observed that MRR and MAP degrade by 17.13% and 22.79%, respectively, when the full log is used. Moreover, recent work <ref type="bibr">[59]</ref> on CI configuration correctness also observed that the error part of the log contains important terms related to the CI failure. So, rather than the full logs, we used L'_f for the baseline FL techniques. Second, for source code we provided AST entities to BLUiR, and for build scripts we provided the entire script contents as text. For the evaluation, we used the publicly available Locus implementation <ref type="bibr">[9]</ref>; we transformed our data into a Locus-compatible form and modified the Locus data input process to analyze multiple projects in batch mode. Like BLUiR, Locus uses bug reports as queries and searches source code for bugs. Locus uses the bug open date and fix date to extract the change history between those dates. To apply Locus to localize CI failures, we used the refined failure-relevant part L'_f as the bug description, the date of the commit that generated the CI failure as the bug open date, and the date of the commit that fixed the CI failure as the bug fix date. As CI failures do not have a failure summary, we left the Locus bug summary empty. Note that the identification of L'_f is based on our technique described in Section 3.1, so the baseline approaches already partly take advantage of our query optimization technique (directly using the whole build log as the query would yield much worse results, similar to random file selection, due to noise in the log). Baseline3's Top-1, Top-5 and Top-10 values are 6.33%, 18.00% and 24.17%, respectively. In comparison, UniLoc's Top-1, Top-5, and Top-10 values are 39.33%, 62%, and 67.83%. 
UniLoc largely outperformed Baseline1, Baseline2 and Baseline3 by recommending more buggy files in the Top-N results. Baseline1's Top-1 results were good because for some build failures, such as compilation errors or static analysis errors, file names are mentioned in the error log. But for complex cases where file names are not directly mentioned, or where the fix is in a different file, Baseline1 does not work well; as a result, its Top-10 performance is lower than that of Baseline2, Baseline3, and UniLoc. Even though Baseline2 and Baseline3 consider all source files and build files in their FL similarity analysis, their performance is lower than UniLoc's. A possible reason for Baseline2's lower performance is that it considers all files with the same weight. Baseline3, on the other hand, considers the entire prior change history to localize faults rather than only the recent changes that triggered the CI failure. But CI failures are generated by recent changes, so recent change history has a high impact on CI fault localization; prior FL techniques <ref type="bibr">[35,</ref><ref type="bibr">70]</ref> also utilize change history to improve performance. In Figure <ref type="figure">5</ref>, Baseline1 achieved 0.38 MRR, 0.26 MAP and 0.41 NDCG; Baseline2 achieved 0.29 MRR, 0.19 MAP and 0.39 NDCG; and Baseline3 achieved 0.12 MRR, 0.12 MAP and 0.22 NDCG. Our proposed approach UniLoc achieved 0.49, 0.36 and 0.54 as MRR, MAP and NDCG, respectively, so UniLoc shows higher effectiveness than the baselines. Specifically, in Figure <ref type="figure">5</ref>, UniLoc has wider value ranges for both MAP and NDCG than the baselines. Since the MAP and NDCG metrics consider the ranking of all buggy files rather than just the best-ranked one, this means that UniLoc's effectiveness can vary across different CI failures.</p><p>Finding 1: UniLoc outperformed Baseline1, Baseline2 and Baseline3 on all metrics; in particular, the higher MAP and NDCG values indicate that UniLoc works better across different types of CI failures.</p><p>RQ2: Recent Change History Based Fault Localization. We observed that 41.14% of CI failure fixes contain at least one line of reverted change in source code or build scripts, so we were curious about purely history-dependent fault localization. For change-history-based FL, we consider the changes in the failure-inducing commit: instead of calculating similarity, we assigned a final score of 1.0 to the files changed in the failure-inducing commit and 0.0 to all other files. With this change-heuristic-driven approach, we calculated the Top-N, MRR, MAP and NDCG metrics and compared the results with UniLoc's. Table <ref type="table">5</ref> shows the effectiveness comparison between the change-based approach and our proposed approach. The change-history-based approach achieves 0.38 MRR, 0.28 MAP and 0.46 NDCG (compared to 0.49 MRR, 0.36 MAP and 0.54 NDCG of UniLoc). It also achieves 27.33%, 51.5%, and 60.5% for the Top-1, 5, and 10 metrics (compared to 39.33%, 62.00%, and 67.83% of UniLoc).</p><p>RQ3: Sensitivity to Parameters and Strategies. UniLoc relies on two parameters: the line-similarity threshold t (Section 3.1.2) and the ranking exponent c (Section 3.4). To explore UniLoc's sensitivity to these parameter settings, we tried t = {0, 0.5, 0.6, 0.7, 0.8, 0.9} and varied c between 0.0 and 0.9 in 0.1 increments. As shown in Figure <ref type="figure">6</ref>, UniLoc obtained the highest MRR value when t = 0.9 and c = 0.1. 
There are three strategies applied in UniLoc: S1 (query optimization, Section 3.1), S2 (search space optimization, Section 3.2), and S3 (file ranking optimization, Section 3.4). To understand how the different strategies influence UniLoc's effectiveness, we also created three variants of our tool: V1, a variant without S1; V2, a variant without S2; and V3, a variant without S3.</p><p>Table <ref type="table">6</ref> shows the effectiveness comparison between the variants, the baselines, and UniLoc. According to the table, without search space optimization (S2) or file ranking optimization (S3), UniLoc worked worse than the baselines. Between S2 and S3, S2 (build dependency analysis and the use of AST entities) plays the most important role in performance improvement. Among the variants, V1 worked much better than V2, which implies that S2 and S3 are more effective than S1. V2 has lower metric values than V1 and V3, meaning that S2 is much more important than the other two strategies. The use of changed files in failure-inducing commits (S3) is also important for UniLoc's performance.</p><p>RQ4: Effectiveness for Different Types of Bug Locations. The bug locations of CI failures may be in source files, build scripts, or both types of files. We further clustered UniLoc's evaluation results among these three kinds of scenarios. As shown in Table <ref type="table">7</ref>, among the 600 evaluated CI failures, 430 were fixed by source code changes only, 78 by modifying build scripts only, and 92 by changes to both types of files. These numbers already show the complexity of CI failure fixes. In Table <ref type="table">7</ref>, we can see that all approaches perform better when the fixes are in build scripts or in both types of files, perhaps because there are fewer build script files than source files. Furthermore, the three baselines perform very differently across the different types of fixes, but UniLoc is more balanced and always has the best performance in terms of MAP and NDCG. As the MAP and NDCG metrics consider the ranking of all faulty files, UniLoc provides a more robust FL ranking over all faulty files. In fact, in the evaluation dataset, the average number of files modified to fix a CI failure is 6.8, with an average of 7.32 source files modified when source changes are required and an average of 1.48 build scripts modified when the failure is build-related. Although in most cases a build-related CI failure requires modifying only one or two of the (on average) nine build script files, it is always difficult to determine whether any build-script change is required at all. Furthermore, identifying faulty build scripts is more challenging than identifying faulty source files due to the high-level abstraction of statements in build scripts, the latent dependencies between build scripts and source code and among build scripts themselves, and the lack of tool support for build-script debugging.</p><p>Moreover, among the 600 CI failures, 347 required modifications to more than one file. This underscores the need for robust FL tools like UniLoc to localize the root causes of CI failures. 
Apart from that, since it is not possible to know the type of fix in advance, UniLoc's balanced and robust performance helps it achieve the expected performance in most cases.</p><p>More surprisingly, Baseline2 (BLUiR) performed better on build script fixes, while Baseline1 (querying file names in L'_f) performed better on source code fixes, which differed from our expectations. Baseline3 (Locus) performed worst for source-related CI failures. After a detailed investigation, we found that, even after our query optimization, the build logs still contain some noise that looks similar to AST elements of build scripts, which misled BLUiR even when the actual fixes were required in source file(s). This result is consistent with UniLoc without optimization S2 or S3, as shown in Table <ref type="table">6</ref>. It shows that even better query-optimization techniques are still needed when applying IR-based FL approaches to CI scenarios. For build scripts, BLUiR performs better because build-related terms from specific build script(s) dominate the build logs, so it simply ranked all build scripts higher and thus had higher Top-10 coverage (note that its Top-1 coverage is much lower than UniLoc's). But if the fix requires changes in multiple build scripts with module dependencies, Baseline2 cannot perform well. As a result, even though its Top-N results (which consider only the first matched file) are better in many cases, Baseline2's MAP and NDCG are lower than UniLoc's. In contrast, Baseline1 performs better on source-only fixes. Our further investigation (presented in Table <ref type="table">8</ref>) shows that its high performance mainly comes from source-only fixes of non-test failures, which are mainly compilation errors and code-style errors. Since file names are often provided for such non-test failures, it is no wonder that Baseline1 performed very well on them. But in many cases, fixing compilation errors and code-style errors requires changes in other files (due to dependencies) that are not mentioned in the failure log. For those cases, Baseline1 may show promising results on the Top-N and MRR metrics, but its MAP and NDCG performance is lower than the others'. This analysis also shows that looking in the error log for the faulty file may not be sufficient to solve CI failures, and such an approach can miss fault-related files during fault localization. For fixes with both file change types, Baseline2 showed promising results on the Top-1, Top-5, Top-10 and MRR metrics. These four metrics consider the first matched file rather than all faulty files. Among the fixes involving both file types, many are related to plug-in, compilation, and dependency failures, where the failed log clarifies which file's configuration or code entities generated the errors. However, fixing those failures sometimes requires changes not only in the files with matching entities but also in other related files whose names or entities are absent from the failure log. Baseline2 can rank the first matching file in a high position but cannot identify the other related files, due to its lack of dependency information and recent-change information. Example 5 shows such a build failure, where the build log explicitly mentioned entities related to the /sandbox/build.gradle file; however, to fix it the developer changed this file as well as four other .java source files. 
In this case, Baseline2 can rank /sandbox/build.gradle in a high position, but it cannot prioritize the other related files. As a result, in terms of the MAP and NDCG metrics, UniLoc outperforms Baseline2 and the other approaches. Baseline3 performed poorly because it considers the entire prior commit history and change hunks for FL. Prior commits may be related to bug fixes, feature additions, or CI fixes, and the approach cannot differentiate CI fixes from other kinds of code modifications. Since other kinds of modifications (e.g., bug fixes, feature additions) are much more frequent than CI-fix modifications, the FL rankings generated by this approach are, in most cases, less relevant to CI failures.</p><p>RQ5: Effectiveness for Different Types of CI Failures. CI failures can be triggered by test failures or non-test failures. We also clustered UniLoc's evaluation results between these two kinds of failures and present the results in Table <ref type="table">8</ref>, which reports the performance of UniLoc and the baseline approaches using the MRR, MAP and NDCG metrics. Since both the Top-N and MRR metrics use only the rank of the first matched buggy file among all buggy files, we believe MRR adequately reflects performance with respect to the first buggy-file match; we therefore did not report Top-1, Top-5 and Top-10 for this analysis. Table <ref type="table">8</ref> offers an important insight: 18.97% of test failure fixes require accompanying changes in build scripts. Although prior research on fault localization <ref type="bibr">[69,</ref><ref type="bibr">85]</ref> identifies that spectrum-based fault localization (SBFL) works better for test failures, based on the test failure fix statistics (see Table <ref type="table">8</ref>), existing SBFL techniques might not work for test failures that require build script changes. Moreover, for SBFL, running these test cases on an instrumented version of the faulty program may not track build script execution traces. Since it is not possible to know the type of fix in advance, UniLoc does have an advantage over SBFL techniques in localizing faults in a balanced way.</p><p>In Table <ref type="table">8</ref>, we can observe that all approaches except Baseline3 performed better on non-test failures than on test failures, which is reasonable as the latter can be more complicated and involve more files. Baseline3 (Locus) was mainly optimized for test failures. However, in a CI environment, test failures often happen due to missing dependencies or run-time class binding that throws exceptions and fails a test. As a result, even though Baseline3 performed better on test failures than on non-test failures, its overall performance is low. At the same time, for test failures such as Example 6 where the file name is mentioned in the build log, Baseline1 can find the first faulty file from the log but cannot identify the dependent files that also require fixing. To fix the failure in Example 6, the developer needed to change six files, only one of which is mentioned in the build log. Baseline2 can find those faulty files, but its ranking is not optimized due to the lack of build dependency information as well as change history information. Overall, for test failures, UniLoc outperformed all baseline approaches on all metrics. 
In terms of NDCG, the improvements over Baseline1 and Baseline2 are 38.77% and 30.61%, respectively.</p><p>Apart from test failures, there can be non-test failures due to configuration errors, compilation issues, static analyzers (e.g., CheckStyle, Lint), etc. For non-test failures involving only source fixes, Baseline1 shows better performance. We therefore analyzed this performance difference and observed that compilation errors and CheckStyle errors are common in this category. In such failures, all or many of the erroneous files are mentioned in the build log, so Baseline1 works better. Example 7 shows such a non-test build failure, where the faulty file is directly mentioned in the build log and the developer made modifications in the mentioned file. But in cases where changing a class file requires further changes in its parent class, Baseline1 cannot localize all the faulty files. For non-test failures involving the build script only, UniLoc outperforms Baseline1 and Baseline2 due to its optimized use of build script ASTs. Baseline1 shows a surprisingly good result for non-test failures with both source and build script file changes, so we analyzed the cases where Baseline1 outperforms UniLoc. From our analysis, we observed that among these 61 failures, 27 were from the same project (BuildCraft/BuildCraft). We then analyzed the commits of these 27 failures (example commit ID: d236d08) and observed that after each build error fix, the developer updated the version number in the root build.gradle file, which is not related to the build error but rather to a Checkstyle convention. Since UniLoc optimizes ranking with precise build dependency information and AST optimization, such a build.gradle file is ranked lower. In contrast, the root build.gradle almost always shows up in the build log, and Baseline1 ranks it high among all files by default because it is in the root folder and its name starts with 'b'. Besides, CheckStyle usually reports in the build log all the file names that violate stylistic rules. So Baseline1 performs very well on these failures. But this type of CI failure is uncommon and project-specific. Even including these 27 failures, UniLoc's performance for non-test failures involving both file change types is only 3.33% lower than Baseline1's in terms of the NDCG metric. Overall, for non-test failures, UniLoc outperformed both Baseline1 and Baseline2 in terms of the three metrics MRR, MAP, and NDCG. The results show that UniLoc performs better than the baselines for both types of failures. </p></div>
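<p>To make the preceding metric comparisons concrete, the following short Python sketch (our own illustration, not part of UniLoc's implementation; the file names in the example are hypothetical) computes Top-N, reciprocal rank, average precision, and NDCG for a single CI failure. It shows why a fix that touches several files can score well on Top-N and MRR, which credit only the first matched file, yet poorly on MAP and NDCG, which credit all faulty files:</p><p><![CDATA[
import math

def ranking_metrics(ranked_files, faulty_files, n=10):
    # Binary relevance: a candidate file is relevant if the fix changed it.
    relevant = [f in faulty_files for f in ranked_files]

    # Top-N: does any faulty file appear in the first n positions?
    top_n = int(any(relevant[:n]))

    # Reciprocal rank of the first faulty file (0 if none is ranked).
    rr = next((1.0 / (i + 1) for i, rel in enumerate(relevant) if rel), 0.0)

    # Average precision over all faulty files; unranked faulty files count as misses.
    hits, precisions = 0, []
    for i, rel in enumerate(relevant):
        if rel:
            hits += 1
            precisions.append(hits / (i + 1))
    ap = sum(precisions) / len(faulty_files)

    # NDCG with binary gains.
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevant))
    ideal = min(len(faulty_files), len(ranked_files))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal))
    ndcg = dcg / idcg if idcg else 0.0

    return top_n, rr, ap, ndcg

# A fix that changed three files, only one of which is ranked:
# Top-10 and RR are perfect (1.0), but AP is 0.33 and NDCG is 0.47.
print(ranking_metrics(
    ["sandbox/build.gradle", "A.java", "B.java", "C.java"],
    {"sandbox/build.gradle", "X.java", "Y.java"}))
]]></p>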
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">THREATS TO VALIDITY</head><p>One potential internal validity threat of our evaluation is that the ground truth we considered for resolving CI failures may contain code changes other than the build fix, such as code enhancement, refactoring, etc. In fact, since the changes in one code commit may depend on each other (a partial commit may not compile or pass tests), there is no clear definition of the relevant part of a repairing commit. We assumed that the one we extracted from the project repository was the best one, as it was the actual developer fix. To reduce the threat, we only considered CI build instances that went from failed status to passed status with modification of build scripts or source code in one single commit. Prior research efforts <ref type="bibr">[37,</ref><ref type="bibr">55,</ref><ref type="bibr">69]</ref> on FL also used the difference between pre-fix and post-fix code commits as the ground truth for evaluation. Among the 600 evaluated fixes, 253 (42.16%) touched only one file (so their ground truth is fully precise for our evaluation) and the remaining 347 (57.83%) touched multiple files, so only those might be affected by the threat. The major external threat to our evaluation is that we evaluated UniLoc only on Gradle-based projects with Java as the programming language. We tried to make our approach general so that it can be applied to other build management tools and programming languages. To reduce this threat, we plan to apply our approach to other popular build systems and programming languages. Last of all, our FL technique identifies faults at the file level, which could be too coarse-grained for the practical usability of the tool. However, considering the heterogeneity and differing nature of CI failures, file-level FL can still be helpful for developers. To mitigate this threat, we plan to extend the FL technique with more fine-grained localization capability, such as block-level or statement-level localization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSIONS</head><p>This section discusses the implications of our results and observations from our study on FL. It further outlines some future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Implications</head><p>Implications of the Dataset Observations. A significant amount of research effort has been devoted to CI, identifying barriers to adopting CI and optimizing the CI workflow. A prior study on CI workflow <ref type="bibr">[75]</ref> identified increased build complexity as the main reason for abandoning the CI process. Hilton et al. <ref type="bibr">[31]</ref> identified troubleshooting CI build failures as the top barrier developers face when using CI. To overcome this barrier, several research works on CI build failures have been conducted. Vassallo et al. <ref type="bibr">[68]</ref> analyzed CI build failures in 349 Java OSS projects and 418 projects from a financial organization. They categorized CI build failures into 20 categories and identified testing failures and release preparation failures as the topmost reasons for CI failure. A recent study <ref type="bibr">[26]</ref> on build breakage data identified that 33% of build breakages are due to environmental factors, 29% are due to errors in previous builds, and 9% are due to build jobs. In our study, we also categorized 700 build failures into two broad categories: i) test failures and ii) non-test failures. Among these failures, 45% are test failures and 55% are non-test failures. Apart from that, 17% of test failures require fixes in the build script, and 24% of all failures need fixes in the build script. This is important empirical evidence for tool builders concerning the heterogeneous nature of CI failures and the need for tool support covering both source code and build scripts. Implications for CI Failure Debugging. Fault localization has been a widely explored research area over the decades. IR-based fault localization techniques <ref type="bibr">[58,</ref><ref type="bibr">71,</ref><ref type="bibr">73]</ref> mostly use bug reports to find faulty code locations, while SBFL-based techniques <ref type="bibr">[11,</ref><ref type="bibr">79,</ref><ref type="bibr">80]</ref> mostly rely on test case execution results to find source code fault locations and require instrumentation support. Even with the wide advancement of FL techniques, a prior empirical study on CI <ref type="bibr">[31]</ref> identified the need for debugging assistance in the CI workflow. Moreover, prior studies <ref type="bibr">[20,</ref><ref type="bibr">69]</ref> identified that SBFL and IR-based fault localization have limited usability in real-world development workflows due to execution overhead, limited bug context information, inaccurate ranking, etc. Considering these limitations, our proposed approach aligns with the CI workflow and can localize CI faults without instrumentation overhead. Also, UniLoc can localize faults in both source code and build scripts, which is not supported by prior FL techniques. Our empirical analysis also suggests that UniLoc outperforms existing IR-based FL techniques for both test failures and non-test failures. UniLoc can thus be effective tool support for meeting the need for debugging assistance in CI. Our approach was evaluated on real-world CI failures from large-scale open-source projects, suggesting that UniLoc can be useful in real-world development scenarios. Apart from that, UniLoc is one of the first of its kind to localize faults in build scripts, which are limited in number but differ from source code in terms of abstraction, required domain knowledge, and the limited tool support for debugging them.
Considering the heterogeneity of CI failures involving source code and build scripts, and the limited support for debugging build scripts, our proposed approach of IR-based FL at file-level granularity can be useful for developers in debugging CI failures. This work can serve as a basis for further research on the usability of such tool support and on more fine-grained unified fault localization, such as at the block level, given that a widely accepted definition of blocks in build scripts can be developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Research Directions</head><p>Build Tools Other than Gradle. In this research work we only considered CI failures of projects using Gradle as their build management tool. The major Gradle-specific part of UniLoc is our build dependency module. Like Gradle, other popular build systems also support multi-module builds. Ant <ref type="bibr">[1]</ref> provides multi-module builds with dependency information with the help of Ivy <ref type="bibr">[2]</ref>. In Maven <ref type="bibr">[4]</ref>, the mechanism for handling multi-module builds is called the Reactor, through which Maven can also define project dependencies. Apart from that, Ant and Maven also produce build logs that are a rich source of information such as build status, failure information, compilation issues, etc. Moreover, Ant and Maven build failures are also available in the TravisTorrent dataset. So, our approach can be applied to other build management tools by extending our build configuration and build script analyzer, and can be evaluated on the TravisTorrent dataset. IR-based vs. Spectrum-based Fault Localization. Compared with spectrum-based fault localization, IR-based fault localization is often less precise due to the lack of runtime information. In contrast, an IR-based approach can be applied without code instrumentation. In the scenario of continuous integration, even if code instrumentation support does exist, it cannot always be turned on due to its high overhead. So, given the urgency of resolving CI stalls, an IR-based approach can be very helpful for the initial assignment of bugs to a proper developer and for the developer's initial investigation. Furthermore, as illustrated by multiple examples in this paper, CI failures often involve multiple types of files (e.g., source files, build scripts) and their dependencies. In such cases, code instrumentation of one file type may miss root causes of failures, while comprehensive code instrumentation support can be difficult to implement. Apart from that, in some CI practices, CI servers queue multiple commits into a single commit to optimize integration and testing time <ref type="bibr">[8]</ref>. In those cases, applying SBFL to smaller sub-commits can be impractical due to resource and time limitations, and an IR-based approach, illustrated by the sketch below, can be more efficient at identifying faulty files in smaller sub-commits.</p></div>
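<p>To illustrate the IR-based setting discussed above, the following Python sketch is a minimal, self-contained vector space model (our own simplification, not UniLoc's implementation): the optimized build log acts as the query, each candidate source file or build script is a document, and files are ranked by cosine similarity of TF-IDF vectors. The log tokens and file names in the example are hypothetical. UniLoc additionally reweights such scores with build dependency and change history heuristics, which this sketch omits; in practice an off-the-shelf engine such as Indri or Lucene would replace the hand-rolled scoring, and the point here is only the document/query framing.</p><p><![CDATA[
import math
from collections import Counter

def tfidf_rank(query_terms, documents):
    # documents: {file_path: list of tokens extracted from the file}
    n = len(documents)
    df = Counter(t for toks in documents.values() for t in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Query terms absent from every document get zero weight.
        return {t: tf[t] * idf.get(t, 0.0) for t in tf}

    def cosine(a, b):
        dot = sum(a[t] * b.get(t, 0.0) for t in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(query_terms)
    scores = {name: cosine(q, vectorize(toks))
              for name, toks in documents.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical failure-log tokens and candidate files:
log = ["checkstyle", "violation", "GradleException", "compileJava"]
print(tfidf_rank(log, {
    "build.gradle": ["checkstyle", "GradleException", "task", "compileJava"],
    "src/Main.java": ["class", "Main", "println"],
}))  # ranks build.gradle first
]]></p>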
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORKS</head></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Automatic Bug Localization</head><p>Automatic bug localization has been an active research area over the decades <ref type="bibr">[41]</ref> <ref type="bibr">[84]</ref>. Automatic bug localization techniques can generally be divided into two categories: i) dynamic approaches and ii) static approaches. Dynamic fault localization <ref type="bibr">[53]</ref> requires the execution of programs and test cases to identify precise fault locations. Dynamic fault localization techniques need pre-processing of the code or the underlying platform, as well as precise reproduction of the failure. Among the dynamic fault localization techniques, spectrum-based fault localization (SBFL) <ref type="bibr">[11]</ref> is the most prominent. SBFL typically computes a suspiciousness score based on the program elements executed by passing and failing test cases. Tarantula <ref type="bibr">[33]</ref> is an early research work on SBFL, and subsequent researchers have worked to improve the accuracy of the localization technique. Ochiai <ref type="bibr">[10]</ref> uses a different similarity coefficient for more accurate fault localization. Xuan and Monperrus proposed Multric <ref type="bibr">[79]</ref>, which combines learning-to-rank and fault localization techniques for more accurate localization. Savant <ref type="bibr">[16]</ref> uses likely invariants with a learning-to-rank algorithm for fault localization. Küçük et al. <ref type="bibr">[36]</ref> proposed a novel approach that combines causal inference over code and coverage information. Sarhan et al. <ref type="bibr">[60]</ref> developed a fault localization tool for Python based on existing spectrum-based approaches. Most recently, Lou et al. <ref type="bibr">[43]</ref> and Li et al. <ref type="bibr">[40]</ref> used representation learning on code dependencies and run-time code coverage to predict failure-causing statements. Meng et al. <ref type="bibr">[50]</ref> further enhanced these techniques by incorporating knowledge from historical bugs and code from other software projects. Since the software build process lacks test cases, and instrumenting build scripts in their various forms can be challenging, the dynamic localization techniques mentioned above cannot easily be applied to faults in build scripts and configuration files.</p><p>Static fault localization techniques do not require test cases or execution information. Static fault localization depends on static source code analysis <ref type="bibr">[83]</ref> <ref type="bibr">[23]</ref> or information retrieval based approaches <ref type="bibr">[69]</ref> <ref type="bibr">[84]</ref>. Lint <ref type="bibr">[32]</ref> is one of the first tools to find faults in C programs. FindBugs <ref type="bibr">[15]</ref> and PMD <ref type="bibr">[5]</ref> are prominent static code analyzers for Java source code. Many IR-based approaches <ref type="bibr">[84]</ref> <ref type="bibr">[58]</ref> have been proposed for fault localization. BugLocator <ref type="bibr">[84]</ref> performs bug localization based on a revised VSM model. Saha et al. <ref type="bibr">[58]</ref> proposed BLUiR, which considers source code structure for IR-based fault localization. The recent fault localization work Locus <ref type="bibr">[73]</ref> utilizes change history for fault localization. Since static fault localization requires neither an execution environment nor test cases, we applied an IR-based fault localization technique for build fault localization.
In our approach, we adopted build script analysis, source code ASTs, and recent change history for locating build faults from build log information.</p></div>
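<p>For contrast with the IR-based setting, the sketch below shows the two classic SBFL suspiciousness formulas mentioned above, Tarantula <ref type="bibr">[33]</ref> and Ochiai <ref type="bibr">[10]</ref>, computed from per-element coverage counts. This is a textbook illustration using the standard definitions, not any particular tool's implementation, and the example counts are hypothetical; it also makes plain why SBFL is hard to apply to build scripts, since the counts presuppose passing and failing test executions that cover each element.</p><p><![CDATA[
import math

def tarantula(failed_s, passed_s, total_failed, total_passed):
    # Fraction of failing runs covering element s vs. fraction of passing runs.
    fail_ratio = failed_s / total_failed if total_failed else 0.0
    pass_ratio = passed_s / total_passed if total_passed else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(failed_s, passed_s, total_failed, total_passed):
    # total_passed is unused by Ochiai; kept for a parallel signature.
    denom = math.sqrt(total_failed * (failed_s + passed_s))
    return failed_s / denom if denom else 0.0

# Hypothetical element covered by 4 of 5 failing and 1 of 20 passing tests:
print(tarantula(4, 1, 5, 20))  # ~0.94
print(ochiai(4, 1, 5, 20))     # 0.8
]]></p>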
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Fault Localization Supporting Automatic Program Repair</head><p>Over the last few years, Automatic Program Repair <ref type="bibr">[34]</ref> <ref type="bibr">[27]</ref> has been gaining popularity. GenProg <ref type="bibr">[27]</ref> uses Genetic Programming (GP) for automatic patch generation. RSRepair <ref type="bibr">[54]</ref> performs random search to generate a patch. To reduce searching over existing code, PAR <ref type="bibr">[34]</ref> uses predefined fix patterns to generate a patch for a new bug. Apart from search-based and template-based patch generation, machine learning and probabilistic models are also gaining popularity for automatic program repair. Prophet <ref type="bibr">[42]</ref> uses a probabilistic model to generate new patches. Van Tonder and Le Goues <ref type="bibr">[66]</ref> applied separation logic to automatic program repair, while Wen et al. <ref type="bibr">[72]</ref> used AST context information for better program repair. Automatic repair techniques are also becoming popular for automatic build repair. Recently, Macho et al. <ref type="bibr">[46]</ref> proposed BUILDMEDIC to repair Maven dependency failures. HireBuild <ref type="bibr">[30]</ref> uses a history-driven approach to repair Gradle build scripts. For all these automatic repair works, an integral part of the repair is fault localization. As discussed in Section 7.1, there are different approaches to bug localization. Previous automatic build repair works consider only the build script as their repair target, but a build failure can be caused by source code, build scripts, or both. So, apart from assisting developers in fixing build failures, build fault localization can be useful for automatic build repair research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Build Script Analysis</head><p>With the growing popularity of build management tools and automated build scripts, the analysis of build scripts is also gaining importance in software engineering research areas such as build repair, fault localization, build target decomposition, migration of build configurations, etc. For build dependency analysis, Gunter <ref type="bibr">[14]</ref> proposed a Petri-net based model. Adams et al. <ref type="bibr">[12]</ref> proposed the re(verse)-engineering framework MAKAO to maintain build consistency across change revisions; MAKAO extracts dependencies from build traces to check build consistency. Recently, Wen et al. <ref type="bibr">[74]</ref> proposed BLIMP for build change impact analysis based on the build dependency graph. Xia et al. <ref type="bibr">[78]</ref> proposed a machine learning based model to predict build co-changes, while Macho et al. <ref type="bibr">[45]</ref> proposed a model to predict build configuration changes from source code change history. McIntosh et al. <ref type="bibr">[49]</ref> performed a large study on the relation between build maintenance and build technology. SYMake <ref type="bibr">[63]</ref> uses a symbolic-evaluation based technique to detect common errors in Makefiles. To improve the software build process, McIntosh et al. <ref type="bibr">[48]</ref> studied header file hotspots. On the study of build errors, Hassan et al. <ref type="bibr">[28]</ref> performed an empirical analysis of the build failure hierarchy.</p><p>The most closely related work is the fault localization of Make build scripts proposed by Al-Kofahi et al. <ref type="bibr">[13]</ref>. Their MkFault generates suspiciousness scores for Make statements given a build error. MkFault instruments code to generate build traces, but in a CI environment, instrumenting a large code base can be costly in terms of time and resources. Apart from that, MkFault considers only the Make build script as the source of build failures, whereas our analysis of real build error fixes finds that build errors can stem from source code, build scripts, or both. We also evaluated our approach on a large dataset with different project configurations. Recently, Sharma et al. <ref type="bibr">[61]</ref> proposed an approach to identify bad smells in configuration files.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSION AND FUTURE WORK</head><p>Most existing fault localization approaches (e.g., Locus <ref type="bibr">[73]</ref>, BRTracer <ref type="bibr">[76]</ref>, BLUiR <ref type="bibr">[58]</ref>) focus on test failures or bug reports, and on source code or another single type of file. By contrast, there is much less research on fault localization and repair for build scripts. A more realistic scenario in practice is that multiple types of failures happen simultaneously and can be ascribed to multiple types of files. Our analysis of localizing CI failures with BLUiR and Locus suggests that these approaches can localize CI failures to a certain extent but are not fully optimized for the task. In this research work, we proposed the first unified fault localization approach that considers both source code and build scripts to localize the fix for continuous integration failures. Our approach works on top of a classical IR-based approach, with query and search space optimization based on build configuration and CI log analysis, and generates a suspiciousness ranking of faulty files that includes both source code and build scripts. Our evaluation on 600 real CI failures shows that UniLoc can localize faulty files with an MRR of 0.49, MAP of 0.36, and NDCG of 0.54, outperforming the baseline approaches for all types of failures.</p><p>In the future, we plan to implement file-level build dependency graphs to filter irrelevant files out of the search space. A file-based build dependency graph combined with change history might reduce the search space dramatically. Apart from that, we plan to apply more advanced IR-based search approaches to achieve better rankings. Our experimental results show that query optimization is still a key challenge in applying IR-based FL approaches to CI scenarios, so we plan to develop more advanced query optimization techniques. Moreover, we plan to expand our fault localization approach to the block level of source code and build scripts, to better assist developers and automatic repair approaches. Finally, beyond source code and build scripts, other types of files can be involved in software repair, especially in other scenarios. For example, for fault localization in web applications, we need to consider HTML files, CSS files, client-side JavaScript files, and server-side source code. We plan to adapt our technique to more scenarios with heterogeneous bug locations.</p></div>
		</body>
		</text>
</TEI>
