

Title: Run-Time Prevention of Software Integration Failures of Machine Learning APIs

Due to under-specified interfaces, developers face challenges in correctly integrating machine learning (ML) APIs into software. Even when the ML API and the software are well designed on their own, the resulting application misbehaves when the API output is incompatible with the software. It is desirable to have an adapter that converts ML API output at run time to better fit the software's needs and prevent integration failures. In this paper, we conduct an empirical study to understand ML API integration problems in real-world applications. Guided by this study, we present SmartGear, a tool that automatically detects and converts mismatching or incorrect ML API output at run time, serving as a middle layer between the ML API and the software. Our evaluation on a variety of open-source applications shows that SmartGear detects 70% of incompatible API outputs and prevents 67% of potential integration failures, outperforming alternative solutions.
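To make the adapter idea concrete, here is a minimal illustrative sketch (not SmartGear itself): a wrapper around a hypothetical image-classification API call that normalizes label spellings and score types and maps unexpected labels to a value the application already handles, so mismatching output never reaches application logic unconverted. The `classify_image` callable, label set, and synonym map are assumptions for illustration.

```python
# Minimal sketch of a run-time output adapter (illustration only, not SmartGear).
# `classify_image`, EXPECTED_LABELS, and SYNONYMS are hypothetical placeholders.

EXPECTED_LABELS = {"cat", "dog", "other"}
SYNONYMS = {"kitten": "cat", "puppy": "dog"}  # hypothetical label mapping

def adapted_classify(classify_image, image_path):
    """Call the ML API, then convert its output to the application's expected format."""
    raw = classify_image(image_path)            # e.g. [("Kitten", 0.93), ("Sofa", 0.41)]
    converted = []
    for label, score in raw:
        label = SYNONYMS.get(label.lower(), label.lower())  # normalize spelling
        if label not in EXPECTED_LABELS:
            label = "other"                      # keep unhandled labels from crashing callers
        converted.append((label, float(score)))  # normalize score type
    # sort by confidence so callers can safely take the top result
    return sorted(converted, key=lambda r: r[1], reverse=True)
```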

 
Award ID(s):
1956180
PAR ID:
10542333
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
7
Issue:
OOPSLA2
ISSN:
2475-1421
Page Range / eLocation ID:
264 to 291
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. ML APIs have greatly relieved application developers of the burden of designing and training their own neural network models: classifying objects in an image can now be as simple as one line of Python code that calls an API. However, these APIs offer the same pre-trained models regardless of how their output is used by different applications. This can be suboptimal, as not all ML inference errors cause application failures, and the distinction between inference errors that do and do not cause failures varies greatly across applications. To tackle this problem, we first study 77 real-world applications, which collectively use six ML APIs from two providers, to reveal common patterns of how ML API output affects applications' decision processes. Inspired by these findings, we propose ChameleonAPI, an optimization framework for ML APIs that takes effect without changing the application source code. ChameleonAPI provides application developers with a parser that automatically analyzes the application to produce an abstract of its decision process, which is then used to devise an application-specific loss function that only penalizes API output errors critical to the application. ChameleonAPI uses this loss function to efficiently train a neural network model customized for each application and deploys it to serve API invocations from the respective application via the existing interface. Compared to a baseline that selects the best-of-all commercial ML API, we show that ChameleonAPI reduces incorrect application decisions by 43%.
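    As a rough, hypothetical illustration of an application-specific loss (not ChameleonAPI's actual formulation): if the application's decision process only branches on coarse groups of labels, probability mass placed on any label in the correct branch need not be penalized. The label-to-branch mapping below is invented for the example.

```python
# Hypothetical decision-aware loss: only penalize probability mass that falls
# outside the decision branch of the true label (not ChameleonAPI's real loss).
import numpy as np

DECISION_OF_LABEL = {            # invented application decision process
    "cat": "pet", "dog": "pet",
    "car": "vehicle", "truck": "vehicle",
}
LABELS = list(DECISION_OF_LABEL)

def decision_aware_loss(probs, true_label):
    """Cross-entropy over decision branches rather than raw labels."""
    true_branch = DECISION_OF_LABEL[true_label]
    mass_on_branch = sum(p for lbl, p in zip(LABELS, probs)
                         if DECISION_OF_LABEL[lbl] == true_branch)
    return -np.log(max(mass_on_branch, 1e-12))

# Confusing "cat" with "dog" costs nothing, since both lead to the "pet" branch:
print(decision_aware_loss([0.10, 0.80, 0.05, 0.05], "cat"))  # ~= -log(0.9), about 0.105
```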
  2. Feldt, Robert; Zimmermann, Thomas; Basili, Victor R.; Briand, Lionel C. (Eds.)
    Recent work has shown that Machine Learning (ML) programs are error-prone and has called for contracts for ML code. Contracts, as in the design-by-contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at earlier stages of the ML pipeline. We describe an empirical study of Stack Overflow posts about the four most-discussed ML libraries: TensorFlow, Scikit-learn, Keras, and PyTorch. For these libraries, our study extracted 413 informal (English) API specifications. We used these specifications to answer the following questions. What are the root causes and effects of ML contract violations? Are there common patterns of ML contract violations? When does understanding ML contracts require an advanced level of ML software expertise? Could checking contracts at the API level help detect violations in early ML pipeline stages? Our key findings are that the most commonly needed contracts for ML APIs either check constraints on a single argument of an API or constraints on the order of API calls. The software engineering community could employ existing contract mining approaches to mine these contracts and promote increased understanding of ML APIs. We also note a need to combine behavioral and temporal contract mining approaches. We report on categories of required ML contracts, which may help designers of contract languages.
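    As a hedged sketch of the two most commonly needed contract kinds reported above, the code below checks a single-argument constraint with a decorator and an API call-order constraint with a state flag. The `Model` class and its methods are hypothetical, not taken from any of the studied libraries.

```python
# Sketch of single-argument and call-order contracts (hypothetical Model class).

def requires(predicate, message):
    """Single-argument contract: check a precondition before the call."""
    def decorator(fn):
        def wrapper(self, arg):
            if not predicate(arg):
                raise ValueError(f"Contract violation in {fn.__name__}: {message}")
            return fn(self, arg)
        return wrapper
    return decorator

class Model:
    def __init__(self):
        self._compiled = False

    @requires(lambda lr: lr > 0, "learning rate must be positive")
    def compile(self, lr):
        self._compiled = True
        self.lr = lr

    def fit(self, data):
        # Temporal (call-order) contract: fit() is only valid after compile().
        if not self._compiled:
            raise RuntimeError("Contract violation: call compile() before fit()")
        print(f"training on {len(data)} samples with lr={self.lr}")

m = Model()
m.compile(0.01)   # compile(-1) would be rejected by the argument contract
m.fit([1, 2, 3])  # calling fit() before compile() would be caught early
```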
  3. This artifact contains the source code for FlakeRake, a tool for automatically reproducing timing-dependent flaky-test failures, along with the raw and processed results produced in the evaluation of FlakeRake.

    Contents:

    Timing-related APIs at which FlakeRake considers adding sleeps: timing-related-apis

    Anonymized code for FlakeRake (not runnable in its anonymized state, but included for reference; we will publicly release the non-anonymized code under an open-source license pending double-blind review): flakerake.tgz

    Failure messages extracted from the FlakeFlagger dataset: 10k_reruns_failures_by_test.csv.gz

    Output from running isolated reruns on each flaky test in the FlakeFlagger dataset: 10k_isolated_reruns_all_results.csv.gz (all test results summarized into a CSV), 10k_isolated_reruns_failures_by_test.csv.gz (CSV containing only the test failures, including failure messages), 10k_isolated_reruns_raw_results.tgz (all raw rerun results, including the XML files output by Maven)

    Output from running the FlakeFlagger replication study (non-isolated 10k reruns): flakeFlaggerReplResults.csv.gz (all test results summarized into a CSV), 10k_reruns_failures_by_test.csv.gz (CSV containing only the failures, including failure messages), flakeFlaggerRepl_raw_results.tgz (all raw rerun results, including the XML files output by Maven; this archive is markedly larger than the 10k isolated rerun results because this experiment ran *all* tests, whereas the 10k isolated rerun experiment re-ran only the tests known to be flaky from the FlakeFlagger dataset)

    Output from running FlakeRake on each flaky test in the FlakeFlagger dataset: results-bis.tgz (bisection mode; see the sketch after this list) and results-obo.tgz (one-by-one mode)

    Scripts used to execute FlakeRake on an HPC cluster: execution-scripts.tgz

    Scripts used to execute the rerun experiments on an HPC cluster: flakeFlaggerReplScripts.tgz

    Scripts used to parse the "raw" Maven test result XML files in this artifact into the CSV files contained in this artifact: parseSurefireXMLs.tgz

    Output from running FlakeRake in "reproduction" mode, attempting to reproduce each of the failures that matched the FlakeFlagger dataset (collected for bisection mode only): results-repro-bis.tgz

    Analysis of the timing-dependent API calls in the failure-inducing configurations that matched FlakeFlagger failures: bis-sleepyline.cause-to-matched-fail-configs-found.csv
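    The sketch below illustrates only the general bisection idea referenced in the list above; it is not FlakeRake's implementation, it assumes a single sleep-injection site can reproduce the failure, and `run_test_with_sleeps` is a hypothetical hook that reruns the test with sleeps injected at the given timing-related call sites and reports whether the test failed.

```python
# Simplified bisection over candidate sleep-injection sites (illustration only).

def bisect_failure_inducing_sites(candidate_sites, run_test_with_sleeps):
    """Narrow down which timing-related call sites, when delayed, reproduce the failure."""
    sites = list(candidate_sites)
    if not run_test_with_sleeps(sites):
        return None                      # failure not reproducible with sleeps at all
    while len(sites) > 1:
        half = len(sites) // 2
        left, right = sites[:half], sites[half:]
        if run_test_with_sleeps(left):
            sites = left                 # a culprit lies in the left half
        elif run_test_with_sleeps(right):
            sites = right                # a culprit lies in the right half
        else:
            return sites                 # failure needs sites from both halves; stop narrowing
    return sites                         # a single failure-inducing site
```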
  4. Software organizations are increasingly incorporating machine learning (ML) into their product offerings, driving a need for new data management tools. Many of these tools facilitate the initial development of ML applications, but sustaining these applications post-deployment is difficult due to the lack of real-time feedback (i.e., labels) for predictions and the silent failures that can occur at any component of the ML pipeline (e.g., data distribution shift or anomalous features). We propose a new type of data management system that offers end-to-end observability, or visibility into complex system behavior, for deployed ML pipelines through assisted (1) detection, (2) diagnosis, and (3) reaction to ML-related bugs. We describe new research challenges and suggest preliminary solution ideas for all three aspects. Finally, we introduce an example architecture for a "bolt-on" ML observability system, i.e., one that wraps around existing tools in the stack.
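    As a minimal sketch of the "bolt-on" idea (not the system proposed above): wrap an existing predict function, log each prediction, and flag a possible distribution shift when the recent label distribution drifts from a reference window. The drift metric and threshold here are illustrative assumptions.

```python
# Sketch of a bolt-on observability wrapper (illustrative; metric and threshold assumed).
from collections import Counter, deque

class BoltOnObserver:
    def __init__(self, predict_fn, reference_labels, window=1000, drift_threshold=0.2):
        self.predict_fn = predict_fn
        self.reference = Counter(reference_labels)  # label counts from validation data
        self.recent = deque(maxlen=window)          # rolling log of live predictions
        self.drift_threshold = drift_threshold

    def predict(self, features):
        label = self.predict_fn(features)           # pass through the existing model
        self.recent.append(label)                   # record for later diagnosis
        if len(self.recent) == self.recent.maxlen and self._drift() > self.drift_threshold:
            print("WARNING: prediction distribution shift detected")  # react
        return label

    def _drift(self):
        """Total-variation distance between recent and reference label frequencies."""
        recent = Counter(self.recent)
        labels = set(recent) | set(self.reference)
        n_recent = sum(recent.values()) or 1
        n_ref = sum(self.reference.values()) or 1
        return 0.5 * sum(abs(recent[l] / n_recent - self.reference[l] / n_ref)
                         for l in labels)
```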
  5. Software system security receives considerable attention from industry for its crucial role in protecting private resources. Typically, users access a system's services via an application programming interface (API), which must be protected against unauthorized access. One way developers address this challenge is role-based access control, where each entry point is associated with a set of user roles. However, entry points may use the same methods from lower layers of the application with inconsistent permissions. Currently, developers rely on integration or penetration testing to uncover authorization inconsistencies, which demands considerable effort. This paper proposes an automated method to test role-based access control in enterprise applications. Our method detects inconsistencies within the application using the authorization role definitions associated with the API entry points. By analyzing the method calls and entity accesses in subsequent layers, inconsistencies across the entire application can be extracted. We demonstrate our solution in a case study and discuss our preliminary results.
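    A minimal sketch of the inconsistency check described above, under the assumption that role definitions and the (transitive) call graph have already been extracted: if entry points guarded by different role sets reach the same lower-layer method, that method's effective permissions are inconsistent. The entry points, roles, and call graph below are hypothetical.

```python
# Flag lower-layer methods reachable from entry points with differing role sets
# (hypothetical entry points, roles, and call graph).

ENTRY_POINT_ROLES = {
    "/admin/users": {"ADMIN"},
    "/api/profile": {"ADMIN", "USER"},
}

CALL_GRAPH = {  # entry point -> lower-layer methods reached (transitively)
    "/admin/users": {"UserService.delete", "UserRepository.findAll"},
    "/api/profile": {"UserRepository.findAll"},
}

def find_inconsistencies(entry_roles, call_graph):
    """Return methods whose callers are guarded by more than one distinct role set."""
    reached_by = {}
    for entry, methods in call_graph.items():
        for method in methods:
            reached_by.setdefault(method, set()).add(frozenset(entry_roles[entry]))
    return {m: role_sets for m, role_sets in reached_by.items() if len(role_sets) > 1}

# Flags UserRepository.findAll: reachable under {'ADMIN'} and under {'ADMIN', 'USER'}.
print(find_inconsistencies(ENTRY_POINT_ROLES, CALL_GRAPH))
```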