skip to main content


Title: Generating and Analyzing Program Call Graphs using Ontology
Call graph or caller-callee relationships have been used for various kinds of static program analysis, performance analysis and profiling, and for program safety or security analysis such as detecting anomalies of program execution or code injection attacks. However, different tools generate call graphs in different formats, which prevents efficient reuse of call graph results. In this paper, we present an approach of using ontology and resource description framework (RDF) to create knowledge graphs for specifying call graphs to facilitate the construction of full-fledged and complex call graphs of computer programs, realizing more interoperable and scalable program analyses than conventional approaches. We create a formal ontology-based specification of call graph information to capture concepts and properties of both static and dynamic call graphs so different tools can collaboratively contribute to more comprehensive analysis results. Our experiments show that ontology enables merging of call graphs generated from different tools and flexible queries using a standard query interface.  more » « less
Award ID(s):
2015254
NSF-PAR ID:
10394963
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2022 IEEE/ACM Workshop on Programming and Performance Visualization Tools (ProTools)
Page Range / eLocation ID:
11 to 20
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Call graphs have many applications in software engineering, including bug-finding, security analysis, and code navigation in IDEs. However, the construction of call graphs requires significant investment in program analysis infrastructure. An increasing number of programming languages compile to the Java Virtual Machine (JVM), and program analysis frameworks such as WALA and SOOT support a broad range of program analysis algorithms by analyzing JVM bytecode. This approach has been shown to work well when applied to bytecode produced from Java code. In this paper, we show that it also works well for diverse other JVM-hosted languages: dynamically-typed functional Scheme, statically-typed object-oriented Scala, and polymorphic functional OCaml. Effectively, we get call graph construction for these languages for free, using existing analysis infrastructure for Java, with only minor challenges to soundness. This, in turn, suggests that bytecode-based analysis could serve as an implementation vehicle for bug-finding, security analysis, and IDE features for these languages. We present qualitative and quantitative analyses of the soundness and precision of call graphs constructed from JVM bytecodes for these languages, and also for Groovy, Clojure, Python, and Ruby. However, we also show that implementation details matter greatly. In particular, the JVM-hosted implementations of Groovy, Clojure, Python, and Ruby produce very unsound call graphs, due to the pervasive use of reflection, invokedynamic instructions, and run-time code generation. Interestingly, the dynamic translation schemes employed by these languages, which result in unsound static call graphs, tend to be correlated with poor performance at run time. 
    more » « less
  2. null (Ed.)
    Android applications rely heavily on strings for sensitive operations like reflection, access to system resources, URL connections, database access, among others. Thus, insight into application behavior can be gained through not only an analysis of what strings an application creates but also the structure of the computation used to create theses strings, and in what manner are these strings used. In this paper we introduce a static analysis of Android applications to discover strings, how they are created, and their usage. The output of our static analysis contains all of this information in the form of a graph which we call a string computation. We leverage the results to classify individual application behavior with respect to malicious or benign intent. Unlike previous work that has focused only on extraction of string values, our approach leverages the structure of the computation used to generate string values as features to perform classification of Android applications. That is, we use none of the static analysis computed string values, rather using only the graph structures of created strings to do classification of an arbitrary Android application as malware or benign. Our results show that leveraging string computation structures as features can yield precision and recall rates as high as 97% on modern malware. We also provide baseline results against other malware detection tools and techniques to classify the same corpus of applications. 
    more » « less
  3. Knowledge graphs gained popularity in recent years and have been useful for concept visualization and contextual information retrieval in various applications. However, constructing a knowledge graph by scraping long and complex unstructured texts for a new domain in the absence of a well-defined ontology or an existing labeled entity-relation dataset is difficult. Domains such as cybersecurity education can harness knowledge graphs to create a student-focused interactive and learning environment to teach cybersecurity. Learning cybersecurity involves gaining the knowledge of different attack and defense techniques, system setup and solving multi-facet complex real-world challenges that demand adaptive learning strategies and cognitive engagement. However, there are no standard datasets for the cybersecurity education domain. In this research work, we present a bottom-up approach to curate entity-relation pairs and construct knowledge graphs and question-answering models for cybersecurity education. To evaluate the impact of our new learning paradigm, we conducted surveys and interviews with students after each project to find the usefulness of bot and the knowledge graphs. Our results show that students found these tools informative for learning the core concepts and they used knowledge graphs as a visual reference to cross check the progress that helped them complete the project tasks. 
    more » « less
  4. null (Ed.)
    Many program analyses need to reason about pairs of matching actions, such as call/return, lock/unlock, or set-field/get-field. The family of Dyck languages { D k }, where D k has k kinds of parenthesis pairs, can be used to model matching actions as balanced parentheses. Consequently, many program-analysis problems can be formulated as Dyck-reachability problems on edge-labeled digraphs. Interleaved Dyck-reachability (InterDyck-reachability), denoted by D k ⊙ D k -reachability, is a natural extension of Dyck-reachability that allows one to formulate program-analysis problems that involve multiple kinds of matching-action pairs. Unfortunately, the general InterDyck-reachability problem is undecidable. In this paper, we study variants of InterDyck-reachability on bidirected graphs , where for each edge ⟨ p , q ⟩ labeled by an open parenthesis ”( a ”, there is an edge ⟨ q , p ⟩ labeled by the corresponding close parenthesis ”) a ”, and vice versa . Language-reachability on a bidirected graph has proven to be useful both (1) in its own right, as a way to formalize many program-analysis problems, such as pointer analysis, and (2) as a relaxation method that uses a fast algorithm to over-approximate language-reachability on a directed graph. However, unlike its directed counterpart, the complexity of bidirected InterDyck-reachability still remains open. We establish the first decidable variant (i.e., D 1 ⊙ D 1 -reachability) of bidirected InterDyck-reachability. In D 1 ⊙ D 1 -reachability, each of the two Dyck languages is restricted to have only a single kind of parenthesis pair. In particular, we show that the bidirected D 1 ⊙ D 1 problem is in PTIME. We also show that when one extends each Dyck language to involve k different kinds of parentheses (i.e., D k ⊙ D k -reachability with k ≥ 2), the problem is NP-hard (and therefore much harder). We have implemented the polynomial-time algorithm for bidirected D 1 ⊙ D 1 -reachability. D k ⊙ D k -reachability provides a new over-approximation method for bidirected D k ⊙ D k -reachability in the sense that D k ⊙ D k -reachability can first be relaxed to bidirected D 1 ⊙ D 1 -reachability, and then the resulting bidirected D 1 ⊙ D 1 -reachability problem is solved precisely. We compare this D 1 ⊙ D 1 -reachability-based approach against another known over-approximating D k ⊙ D k -reachability algorithm. Surprisingly, we found that the over-approximation approach based on bidirected D 1 ⊙ D 1 -reachability computes more precise solutions, even though the D 1 ⊙ D 1 formalism is inherently less expressive than the D k ⊙ D k formalism. 
    more » « less
  5. Researchers have reported that static analysis tools rarely achieve a false-positive rate that would make them attractive to developers. We overcome this problem by a technique that leads to reporting fewer bugs but also much fewer false positives. Our technique prunes the static call graph that sits at the core of many static analyses. Specifically, static call-graph construction proceeds as usual, after which a call-graph pruner removes many false-positive edges but few true edges. The challenge is to strike a balance between being aggressive in removing false-positive edges but not so aggressive that no true edges remain. We achieve this goal by automatically producing a call-graph pruner through an automatic, ahead-of-time learning process. We added such a call-graph pruner to a software tool for null-pointer analysis and found that the false-positive rate decreased from 73% to 23%. This improvement makes the tool more useful to developers. 
    more » « less