NSF PAR Search | NSF Public Access Repository


Systemizing Interprocedural Static Analysis of Large-scale Systems Code with Graspan

https://doi.org/10.1145/3466820

Zuo, Zhiqiang; Wang, Kai; Hussain, Aftab; Sani, Ardalan Amiri; Zhang, Yiyu; Lu, Shenming; Dou, Wensheng; Wang, Linzhang; Li, Xuandong; Wang, Chenxi; et al (July 2021, ACM Transactions on Computer Systems)
Static analysis has been used for more than a decade to find bugs in systems such as Linux. Most of the existing static analyses developed for these systems are simple checkers that find bugs based on pattern matching. Despite the presence of many sophisticated interprocedural analyses, few of them have been employed to improve checkers for systems code due to their complex implementations and poor scalability. In this article, we revisit the scalability problem of interprocedural static analysis from a “Big Data” perspective. That is, we turn sophisticated code analysis into Big Data analytics and leverage novel data processing techniques to solve this traditional programming language problem. We propose Graspan, a disk-based parallel graph system that uses an edge-pair centric computation model to compute dynamic transitive closures on very large program graphs. We develop two backends for Graspan, namely, Graspan-C running on CPUs and Graspan-G on GPUs, and present their designs in the article. Graspan-C can analyze large-scale systems code on any commodity PC, while, if GPUs are available, Graspan-G can be readily used to achieve orders of magnitude speedup by harnessing a GPU’s massive parallelism. We have implemented fully context-sensitive pointer/alias and dataflow analyses on Graspan. An evaluation of these analyses on large codebases written in multiple languages, such as Linux and Apache Hadoop, demonstrates that their Graspan implementations are language-independent, scale to millions of lines of code, and are much simpler than their original implementations. Moreover, we show that these analyses can be used to uncover many real-world bugs in large-scale systems code.
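As a rough illustration of what an edge-pair centric transitive closure computation might look like, the following is a minimal in-memory, single-threaded sketch. It is not Graspan's implementation, which the abstract describes as disk-based and parallel. The function name, the edge representation, and the production table mapping a pair of adjacent edge labels to new labels are assumptions made for this example only.

```python
from collections import defaultdict, deque


def dynamic_transitive_closure(edges, productions):
    """Hypothetical sketch: grow a labeled edge set to a fixpoint.

    edges:       iterable of (src, label, dst) tuples.
    productions: dict mapping (label_a, label_b) -> set of new labels;
                 an assumed rule table saying that consecutive edges
                 u -a-> v and v -b-> w induce an edge u -c-> w.
    """
    closure = set(edges)
    out_edges = defaultdict(set)  # node -> {(label, dst)}
    in_edges = defaultdict(set)   # node -> {(src, label)}
    for s, l, d in closure:
        out_edges[s].add((l, d))
        in_edges[d].add((s, l))
    worklist = deque(closure)

    def add(edge):
        # Record a newly derived edge and schedule it for pairing.
        if edge not in closure:
            closure.add(edge)
            s, l, d = edge
            out_edges[s].add((l, d))
            in_edges[d].add((s, l))
            worklist.append(edge)

    while worklist:
        s, l, d = worklist.popleft()
        # Pair this edge with every edge leaving its destination.
        for l2, d2 in list(out_edges[d]):
            for c in productions.get((l, l2), ()):
                add((s, c, d2))
        # Pair every edge entering its source with this edge.
        for s0, l0 in list(in_edges[s]):
            for c in productions.get((l0, l), ()):
                add((s0, c, d))
    return closure


# Usage: plain reachability, with the single assumed rule R ::= R R.
chain = [("a", "R", "b"), ("b", "R", "c"), ("c", "R", "d")]
reach = dynamic_transitive_closure(chain, {("R", "R"): {"R"}})
```

The closure is called "dynamic" here because edges derived during the computation are themselves paired with other edges until no new edge appears.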
Full Text Available
Systemizing Interprocedural Static Analysis of Large-Scale Systems Code with Graspan

Zuo, Zhiqiang; Wang, Kai; Hussain, Aftab; Amiri Sani, Ardalan; Zhang, Yiyu; Lu, Shenming; Dou, Wensheng; Wang, Linzhang; Li, Xuandong; Wang, Chenxi; et al (January 2021, ACM Transactions on Computer Systems)
Full Text Available
DistStream: An Order-Aware Distributed Framework for Online-Offline Stream Clustering Algorithms

Xu, Lijie; Ye, Xingtong; Kang, Kai; Guo, Tian; Dou, Wensheng; Wang, Wei; Wei, Jun (January 2020, 40th IEEE International Conference on Distributed Computing Systems (ICDCS'20))

Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed to run on a single machine. With this sequential update model, these algorithms cannot be efficiently parallelized and fail to deliver the high throughput that stream clustering requires. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, so that it can reflect recent changes in dynamically changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and clustering quality comparable (99%) to that of their single-machine counterparts.
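The idea of an order-preserving mini-batch update can be sketched as follows. This is a simplified illustration under stated assumptions, not DistStream's algorithm or its Spark Streaming implementation: clusters are one-dimensional centers, the assignment step runs in a local thread pool against a snapshot of the model, and the names `minibatch_update`, `assign_partition`, and the `rate` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(centers, x):
    # Index of the center closest to value x.
    return min(range(len(centers)), key=lambda i: abs(centers[i] - x))


def assign_partition(centers, records):
    # Parallelizable step: assign each (timestamp, value) record to
    # its nearest center, using a read-only snapshot of the model.
    return [(ts, x, nearest(centers, x)) for ts, x in records]


def minibatch_update(centers, batch, num_partitions=2, rate=0.5):
    """Hypothetical sketch of an order-aware mini-batch update."""
    # Split the batch into partitions and assign them concurrently.
    parts = [batch[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        results = pool.map(lambda p: assign_partition(centers, p), parts)
        assigned = [r for part in results for r in part]
    # Restore arrival order before touching the model, so that later
    # records influence the centers after earlier ones.
    assigned.sort(key=lambda r: r[0])
    new_centers = list(centers)
    for _, x, i in assigned:
        new_centers[i] += rate * (x - new_centers[i])
    return new_centers


# Usage: the batch arrives out of order across partitions.
updated = minibatch_update([0.0, 10.0], [(3, 2.0), (1, 1.0), (2, 9.0)])
```

Because the updates are applied in timestamp order, the result does not depend on how the records were split across partitions.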
Full Text Available
An Experimental Evaluation of Garbage Collectors on Big Data Applications

Xu, Lijie; Guo, Tian; Dou, Wensheng; Wang, Wei; Wei, Jun (January 2019, The 45th International Conference on Very Large Data Bases (VLDB'19))

Popular big data frameworks, ranging from Hadoop MapReduce to Spark, rely on garbage-collected languages, such as Java and Scala. Big data applications are especially sensitive to the effectiveness of garbage collection (GC), because they usually process a large volume of data objects that lead to heavy GC overhead. The lack of an in-depth understanding of GC performance has impeded performance improvement in big data applications. In this paper, we conduct the first comprehensive evaluation of three popular garbage collectors, i.e., Parallel, CMS, and G1, using four representative Spark applications. By thoroughly investigating the correlation between these big data applications’ memory usage patterns and the collectors’ GC patterns, we obtain many findings about GC inefficiencies. We further propose empirical guidelines for application developers, and insightful optimization strategies for designing big-data-friendly garbage collectors.
Full Text Available
