<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Faster Depth-First Subgraph Matching on GPUs</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>04/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10498838</idno>
					<idno type="doi"></idno>
					<title level='j'>40th IEEE International Conference on Data Engineering (ICDE)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Lyuheng Yuan</author><author>Da Yan</author><author>Jiao Han</author><author>Akhlaque Ahmad</author><author>Yang Zhou</author><author>Zhe Jiang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Subgraph search problems such as maximal clique enumeration and subgraph matching generate a search-space tree which is traversed in depth-first manner by serial backtracking algorithms that are recursive. Since Jenkins et al. reported the backtracking paradigm to be sub-optimal for GPU acceleration, breadth-first traversal of the search-space tree is widely adopted by GPU algorithms. However, they produce a lot of intermediate subgraphs that exhaust the GPU device memory. Recent works revive the depth-first backtracking paradigm for GPU acceleration, where each warp is a basic processing unit with its own stack in device memory for subgraph backtracking. However, they adopt complicated methods for load balancing that incur a lot of overheads. They also use hardcoded fixed space for stacks that is determined ad-hoc and may lead to inaccuracy when the allocated space is insufficient.In this paper, we use subgraph matching as a case study to propose novel depth-first GPU solutions to address the above problems. Our approach, called T-DFS, decomposes the compu- tation into independent tasks that process search-space subtrees, which are managed by an efficient lock-free circular task queue. Tasks are distributed to different warps for parallel processing, and a novel timeout mechanism is used to eliminate straggler tasks to ensure load balancing. We also support flexible and fine- grained dynamic memory allocation for stack spaces to avoid the stack space allocation pitfalls of existing works. Extensive experi- ments on real graphs show that T-DFS significantly outperforms existing depth-first GPU solutions for the subgraph matching application.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body><div/></body></text>
</TEI>
