<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>RayJoin: Fast and Precise Spatial Join</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>05/30/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10514505</idno>
					<idno type="doi">10.1145/3650200.3656610</idno>
					<title level='j'>https://doi.org/10.1145/3650200.3656610</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Liang Geng</author><author>Rubao Lee</author><author>Xiaodong Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Real-time spatial data analysis is a fundamental requirement for many critical applications in this digital era. However, such a requirement in practice is often hindered by the low performance of spatial join queries on conventional parallel systems. Specifically, all existing spatial join methods, including both classical plane sweeping algorithms and various grid-or tree-based index-assisted algorithms, have unacceptably long execution times and thus fail to deliver real-time performance. In this paper, we present RayJoin, a new and effective approach that utilizes the ray tracing hardware in modern GPUs (e.g., NVIDIA RT Cores) as accelerators to overcome the bottlenecks in spatial join processing and push the performance to an unprecedented level. Specifically, RayJoin consists of a highperformance and high-precision spatial join framework that accelerates two vital spatial join queries: line segment intersection (LSI) and point-in-polygon test (PIP). Besides these ray tracing-backed algorithms, RayJoin also contains new solutions to address two challenging technical issues: (1) how to meet the high precision requirement of spatial data analysis with the insufficient precision support by the underlying hardware, and (2) how to reduce the high buildup cost of the hardware-accelerated index, namely Bounding Volume Hierarchy (BVH), while maintaining optimal query performance. Our evaluation results show that RayJoin achieves speedups from 3.0x to 28.3x over any existing highly optimized methods in high precision. To the best of our knowledge, RayJoin stands as the sole solution capable of meeting the real-time requirements of diverse workloads, taking under 460ms to join millions of polygons.
CCS CONCEPTS• Information systems → Join algorithms; Spatial-temporal systems; • Computing methodologies → Ray tracing.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Spatial join queries are pivotal in Geographic Information Science (GIS) and geo-databases. A spatial join query aims to identify pairs of geometries from multiple datasets that satisfy a given predicate, such as "intersect" or "contain. " <ref type="bibr">[7,</ref><ref type="bibr">32,</ref><ref type="bibr">45]</ref>. For example, it can determine the locations of bridges by querying intersections between a road network and a hydrographic map. However, the increasing volume of GIS data collected from various sources, including satellites and mobile devices, poses challenges for timely data analytics <ref type="bibr">[4,</ref><ref type="bibr">10,</ref><ref type="bibr">37,</ref><ref type="bibr">51]</ref>. This requirement is particularly acute in domains like interactive urban planning, emergency management, and risk assessment <ref type="bibr">[58,</ref><ref type="bibr">72,</ref><ref type="bibr">79]</ref>.</p><p>Various techniques have been explored to address the challenges of large-scale data and low-latency processing. The plane-sweep algorithm is favored for its low complexity <ref type="bibr">[12,</ref><ref type="bibr">47,</ref><ref type="bibr">66]</ref> but suffers from limited parallel processing capabilities <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">41]</ref>. Rasterizationbased methods offer parallelism but require intensive preprocessing <ref type="bibr">[16,</ref><ref type="bibr">65,</ref><ref type="bibr">79]</ref>. Learned spatial indexes, though efficient in query performance and storage, struggle with high training costs and accuracy concerns <ref type="bibr">[18,</ref><ref type="bibr">29,</ref><ref type="bibr">52,</ref><ref type="bibr">53,</ref><ref type="bibr">59,</ref><ref type="bibr">82]</ref>. Consequently, GIS researchers and practitioners often resort to conventional indexing techniques, such as grids and trees, which offer numerous advantages <ref type="bibr">[4, 6, 10, 17, 62-64, 71, 73-77, 80]</ref>.</p><p>However, the conventional spatial indexes mentioned above have performance limitations from achieving real-time processing <ref type="bibr">[61]</ref>. While tree-based methods offer a unique advantage in logarithmic time complexity, they suffer from inherent shortcomings, including branches, irregular memory accesses, and memory space overhead for storing pointers. These drawbacks notably undermine the efficiency of spatial join operations, particularly on GPUs. In contrast, grids convert spatial queries into hardware-friendly scan operations within a limited search space, thereby maximizing memory access efficiency. However, skewed data distributions may result in imbalanced workloads that hinder the effectiveness of parallel processing <ref type="bibr">[43,</ref><ref type="bibr">44,</ref><ref type="bibr">69]</ref>.</p><p>Over the years, algorithmic improvements in conventional indexing methods seriously mismatch with the execution models of general-purpose computer architecture, including CPUs and GPUs, leading to persistent performance bottlenecks and associated tradeoffs in designing databases and analytical systems <ref type="bibr">[13,</ref><ref type="bibr">31,</ref><ref type="bibr">78,</ref><ref type="bibr">81]</ref>. In fact, our evaluation shows that none of the existing spatial join methods satisfy the real-time processing requirement ( &#167;5.7), defined as the completion time within a second. Recently, the GPU advancement of dedicated ray-tracing units, such as NVIDIA's RT Cores, opens a new opportunity to accelerate spatial join queries through hardware-accelerated Bounding Volume Hierarchy (BVH) tree traversal. Moreover, the processing capability of RT Cores is doubled per generation, making RT-based methods automatically benefit from the advancing of this new hardware <ref type="bibr">[48,</ref><ref type="bibr">49]</ref>. Utilizing the technology advancement, we have designed and implemented a new and effective acceleration engine for spatial joins with RT Cores, which is called RayJoin, to address the limitations of all existing approaches.</p><p>Leveraging RT Cores for spatial join queries presents three nontrivial challenges for us to address. First, to harness the dedicated hardware effectively, it is crucial to devise an efficient method that formulates spatial join queries as ray tracing problems. In RayJoin, we showcase two critical spatial join queries, line segment intersection (LSI) and point-in-polygon (PIP) test. We transform these two queries into RT-friendly algorithms using the &#119860;&#119899;&#119910;&#119867;&#119894;&#119905; and &#119862;&#119897;&#119900;&#119904;&#119890;&#119904;&#119905;&#119867;&#119894;&#119905; shaders supported by RT Cores. Specifically, we have leveraged these queries to develop a polygon overlay application. Second, the precision in RT Cores is limited to single-precision floating-point (FP32) for high-performance rendering. However, this limitation poses a significant challenge in meeting the requirements of GIS applications. To address this, we have proposed a conservative representation that stores polygons in low precision while still delivering exact query results. In computer arithmetic, a conservative method ensures the precision of computation results, even at a cost of performance. RayJoin achieves both high precision and high performance. Finally, the BVH construction cost, in both time and space, is another bottleneck to be resolved to achieve high performance. We have noticed that the construction time is often a magnitude higher than the query time. In response, we have introduced a grouping technique that greatly reduces the construction cost by reducing the number of primitives used in the BVH. This effectively reduces construction costs while keeping high query performance.</p><p>Our experiments show that RayJoin has achieved 3.0x to 28.3x speedups over any existing polygon overlay solutions. It only takes about 460ms on the full-fledged polygon overlay analysis to join 1.6M polygons with 303K polygons, meeting real-time spatial join query requirements. In this paper, we have made a strong case for using RT Cores to accelerate various spatial queries in geodatabases, and our contributions are as follows:</p><p>&#8226; We have developed an effective approach that utilizes RT Cores to accelerate LSI and PIP queries, substantially outperforming conventional spatial join techniques on both CPUs and GPUs. To the best of our knowledge, RayJoin is the fastest spatial join solution ( &#167;3.1). &#8226; Within the existing GPUs that only support FP32 ray-tracing, we have developed a method to effectively achieve doubleprecision floating-point (FP64) while maintaining performance levels similar to the underlying hardware ( &#167;3.2). &#8226; We have proposed an algorithm that greatly reduces the BVH construction time and memory cost while keeping high query performance ( &#167;3.3). &#8226; We have comprehensively compared and analyzed RayJoin with many existing spatial join solutions by a consistent benchmark with synthetic and real-world datasets ( &#167;5), making a strong case for RT Cores to attain fast and precise spatial join. &#8226; RayJoin is an open-source software project for widespread public usage.</p><p>1 1 Source code: <ref type="url">https://github.com/pwrliang/RayJoin</ref> 2 BACKGROUND 2.1 Spatial Join Queries and Spatial Indexes Spatial join finds pairs of spatial objects (&#119903;, &#119904;) from two datasets &#119877; and &#119878; that satisfy a spatial predicate &#120579; , where &#119903; &#8712; &#119877;, &#119904; &#8712; &#119878;. Formally, &#119877; &#8882;&#8883; &#120579; &#119878; {(&#119903;, &#119904;) | &#119903; &#8712; &#119877;, &#119904; &#8712; &#119878;, &#120579; (&#119903;, &#119904;) = &#119905;&#119903;&#119906;&#119890;}</p><p>Spatial objects come in several forms, including points, line segments, polylines, and polygons. Predicates, such as intersect, contain, overlap, etc., can be combined to perform complex spatial queries. Figure <ref type="figure">1</ref> provides an example of joining two sets of polygons &#119877; and &#119878;, which is also called polygon overlay analysis<ref type="foot">foot_0</ref> . The polygon overlay is achieved by combining the results of LSI and PIP queries. The LSI query generates points at the intersections where the boundaries of datasets &#119877; and &#119878; meet. The PIP query determines the polygon &#119903; &#8712; &#119877; where a query point &#119904; &#8712; &#119878; lies and vice versa. The output from the above two steps is combined to obtain the resulting polygon (on the right of Figure <ref type="figure">1</ref>).</p><p>Performing spatial join queries on a large volume of complex geometries is challenging due to the high computational cost <ref type="bibr">[3]</ref>. Exhaustive search compares all pairs of geometries, resulting in &#119874; (|&#119877;| &#8226; |&#119878; |) time complexity, which is impractical for large datasets <ref type="bibr">[40]</ref>. Thus, spatial indexes are necessary to improve query performance by reducing search space. This section introduces two typical tree-based and grid-based spatial indexes, namely BVH and uniform grid. To clarify, in graphics, BVH is a general term to describe a hierarchical data structure that partitions geometries with bounding volume. For example, the R-tree can be categorized into BVH because it uses Minimum Bounding Rectangle (MBR) as the bounding volume <ref type="bibr">[19]</ref>.</p><p>BVH. The BVHs organize geometries in tree-like data structures. The internal node in the BVH is associated with an axis-aligned bounding box (AABB) <ref type="foot">3</ref> , which tightly encloses the union of the children's AABBs. Leaf nodes hold the primitives <ref type="foot">4</ref> , which can be points, lines, triangles, or any shapes. AABB approximates the primitive boundary, making the intersection test more efficient. Building a BVH can be efficiently converted into a GPU-friendly</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Sequential memory access in each cell</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Random memory access</head><p>Logical pointers sorting problem. For example, Linear Bounding Volume Hierarchy (LBVH) organizes geometries in the order of their Morton Code, balancing the construction time and index quality <ref type="bibr">[26]</ref>.</p><p>Figure <ref type="figure">2</ref> (a) depicts a query &#119876; &#119864; being performed on a scene composed of four triangles. Figure <ref type="figure">2</ref> (b) illustrates how to accelerate the query with the BVH. The query aims to identify all the triangles intersecting the query. This process begins by performing an intersection test between &#119876; &#119864; and the AABB &#119861; 1 of the root node. If an intersection is detected, further tests are conducted between &#119876; &#119864; and the nodes &#119873; 2 and &#119873; 3 . The search proceeds until we narrow it down to the leaf node &#119873; 2 . Then, we only need to execute the intersection test against the triangle bounded by &#119861; 2 . The right sub-tree rooted at &#119873; 3 can then be pruned as it does not intersect with the query.</p><p>Tree-based indexes offer logarithmic time complexity of searching, which is a key advantage. However, they can be memory inefficient. As shown in Figure <ref type="figure">2</ref> (b), tree nodes are stored in the "&#119873;&#119900;&#119889;&#119890;&#119904;" array. Logical pointers connect parent nodes to their child nodes. Each query starts by visiting the root node and following the logical pointers to its descendants, resulting in a Depth-First Search (DFS). The pointer-chasing during DFS leads to irregular memory access patterns and branches, preventing coalesced memory access optimization and resulting in cache line inefficiencies. As a result, numerous memory transactions are generated during the traversal. Determining the traversal path also results in branch divergence on the GPU, causing very few threads to be active in each warp.</p><p>Uniform Grid. The uniform grid is a spatial index that partitions the space into equal-sized cells. The width of the grid is called the resolution. Figure <ref type="figure">2</ref> (c) represents a 3x3 uniform grid built for the scene. The triangles are split into line segments for efficient retrieval. The grid data structure consists of "Cells" and "Line Segs" array. The "Cells" array maintains the beginning offsets that point to the positions in the "Line Segs" array, where the line segments within the cell are organized in a contiguous memory region.</p><p>Accelerating the query &#119876; &#119864; can be achieved through the uniform grid, as demonstrated in Figure <ref type="figure">2 (c</ref>). The cells that overlap with the query are collected (shaded with gray). Then, we scan line segments in the cells to determine whether they intersect with the query. With the grid, the line segments spatially distant from the query are pruned. The finer the grid, the more line segments are pruned, leading to improved query efficiency, but the construction cost is also increased due to duplicated line segments. Thus, the resolution of the grid must be carefully tuned.</p><p>The uniform grid offers several advantages, including GPUfriendly scan operations and a low construction cost. If geometries are distributed uniformly, the grid-based indexes should be ideal candidates for GPUs. Unfortunately, this is not true in real-world applications. For example, geometries representing buildings are dense in metropolitan cities but usually sparse in deserts, which forms an uneven grid, limiting the parallelism of GPUs and leading to load imbalance and low occupancy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Ray-Tracing and Its Programming Model</head><p>Ray tracing is a technique for rendering photorealistic images. The fundamental principle of ray tracing is identifying the primitives hit by rays, which is time-consuming, typically requiring an Accelerating Structure (AS) to reduce search space. Currently, BVH is the most widely utilized AS in ray tracing frameworks, as discussed in &#167;2.1. Despite their efficiency, tree-based indexes are not optimal for use on GPUs. As a result, researchers have proposed various hardware-based solutions to accelerate BVH traversal. For example, NVIDIA RTX series GPUs provide dedicated ray tracing units to accelerate both BVH traversal and ray-primitive intersection test <ref type="bibr">[8]</ref>. Several ray-tracing interfaces are available for the ray-tracing hardware, including NVIDIA OptiX, Vulkan, and DirectX Raytracing. This paper focuses on OptiX as it is interoperable with CUDA.</p><p>OptiX supports the "Multiple Instruction, Multiple Data" (MIMD) subset of CUDA <ref type="bibr">[50]</ref>. Each thread casts a ray, and the execution of the ray could be rescheduled at runtime to a different lane, warp, and even Streaming Multiprocessor (SM) for execution efficiency. Thus, shared memory and warp/block synchronization intrinsics are not available. A ray is allowed to carry per-ray data via registers to communicate to shaders. Shaders are user-defined CUDA programs to cast rays and handle ray-primitive intersection events.</p><p>Before casting rays, we have to build the BVH from an array of primitives. The BVH structure is hardware-specific and handled by the video driver. Thus, users have no control over the BVH construction process but only supply primitives. Constructing a BVH is expensive. The construction time is linearly correlated to the number of primitives <ref type="bibr">[83]</ref>. Upon completion of the BVH construction, a traversal handle is returned, which is used to access the BVH. The BVH traversal occurs on the RT Cores, which share the same device memory with the SMs.</p><p>Figure <ref type="figure">3</ref> illustrates the execution flow of OptiX (shapes shaded in grey are non-programmable procedures). OptiX does not allow explicit BVH traversal and instead begins by generating a set of rays. The entry point of a ray tracing program is the RayGen (RG) shader, where the traversal handle, origins, and directions of the rays are provided to generate rays and identify hits. The BVH is</p><p>SMs RT Cores TL 1 TL 2 RayGen (RG) Traversal Handle Ray primitive intersect? BVH Traversal (TL) IsIntersection (IS) AnyHit (AH) Found a hit? ClosestHit (CH) Miss (MS) CH IS RG Yes No Traversal completes Yes No Leaf node TL 3 AH Figure 3: Top: Switching between the RT Cores and SMs; Bottom: Execution flow of OptiX. autonomously traversed, guided by the rays. The traversal stage is non-programmatic and transparent to the user. Figure 3 top shows the execution flow of the query on Figure 2 (b), where &#119879; &#119871; &#119894; corresponds to the traversal of BVH node &#119873; &#119894; . If leaf nodes are reached during the traversal, the IsIntersection (IS) shader is invoked to determine if a ray intersects with a primitive. When the IS shader reports an intersection, the AnyHit (AH) shader is called, allowing the user to execute rendering tasks. At the end of the traversal, the ClosestHit (CH) shader is called, which returns the closest primitive hit by the ray. The Miss (MS) shader is invoked if the ray does not hit any primitive. In OptiX, the coordinates of primitives and ray parameters are defined in the FP32 for performance reasons.</p><p>This limitation makes implementing high-precision applications challenging. Thus, this issue must be addressed to accommodate broad applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">SPATIAL JOIN OPERATIONS ON RT CORES 3.1 From Spatial Join to Ray-tracing</head><p>Ray Definition. A ray R is a semi-infinite line characterized by its origin &#119874;, a direction vector &#236; &#119889;, and a parameter &#119905;. The parameter &#119905; allows the ray to extend indefinitely in the direction of &#236; &#119889; starting from &#119874;. Formally, it is defined in Equation (1):</p><p>When casting a ray, users are asked to provide a range parameter [&#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; ] to filter the intersections on a segment of the ray (Figure <ref type="figure">4 (a)</ref>). If the ray hits a primitive (such as an AABB), a hit event is reported only if &#119905; &#119898;&#119894;&#119899; &#8804; &#119905; &#8462;&#119894;&#119905; &#8804; &#119905; &#119898;&#119886;&#119909; , where &#119874; + &#119905; &#8462;&#119894;&#119905; &#8226; &#236; &#119889; is the intersection point.</p><p>An essential step in ray tracing is finding ray-primitive intersections. Interestingly, this process bears the similarity of spatial join problems. For example, a complex polygon overlay query can be decomposed into two sub-problems, LSI and PIP <ref type="bibr">[37]</ref>, which can then be formulated as two RT-friendly problems.</p><p>Casting Rays for LSI. Given a base map &#119877; and a query map &#119878;, LSI finds all the intersections from &#119877; &#8882;&#8883; &#119894;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905; &#119878;. This can be achieved by ray tracing. We first construct a BVH from an array of AABBs, where each AABB encloses a line segment from &#119877;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ray</head><p>Hit Point AABB line segment from &#119878; can be represented by a ray. Subsequently, we cast rays generated from &#119878; and collect all intersections (Figure 4 (b)). The following ray parameters &#119874;, &#236; &#119889;, &#119905; &#119898;&#119894;&#119899; , and &#119905; &#119898;&#119886;&#119909; defines a line segment &#119904; &#8712; &#119878;, where &#119901; 1 and &#119901; 2 are the two endpoints of &#119904;.</p><p>, which is another endpoint. Thus, we can see that any intersection point &#119901; &#119909; on &#119904; can be represented by</p><p>, where &#119905; &#119898;&#119894;&#119899; &#8804; &#119905; &#8804; &#119905; &#119898;&#119886;&#119909; , and the ray-AABB intersection test identifies these hit points. Algorithm 1 illustrates the above idea, which is orchestrated by OptiX's callbacks. RayGen shader is the entry of the algorithm that computes &#119874; and &#236; &#119889; using the endpoints. At Line 8, &#119900;&#119901;&#119905;&#119894;&#119909;&#119879; &#119903;&#119886;&#119888;&#119890; shoots a ray carrying the line segment ID (per-ray data) from &#119878; to find intersections. If the ray hits an AABB, a user-defined function IsIntersection is triggered to verify the intersection and calculate the coordinate of intersection point &#119901; &#119909; and the ray parameter &#119905; &#8462;&#119894;&#119905; . This step is necessary because a ray hitting an AABB does not guarantee the ray intersects the line segment enclosed by the AABB. If &#119901; &#119909; exists, we call &#119900;&#119901;&#119905;&#119894;&#119909;&#119877;&#119890;&#119901;&#119900;&#119903;&#119905;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899;, which reports the intersections and extra information with attributes to OptiX.</p><p>Upon the reported intersections, if &#119905; &#8462;&#119894;&#119905; falls within the interval [&#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; ], the AnyHit shader is automatically triggered. Remember that each line segment corresponds to an AABB, and we have verified the ray indeed intersects the line segment enclosed by the AABB. We can now safely put the intersection into result set &#119876;. At line 27, we invoke the &#119900;&#119901;&#119905;&#119894;&#119909;&#119868;&#119892;&#119899;&#119900;&#119903;&#119890;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; function to ask OptiX to continue searching.</p><p>Casting Rays for PIP. To begin with, we want to clarify that the classical PIP test determines whether a point lies inside or outside a polygon. This can be addressed with the crossing number algorithm <ref type="bibr">[60]</ref>. However, we are solving a more complex PIP problem here. Given a query point and a group of polygons, we want to know which polygon the point falls in. Specifically, this problem is also called "Point Location" in some literature <ref type="bibr">[56]</ref>. This operation is the building block of complex spatial queries like polygon overlay <ref type="bibr">[37]</ref>. Figure <ref type="figure">5</ref> presents an example polygon alongside its in-memory representation, which will be used to demonstrate how the PIP</p><p>Algorithm 1: Formulate LSI as a ray-tracing problem Input : &#119877; -a set of line segments from base map Input :&#119878; -a set of line segments from query map Input :&#8462; -the BVH traversal handle built from &#119877; Output :&#119876; -results of intersections 1 procedure RayGen // Entrypoint, invoked when the program starts 2 &#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; = {0, 1} 3 for each &#119904; in &#119878; do // For each line segment from &#119878; 4 &#119901; 1 , &#119901; 2 = &#119866;&#119890;&#119905;&#119864;&#119899;&#119889;&#119901;&#119900;&#119894;&#119899;&#119905;&#119904; (&#119904; ) // Get two endpoints from line segment &#119904; 5 &#119874; = &#119901; 1 // Ray origin 6 &#236; &#119889; = &#119901; 2 -&#119901; 1 // Ray direction 7 // Cast a ray carrying line segment ID of &#119904; 8 &#119900;&#119901;&#119905;&#119894;&#119909;&#119879; &#119903;&#119886;&#119888;&#119890; (&#8462;, &#119874;, &#236; &#119889;, &#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; , &#119901;&#119886;&#119910;&#119897;&#119900;&#119886;&#119889;&#119904; (&#119894;&#119889; &#119904; ) ) 9 end for 10 end procedure 11 12 procedure IsIntersection // Invoked when ray potentially hits an AABB 13 &#119874; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119877;&#119886;&#119910;&#119874;&#119903;&#119894;&#119892;&#119894;&#119899; ( ) // Ray origin 14 &#236; &#119889; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119877;&#119886;&#119910;&#119863;&#119894;&#119903;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; ( ) // Ray direction 15 &#119894;&#119889; &#119903; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119903;&#119894;&#119898;&#119894;&#119905;&#119894;&#119907;&#119890;&#119868;&#119899;&#119889;&#119890;&#119909; ( ) // Line segment id from &#119877; 16 &#119894;&#119889; &#119904; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119886;&#119910;&#119897;&#119900;&#119886;&#119889; 0 ( ) // Line segment id from &#119878; 17 &#119901; = &#119877; [&#119894;&#119889; &#119903; ] &#8882;&#8883; &#119894;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119904; &#119878; [&#119894;&#119889; &#119904; ] // Calculate intersection point 18 if &#119901; &#8800; &#120601; then 19 &#119905; &#8462;&#119894;&#119905; = &#119901;&#119909; .&#119909; -&#119874; .&#119909; &#236; &#119889; .&#119909; // Calculate &#119905; &#8462;&#119894;&#119905; with x-coordinate 20 &#119900;&#119901;&#119905;&#119894;&#119909;&#119877;&#119890;&#119901;&#119900;&#119903;&#119905;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; (&#119905; &#8462;&#119894;&#119905; , &#119886;&#119905;&#119905;&#119903;&#119894;&#119887;&#119906;&#119905;&#119890;&#119904; (&#119894;&#119889; &#119903; , &#119894;&#119889; &#119904; , &#119901; ) ) 21 end if 22 end procedure 23 24 procedure AnyHit // Invoked when intersections being reported 25 &#119894;&#119889; &#119903; , &#119894;&#119889; &#119904; , &#119901; &#119909; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119860;&#119905;&#119905;&#119903;&#119894;&#119887;&#119906;&#119905;&#119890; 0...2 ( ) // From optixReportIntersection 26 &#119876; = &#119876; &#8746; (&#119894;&#119889; &#119903; , &#119894;&#119889; &#119904; , &#119901; &#119909; ) // Collect the intersection 27 &#119900;&#119901;&#119905;&#119894;&#119909;&#119868;&#119892;&#119899;&#119900;&#119903;&#119890;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; ( ) // Ask OptiX to continue searching 28 end procedure Ray Closest-Hit p 4 p 5 f 2 f 1 f 0 f 0 p 1 p 2 p 3 p 6 p 7 p 8 p 9 p 10 p 11 Query Point (a) PIP query with ray-tracing (b) Chain representation x y O Chain Left Face Right Face Points C 1 f 0 f 1 p 3 , p 2 , p 1 , p 7 C 2 f 2 f 1 p 7 , p 3 C 3 f 2 f 0 p 3 , p 4 , p 5 , p 6 , p 7 C 4 f 0 f 2 p 8 , p 9 , p 10 , p 11 , p 8 p p x Figure 5: Formulating PIP as a ray tracing problem and the chain representation of the left polygon. Notably, &#119862; 4 is a hole, so it only involves two adjacent faces, resulting in a "closed" point sequence (the first point also appears as the last point). works with ray tracing. The point sequence (&#119901; 1 , &#119901; 2 , . . . , &#119901; 7 , &#119901; 1 ) consists of the polygon's outline. The line segment (&#119901; 7 , &#119901; 3 ) splits the polygon into two faces<ref type="foot">foot_3</ref> &#119891; 1 and &#119891; 2 . The diamond-shaped polygon (&#119901; 8 , &#119901; 9 , &#119901; 10 , &#119901; 11 , &#119901; 8 ) on the bottom is a hole with face &#119891; 0 . A chain is a polyline that belongs to the same adjacent faces. For instance, the polyline (&#119901; 3 , &#119901; 2 , &#119901; 1 , &#119901; 7 ) should be organized as a chain because the line segment (&#119901; 3 , &#119901; 2 ), (&#119901; 2 , &#119901; 1 ), and (&#119901; 1 , &#119901; 7 ) share the same left face &#119891; 0 , and right face &#119891; 1 . The order of points determines the adjacent faces of a polyline. For example, in chain &#119862; 2 , the left and right faces are &#119891; 2 and &#119891; 1 , respectively. If the point sequence is (&#119901; 3 , &#119901; 7 ), the left face is &#119891; 1 and the right face is &#119891; 2 . The chain representation saves memory by storing common boundaries only once and facilitates the PIP algorithm by providing neighborhood information.</p><p>The idea of PIP is to shoot a ray from the query point vertically upwards and to identify the closest line segment hit by the ray <ref type="bibr">[14,</ref><ref type="bibr">37]</ref>. This method works both on convex and non-convex polygons, even polygons with holes. The simulation of ray casting is traditionally achieved with grids or trees, which have many shortcomings, as mentioned in &#167;2. Since the PIP problem is essentially a ray tracing problem, solving it with RT Cores is natural. The algorithm works as follows. Figure <ref type="figure">5</ref> (a) shows a query &#119901;, locating which polygon the query falls in. We shoot a vertical ray from &#119901; upwards to identify the closest line segment hit by the ray, which is (&#119901; 6 , &#119901; 7 ) in chain &#119862; 3 . Then, we use the chain table (Figure <ref type="figure">5 (b)</ref>) to determine the face where the query point is located. Since the x-coordinate of point &#119901; 6 is greater than that of &#119901; 7 , we take the left face &#119891; 2 of chain &#119862; 3 , as the polygon ID.</p><p>Algorithm 2 describes the implementation of PIP with OptiX. Again, the &#119877;&#119886;&#119910;&#119866;&#119890;&#119899; shader is the entry point of the algorithm, starting from casting vertical rays from every point &#119901; &#8712; &#119878;. In IsIntersection shader, we calculate the distance &#119905; &#8462;&#119894;&#119905; between the query point &#119901; and the hit point &#119901; &#119909; , which is calculated by solving the common solution of line equations (Line 15). Since the ray may hit multiple line segments, we report each &#119905; &#8462;&#119894;&#119905; to OptiX by invoking optixRe-portIntersection, asking OptiX to find out the closest hit. When a ray hits the closest line segment, the &#119862;&#119897;&#119900;&#119904;&#119890;&#119904;&#119905;&#119867;&#119894;&#119905; shader will be called, where we determine which polygon the query falls in. optixGetPayload tells the ID of the query point &#119894;&#119889; &#119901; , and optixGetPrimitiveIndex returns the closest line segment ID &#119894;&#119889; &#119897; . Finally, we retrieve the line segment &#119897; and query point &#119901; by the IDs. To tell which polygon &#119901; falls in, we check the ray casting from the left or right face of &#119897;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">High Arithmetic Precision Beyond the Hardware Support</head><p>Ray-tracing frameworks like OptiX typically use FP32 for optimal performance <ref type="bibr">[28]</ref>. In real-time rendering, such as in video games, the difference in visual quality between lower and higher precision is generally imperceptible to the human eye. However, precision levels are critical in areas like spatial databases. For example, FP32 coordinates offer meter-level distance precision, while FP64 coordinates achieve nanometer-level precision <ref type="bibr">[25]</ref>, crucial for scientific simulations and engineering analysis. Without improving the precision, the applicability of ray-tracing technology may be severely limited. Precision challenges in graphics, such as self-intersection, and robustness of ray-primitive intersection test, have been wellstudied <ref type="bibr">[22,</ref><ref type="bibr">39]</ref>. However, the issue we address here stems from floating-point downcasting and differs from these well-known problems. In this section, we present a simple yet effective method to achieve exact query results for FP64 or higher precision data sources within the constraints of limited hardware support.</p><p>To tackle the precision challenges, it is crucial to understand the source of imprecision. Figure <ref type="figure">6</ref> represents floating-point numbers Algorithm 2: Formulate PIP as a Ray-Tracing Problem Input : &#119877; -a set of line segments from base map Input :&#119878; -a set of points from the query map Input :&#8462; -the BVH traversal handle built from R Output :&#119876; -Polygon IDs; &#119876; [&#119894; ] contains the polygon ID of query &#119878; [&#119894; ] 1 procedure RayGen // Entrypoint, invoked when program starts 2 &#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; = {0, &#8734;} 3 for each &#119901; in &#119878; do 4 &#119874; = &#119901; // Query point as origin 5 &#236; &#119889;.&#119909; &#119910;&#119911; = {0, 1, 0} // Towards upward of y-axis 6 &#119900;&#119901;&#119905;&#119894;&#119909;&#119879; &#119903;&#119886;&#119888;&#119890; (&#8462;, &#119874;, &#236; &#119889;, &#119905; &#119898;&#119894;&#119899; , &#119905; &#119898;&#119886;&#119909; , &#119901;&#119886;&#119910;&#119897;&#119900;&#119886;&#119889;&#119904; (&#119894;&#119889; &#119901; ) ) // Shoot a ray, which carries the point id from &#119878; 7 end for 8 end procedure 9 10 procedure IsIntersection // Invoked when ray potentially hits an AABB 11 &#119894;&#119889; &#119897; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119903;&#119894;&#119898;&#119894;&#119905;&#119894;&#119907;&#119890;&#119868;&#119899;&#119889;&#119890;&#119909; ( ) // Line segment id from &#119877; 12 &#119894;&#119889; &#119901; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119886;&#119910;&#119897;&#119900;&#119886;&#119889; 0 ( ) // Point id from &#119878; 13 &#119901; = &#119878; [&#119894;&#119889; &#119901; ] 14 &#119897; = &#119898;&#119886;&#119896;&#119890;_&#119907;&#119890;&#119903;&#119905;&#119894;&#119888;&#119886;&#119897;_&#119897;&#119894;&#119899;&#119890; (&#119901; ) // Generate a vertical line segment crossing &#119901; 15 &#119901; &#119909; = &#119877; [&#119894;&#119889; &#119897; ] &#8882;&#8883; &#119894;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119904; &#119897; // Intersection point of the line segment from R and &#119897; 16 if &#119901; &#119909; &#8800; &#120601; then 17 &#119905; &#8462;&#119894;&#119905; = &#119901; &#119909; .&#119910; -&#119901;.&#119910; // Distance between the ray origin and the hit point 18 &#119900;&#119901;&#119905;&#119894;&#119909;&#119877;&#119890;&#119901;&#119900;&#119903;&#119905;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; (&#119905; &#8462;&#119894;&#119905; ) // Report intersection to OptiX 19 end if 20 end procedure 21 22 procedure ClosestHit // Invoked when a ray hit the closest line segment 23 &#119894;&#119889; &#119897; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119903;&#119894;&#119898;&#119894;&#119905;&#119894;&#119907;&#119890;&#119868;&#119899;&#119889;&#119890;&#119909; ( ) // Line segment id from &#119877; 24 &#119894;&#119889; &#119901; = &#119900;&#119901;&#119905;&#119894;&#119909;&#119866;&#119890;&#119905;&#119875;&#119886;&#119910;&#119897;&#119900;&#119886;&#119889; 0 ( ) // Query point id from &#119878; 25 &#119897; = &#119877; [&#119894;&#119889; &#119897; ] 26 &#119901; 1 , &#119901; 2 = &#119866;&#119890;&#119905;&#119864;&#119899;&#119889;&#119901;&#119900;&#119894;&#119899;&#119905;&#119904; (&#119897; ) 27 if &#119901; 1 .&#119909; &lt; &#119901; 2 .&#119909; then // Check the line segment direction 28 &#119876; [&#119894;&#119889; &#119901; ] = &#119866;&#119890;&#119905;&#119877;&#119894;&#119892;&#8462;&#119905; &#119865;&#119886;&#119888;&#119890; (&#119897; ) 29 else 30 &#119876; [&#119894;&#119889; &#119901; ] = &#119866;&#119890;&#119905;&#119871;&#119890; &#119891; &#119905; &#119865;&#119886;&#119888;&#119890; (&#119897; ) 31</p><p>end if 32 end procedure 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 0 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 sign exponent mantissa sign exponent mantissa fp16 fp32 left-over bits Figure <ref type="figure">6</ref>: Downcasting a 32-bit floating-point to 16-bit. in binary format <ref type="foot">6</ref> , consisting of three components: sign, exponent, and <ref type="bibr">[1]</ref>. When downcasting, the sign and exponent bits are preserved, ensuring no loss of information. However, the mantissa undergoes truncation. Based on the default rounding rules, known as "roundTiesToEven" <ref type="bibr">[1]</ref>, the mantissa bits of the low-precision type are left as-is if the discarded bits are below 10 . . . 0, while they are incremented by one if the discarded bits are above 10 . . . 0. In the case of the discarded bits being exactly 10 &#8226; &#8226; &#8226; 0, one of the rounding rules is selected that makes the least significant bit of the mantissa bits zero. Consequently, any downcasted floating-point number can have a value that is less than, greater than, or equal to the value represented by the higher precision type. The process of downcasting high-precision data sources to lowprecision primitives and rays introduces geometric inconsistencies due to the loss of mantissa bits. Figure <ref type="figure">7</ref> shows cases that a line segment from dataset &#119877; intersects two rays casting from dataset &#119878;, resulting in two intersections at x-coordinate &#119909; 1 and &#119909; 2 where 0 &lt; &#119909; 1 &lt; &#119909; 2 . For the sake of simplicity, let's assume that the precision issue only affects the x-axis when constructing the AABB. In Case 1, &#119909; 1 is rounded up, causing the left boundary of the AABB to fail to enclose the left endpoint of the line segment, leading to a false negative outcome. In Case 2, rounding down of &#119909; 2 causes &#119903;&#119886;&#119910; 2 to miss the AABB, as the right endpoint falls outside the box. In Case 3, neither &#119903;&#119886;&#119910; 1 nor &#119903;&#119886;&#119910; 2 intersects the bounding box, resulting in the failure to detect any intersections. These examples highlight the hidden issues that can arise when employing ray-tracing techniques in GIS applications.</p><p>1 0 0 1 0 0 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 1 1 0 0 1 0 0 1 0 0 0 An intuitive solution is to create slightly larger AABBs, which can accommodate the line segment after downcasting. However, determining the appropriate size of the enlarged AABB is not trivial. Simply enlarging the AABBs by a small epsilon may still lead to intersections missing, while excessively enlarging them can result in many false positive hits, hurting performance. Formally, let &#119909; &#119897; denote the x-coordinate of the line segment's left endpoint in high precision. We need to find the maximum value expressible in FP32 as the downcasted left AABB boundary &#119909; &#8242; &#119897; , satisfying &#119909; &#8242; &#119897; &#8804; &#119909; &#119897; . Similarly, &#119909; &#119903; represents the right endpoint in high precision. We aim to find the minimum value &#119909; &#8242; &#119903; as the downcasted right boundary, ensuring &#119909; &#8242; &#119903; &#8805; &#119909; &#119903; . The same consideration applies to the y-axis. The solution is to create the low-precision AABB that is just barely large enough to fully enclose the high-precision line segment by tweaking mantissa bits. In the scenario depicted in Figure <ref type="figure">8</ref>, &#119909; &#119903; is the x-coordinate of the line segment's right endpoint in high precision. Upon downcasting, this value is rounded down, resulting in &#119909; &#119889; &#119903; , which is less than &#119909; &#119903; . To determine the smallest possible value &#119909; &#8242; &#119903; as the new right boundary, we only need to add 1 to the mantissa bits after downcasting. This ensures that &#119909; &#8242; &#119903; just exceeds &#119909; &#119903; . Additionally, to compensate for errors caused by downcasting of the ray origin, we add 1 again to the mantissa bits. To sum up, we add two to the mantissa bits of the right boundary and subtract two from the mantissa bits of the left boundary. Adjusting the mantissa bits may not always be necessary, such as if the &#119909; &#119903; is rounding up. However, we perform this operation unconditionally to avoid checking all the rounding cases, minimize branches, and enhance execution efficiency on GPUs. Hence, we refer to this approach as Conservative Representation (CR).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Taming High BVH Construction Overhead</head><p>The number of AABBs used to build the BVH equals the line segments in the dataset because each line segment is bounded by an individual AABB ( &#167;3.1). This bounding strategy introduces a long construction time and huge memory consumption 7 because the buildup cost is linearly correlated to the number of primitives <ref type="bibr">[83]</ref>. One may impulsively use the Ramer-Douglas-Peucker algorithm to simplify polylines and then create an AABB for each simplified polyline <ref type="bibr">[54]</ref>. Yet, this algorithm's recursive and sequential nature renders it inefficient for GPU execution. Therefore, we introduce a GPU-optimized method, termed Adaptive Grouping (AG), to group spatially close line segments into the same AABB.</p><p>We first introduce the grouping strategy. For two AABBs &#119861; &#119894; and &#119861; &#119895; , the union operator &#8746; is defined as &#119861; = &#119861; &#119894; &#8746; &#119861; &#119895; , where &#119861; is their MBR, and |&#119861;| represents its area. Considering &#119899; line segments &#119871; 1 , &#119871; 2 , . . . , &#119871; &#119899; each enclosed by AABBs &#119861; 1 , &#119861; 2 , . . . , &#119861; &#119899; , we only merge adjacent AABBs in the sequence, such as &#119861; &#119894; , &#119861; &#119894;+1 , . . . , &#119861; &#119895; (where 1 &#8804; &#119894; &lt; &#119895; &#8804; &#119899;). This facilitates representing a line segment subset with a simple range &#119877; [&#119894;, &#119895;] instead of explicitly storing each segment's ID. With the range &#119877;, we linearly scan the line segments bounded by &#119861; to find which line segment is hit by the ray. Merging &#119861; &#119894; and &#119861; &#119895; occurs only if the area expansion ratio</p><p>where &#119904; is the merging threshold. We will discuss how the parameter &#119904; impacts performance and how to find an optimal &#119904; in &#167;5.5. This approach is inspired by R-Tree's heuristic insertion algorithm <ref type="bibr">[17]</ref>, which effectively curtails the occurrence of false positives by limiting AABB's dead space.</p><p>Figure <ref type="figure">9</ref> demonstrates AG's execution. Initially, each line segment (&#119871; 0 to &#119871; 7 ) is in a unique group (&#119866; 0 to &#119866; 7 ), with a merging threshold &#119904; = 1.5. In the first iteration (Figure <ref type="figure">9</ref> (b)), &#119871; 0 and &#119871; 1 form &#119866; 0 as their area expansion ratio (1.4) is below &#119904;. However, &#119871; 2 and &#119871; 3 remain separate since</p><p>The second iteration includes &#119871; 2 in &#119866; 0 and &#119871; 6 in &#119866; 2 , but excludes &#119871; 7 from &#119866; 2 due to a ratio of 1.8, above the threshold. The final iteration merges &#119871; 3 into &#119866; 0 , resulting in three groups: &#119866; 0 with &#119861; 0 and range &#119877; 0 [0, 3], &#119866; 1 with &#119861; 1 and range &#119877; 1 <ref type="bibr">[4,</ref><ref type="bibr">6]</ref>, and &#119866; 2 with &#119861; 2 and range &#119877; 2 <ref type="bibr">[7,</ref><ref type="bibr">7]</ref>.</p><p>AG can be efficiently implemented on the GPU. We divide &#119871; segments into &#8968; &#119871; &#119887; &#8969; partitions, where &#119887; is the thread block size. 7 One may argue that building the BVH is a one-off cost. A fast buildup allows users to work on exploratory and interactively workloads, e.g., try out different datasets and run some ad-hoc queries quickly.</p><p>TB 1 TB 2 (a) Initial Line Segments (b) Iteration 1 (c) Iteration 2</p><p>2 TB 1 TB 2 TB 1 TB 2 Figure 9: Execution process of Adaptive Grouping. Initially, a thread block (TB) processes a fixed number of line segments. Spatially close line segments are iteratively put together. Each thread block processes partition &#119894; containing line segments &#119871; &#119894; * &#119887; , &#119871; &#119894; * &#119887;+1 , . . . , &#119871; (&#119894;+1) * &#119887; . In the first iteration, thread &#119895; in thread block &#119894; attempts to merge two adjacent AABBs &#119861; &#119894; * &#119887;+&#119895; and &#119861; &#119894; * &#119887;+&#119895;+1 , which encloses line segments &#119871; &#119894; * &#119887;+&#119895; and &#119871; &#119894; * &#119887;+&#119895;+1 respectively. The thread either produces a merged AABB &#119861; &#119894; * &#119887;+&#119895; &#8746; &#119861; &#119894; * &#119887;+&#119895;+1 and a range &#119877; [&#119894; * &#119887; + &#119895;, &#119894; * &#119887; + &#119895; + 1] if the area expanding ratio is within the threshold &#119904;. Otherwise, the AABBs remain unaltered. Subsequent iterations merge two adjacent AABBs from the previous iteration, continuing until no further merges are possible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RAYJOIN DEPLOYMENT IN PRACTICE 4.1 Implementation Details</head><p>Data Storage. The dataset is stored in a planar graph format on host memory and is subsequently decomposed into line segments on device memory for further processing. Each line segment references endpoint IDs instead of storing the coordinates to conserve memory and improve the locality. We pre-compute line equation coefficients from the endpoints. This pre-calculation enables the LSI and PIP queries to access coefficients directly, avoiding redundant calculations. The line segments are then fed into the AG to create AABBs and ranges, followed by applying CR to determine the new boundaries of the AABBs, ensuring precise results.</p><p>Coordinates Representation. We employ the fixed-point notation to resolve the accumulated errors from the floating-point arithmetic operations. Given that GPUs do not possess built-in support for fixed-point arithmetic, we convert the floating-point numbers to integers through scaling. This allows all subsequent operations to be performed exclusively on integers, effectively circumventing the issue of precision loss. Scaling to integers also benefits performance because NVIDIA RTX GPUs tend to have very limited FP64 units. Furthermore, we store the intersection points with rational numbers produced by LSI, thereby eliminating the need for fractional coordinates.</p><p>Degenerate Cases Handling. In large-scale datasets, degenerate cases such as overlapped line segments for LSI or points on line segments for PIP may occur. These cases must be resolved for correctness. A well-known method for handling degenerate cases is Simulation of Simplicity (SOS) <ref type="bibr">[9]</ref>. SOS eliminates all degeneracies by simulating the addition of an infinitesimal value to the coordinates. We employ SOS to handle degenerate cases.</p><p>Eliminating Invocation Overhead. Algorithm 1 and 2 conceptually introduce the implementation of RayJoin, where we utilize AnyHit and ClosestHit shaders to collect results. However, employing these shaders may introduce invocation overhead. To mitigate this, both AnyHit and ClosestHit shaders are disabled. We combine the logic of intersection testing and result collection into a single &#119868;&#119904;&#119868;&#119899;&#119905;&#119890;&#119903;&#119904;&#119890;&#119888;&#119905;&#119894;&#119900;&#119899; shader, thereby eliminating the overhead <ref type="bibr">[46]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">RT Cores Acceleration Enabled Applications</head><p>We also discovered a subset of &#119878;&#119879; functions in geodatabases that may benefit from RT Cores <ref type="bibr">[21]</ref>. We only present the ideas of translating them into ray tracing problems. Implementing these operators with RayJoin in geo-database systems is beyond the scope of this paper, and we leave it as future work.</p><p>ST_Touches. Given two set of geometric objects &#119860; and &#119861;, the function &#119878;&#119879; _&#119879;&#119900;&#119906;&#119888;&#8462;&#119890;&#119904; returns true if &#119862;&#119900;&#119899;&#119889;&#119894;&#119905;&#119894;&#119900;&#119899; 1: &#119860; and &#119861; have at least on common point, and &#119862;&#119900;&#119899;&#119889;&#119894;&#119905;&#119894;&#119900;&#119899; 2: the common points cannot lie in the interior of &#119860; and &#119861;. Supposing &#119860; are polygons, we build a BVH from &#119860;. If &#119861; are points, we cast very short rays as approximations of the points to search hits. If &#119861; are polygons or polylines, we use Algorithm 1. This fetches all the potential touches, which can be filtered respected with Condition 1. The Condition 2 can be verified with Algorithm 2.</p><p>ST_Intersects. This function returns true if the intersection of geometry &#119860; and &#119861; is not an empty set. We illustrate the idea due to limited space: a complex geometric object can always be split into primitives (point, line segment). We build a BVH with the primitives decomposed from &#119860;. Then we shoot rays depending on the type of primitives from &#119861; (a very short ray as a point or a long ray as a line segment). This allows us to collect potentially intersected geometries. Then, we filter the results that satisfy the intersection condition.</p><p>ST_Intersection. This function returns the intersected geometry from set &#119860; and &#119861;. This is one of the most complicated spatial functions and is very compute-intensive. Yet, it benefits from RayJoin. Assuming &#119860; and &#119861; are both polygons, which is the most complicated case, we (1) build two BVHs from &#119860; and &#119861;, (2) run the Algorithm 1 to find all intersections, (3) run Algorithm 2 to determine which polygon the points fall in, and (4) create the resulting polygon by connecting the points of intersection and the points that lie in the interior of polygons. This process is exactly the polygon overlay analysis, which is evaluated in &#167;5.7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">EVALUATION 5.1 Evaluation Settings</head><p>Baselines. Table <ref type="table">1</ref> lists the artifacts we have evaluated. We implemented the uniform grid and LBVH-based solutions on the GPU, which support all the queries <ref type="bibr">[26]</ref>. cuSpatial is a GPU-accelerated geospatial analytical platform from NVIDIA <ref type="bibr">[2]</ref>. RasterJoin is a raster-based solution on the GPU for spatial aggregation <ref type="bibr">[79]</ref>. PSSL is a state-of-the-art plane-sweep-based algorithm for LSI on GPUs <ref type="bibr">[15]</ref>. To the best of our knowledge, GLIN is the only learned spatial index that supports indexing geometries with extends <ref type="bibr">[68]</ref>. With OpenMP, we use GLIN to implement a parallel LSI query. EPUG-Overlay is a parallel polygon overlay program with a two-level uniform grid <ref type="bibr">[37]</ref>. Kinetica and PostGIS are geospatial databases <ref type="bibr">[20]</ref>. RasterIntervals uses a raster-based method to accelerate polygon overlay query <ref type="bibr">[16]</ref>.</p><p>Datasets. We collected a diverse set of real-world datasets shown in Table <ref type="table">2</ref>, including US-based maps such as County, Zipcode, Block, and Water from ArcGIS Hub <ref type="bibr">[11]</ref> and global maps, such as Lakes and Parks from OpenStreetMap (OSM) <ref type="bibr">[10]</ref>. We partitioned the OSM datasets based on continents/regions. The spatial join operations were exclusively performed within datasets originating from the same geographic region. &#119877; &#8882;&#8883; &#119878; represents the join operation, where &#119877; a base map and &#119878; is a query map. The spatial index is built from &#119877;, and queries are generated from &#119878;. For the scalability experiments, we use Spider to generate synthetic datasets <ref type="bibr">[27]</ref>. The polygon size and segment length are set to 0.001 and 10, respectively. The Gaussian distribution is set to &#120583; = 0.5, &#120590; = 0.1.</p><p>Parameters. Grid-based solutions need a predefined resolution, and trying all the possible resolutions to get the best is prohibitively expensive. By trials, we found that 15000x15000 for the uniform grid yields low running time for all datasets. For EPUG-Overlay, the first-level grid is set to 4000x4000, and the second-level grid is set to 40x40, a typical value used in <ref type="bibr">[37]</ref>. RayJoin only needs one parameter, the merging threshold &#119904;, in the adaptive grouping algorithm, which is fixed to 3.5. We discuss how to find an optimal parameter in &#167;5.5.</p><p>Environments. RayJoin is implemented in OptiX 8.0 and CUDA 12.3. We evaluate GPU-based solutions on a workstation with Intel Core i9-12900, NVIDIA RTX 3090, and 64 GB memory. cuSpatial version is 23.12.00. For the CPU-based solutions, we employed them on AWS EC2 using an r5.4xlarge instance with 16 vCPUs and 128 GB of RAM. The version of PostGIS is PostGIS-3.3, and Kinetica is v7.1.9.10-ga1. Note that Kinetica has GPU acceleration but it does not support our GPU model and requires a license purchase for self-hosted installation, so we have to evaluate its CPU version on AWS Marketplace Platform.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">LSI Performance</head><p>Table <ref type="table">3</ref> top shows the running time of LSI queries. For the conventional spatial indexes, LBVH is consistently better than the uniform grid. Our analysis attributes the grid-based method's inferior performance to skew distribution issues. For example, in the &#119862;&#119900;&#119906;&#119899;&#119905;&#119910; &#8882;&#8883; &#119885;&#119894;&#119901;&#119888;&#119900;&#119889;&#119890;, almost 100% of cells have workloads (number of intersection tests) under 223K per cell, accounting for 70.97% of total workloads. Yet, about 30% of the workloads are concentrated in just 90 cells (the total number of cells is 225M). PSSL's performance sits between the grid-based and tree-based methods, with its limited parallelism likely hindering the best utilization of the GPU. Notably, GLIN emerges as the least efficient method, with the majority of its processing time devoted to the query stage.</p><p>RayJoin demonstrates superior performance across all baselines. Focusing solely on processing time, RayJoin achieves speedups   <ref type="figure">10</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">PIP Performance</head><p>The bottom half of Table <ref type="table">3</ref> reveals that the uniform grid surpasses LBVH for large datasets, although its performance is hindered by skewed data. The grid's advantage lies in its ability to partition space, allowing for the termination of searches as soon as the closest line segment to a query point is located. In contrast, LBVH requires examining all intersecting line segments to identify the closest one.</p><p>RasterJoin, however, shows underperformance compared to both grid and LBVH and runs time out on &#119871;&#119870;&#119873; &#119860; &#8882;&#8883; &#119875;&#119870;&#119873; &#119860; in the polygon triangulation stage. For instance, in the &#119862;&#119900;&#119906;&#119899;&#119905;&#119910; &#8882;&#8883; &#119885;&#119894;&#119901;&#119888;&#119900;&#119889;&#119890; dataset, RasterJoin spent approximately 6878ms on point rendering and 5371ms on polygon rendering, a likely consequence of rendering excessive primitives in large datasets. Notably, NVIDIA's spatial library fell short of the LBVH and ran out-of-memory (OOM) on the &#119871;&#119870;&#119864;&#119880; &#8882;&#8883; &#119875;&#119870;&#119864;&#119880; dataset. Co un ty Zi pc od e Bl oc k W at er LK AF PK AF LK AS PK AS LK AU PK AU LK EU PK EU LK NA PK NA LK SA PK SA 10 1 10 0 10 1 10 2 10 3 10 4 10 5 Speedup (log) OOM Timeout Uniform Grid LBVH RasterJoin cuSpatial PIP is an RT-favorable workload because only the closest hit needs to be found. This property allows us to leverage the ClosestHit shader supported by the ray-tracing hardware to reduce the amount of work instead of traversing all the geometries. If considering the processing time only (ray-tracing), RayJoin exhibited from 4.8x to 74.2x speedups over the best of baselines. As shown in Figure <ref type="figure">11</ref>, by considering the total running time, RayJoin achieved from 2.1x to 22.2x speeds over the best performance of baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Evaluation of Precision</head><p>The conservative representation ensures results identical to those of the program using FP64 or higher. To validate its precision, we compared the results of RayJoin to the grid implementation in FP64 on LKNA &#8882;&#8883; PKNA. The LSI error rate is defined by</p><p>, and the PIP error rate is defined by</p><p>. Executing LSI and PIP queries with OptiX's default FP32 precision resulted in 745 intersections missing (out of 1,251,343) for LSI and 524 incorrect responses (out of 69,273,661) for PIP, yielding error rates of 0.0595% and 0.0008%, respectively. When using the conservative representation to create the AABBs, both LSI and PIP produce </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>PIP</head><p>completely the same answers compared to the FP64 program, and we have verified the correctness of all datasets. These experiments confirm the analysis in &#167;3.2, and our solution eliminates all the false negative cases due to the precision loss.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Effectiveness of Adaptive Grouping and</head><p>Parameter Tuning Figure 12 (a) illustrates the variation in BVH construction time, query time, and AG overhead when joining the LKNA with PKNA dataset (Table <ref type="table">2</ref>) as parameter &#119904; changes. When &#119904; is set to 1, AG is effectively disabled because two line segments are allowed to be put together only if the area of the merged bounding box does not grow, leading to high query performance but also high BVH construction time and memory usage -166ms and 2403MB, respectively (Figure <ref type="figure">12 (b-c</ref>)). Here, LSI and PIP queries take just 14ms and 18ms.</p><p>As &#119904; increases, construction costs drop significantly, but query times for both LSI and PIP begin to rise. At &#119904; = 3.5, AG achieves a 29% reduction in total execution time (147ms) and drastically lowers memory usage to about 408MB, which is just 17% of the usage without AG. Beyond &#119904; = 3.5, total time starts to increase, reaching 155ms at &#119904; = 5.0. This rise is attributed to AG merging spatially distant bounding boxes, causing a higher false positive rate and more intersection tests.</p><p>Figure <ref type="figure">12</ref> (b) demonstrates a linear correlation between construction and query times with &#119904;, particularly in the range of 1.5 &#8804; &#119904; &#8804; 4. This trend is consistent across various datasets. Optimal parameter selection could be determined through linear regression modeling of the merging threshold-execution time relationship, though this process is omitted here due to space constraints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Scalability</head><p>We generated synthetic datasets with uniform and Gaussian distributions. The BVH is built from dataset &#119877; with 5M polygons. A series of datasets &#119878;, ranging from 1M to 5M polygons, were generated to evaluate RayJoin's scalability. To better understand the scalability of RayJoin, adaptive grouping was disabled, and BVH construction time was not included; only query time was reported.</p><p>1M 2M 3M 4M 5M Number of Polygons 0 5 10 15 Query Time (ms) (a) Uniform -Query Time LSI -Query Time 1M 2M 3M 4M 5M Number of Polygons 0 10 20 30 40 Query Time (ms) (b) Gaussian -Query Time LSI -Query Time 1M 2M 3M 4M 5M Number of Polygons 0 10 20 30 40 M Intersects/s (c) Uniform -Throughput LSI -Throughput 1M 2M 3M 4M 5M Number of Polygons 0 25 50 75 100 125 M Intersects/s (d) Gaussian -Throughput LSI -Throughput Figure <ref type="figure">13 (a)-(d)</ref> shows LSI query time and throughput for increasing polygon numbers. Under uniform distribution, 1M polygons result in a 2.9ms query time and 62K intersections. At 3M polygons, the time rises to 6.9ms with 185K intersections, indicating RayJoin's linear scalability, aligning well with the time complexity of &#119874; (|&#119878; |&#119897;&#119900;&#119892;|&#119877;|). Gaussian distributions yield similar patterns but with longer times due to more intersections. When the number of queries increases from 1M to 2M polygons, LSI throughput stabilizes at 29M intersections/s. In the Gaussian dataset, throughput peaks at 99M intersections/s.</p><p>For 1M uniform polygons, 8M PIP queries are issued, taking about 5.9ms (Figure <ref type="figure">14 (a)</ref>). At 2M polygons and 16M queries, the time increases to 12.1ms. In Gaussian distributions, query times are slightly longer due to more ray-primitive intersections. RayJoin</p><p>1M 2M 3M 4M 5M Number of Polygons 0 10 20 30 40 Query Time (ms) (a) Uniform -Query Time PIP -Query Time 1M 2M 3M 4M 5M Number of Polygons 0 10 20 30 40 50 Query Time (ms) (b) Gaussian -Query Time PIP -Query Time 1M 2M 3M 4M 5M Number of Polygons 0.0 0.5 1.0 1.5 G Points/s (c) Uniform -Throughput PIP -Throughput 1M 2M 3M 4M 5M Number of Polygons 0.0 0.5 1.0 1.5 2.0 G Points/s (d) Gaussian -Throughput PIP -Throughput shows linear scalability with query numbers. Figure <ref type="figure">14 (c)</ref>, <ref type="figure">(d</ref>) indicates a peak PIP throughput of around 1.5G points/s for both distributions, affirming RayJoin is highly scalable as throughput remains consistent despite increased queries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">Polygon Overlay</head><p>The polygon overlay query identifies and outputs the overlapped area of intersected polygons from two maps, which can be achieved by combining the LSI and PIP query results ( &#167;2.1). The overlay processing is extremely time-consuming due to the complexity of the overlay algorithm and the irregularity of the execution.</p><p>Co un ty Zi pc od e Bl oc k W at er LK AF PK AF LK AS PK AS LK AU PK AU LK EU PK EU LK NA PK NA LK SA PK SA 10 1 10 0 10 1 10 2 10 3 10 4 10 5 Speedup (log) OOM PostGIS Kinetica EPUG-Overlay RasterIntervals Uniform Grid LBVH Table <ref type="table">4</ref> shows the polygon overlay running time. It is evident that none of the existing solutions other than RayJoin take under one second for all datasets, even the highly optimized GPU solutions (uniform grid and LBVH). The longest execution time of PostGIS is about 234s on the &#119861;&#119897;&#119900;&#119888;&#119896; &#8882;&#8883; &#119882; &#119886;&#119905;&#119890;&#119903; dataset. Kinetica is faster than PostGIS because it is memory-based, only taking about 53s. EPUG-Overlay typically takes more than hundreds of seconds for fairly large datasets. Although RasterIntervals is fast at the processing stage, it can spend thousands of seconds on preprocessing. The uniform grid takes up to 9.4s for processing &#119871;&#119870;&#119864;&#119880; &#8882;&#8883; &#119875;&#119870;&#119864;&#119880; , and LBVH even runs OOM on this dataset. LBVH takes up to 58.4s to process the &#119871;&#119870;&#119873; &#119860; &#8882;&#8883; &#119875;&#119870;&#119873; &#119860; dataset due to irregular random memory access patterns of tree-based structure and the inability to terminate the PIP search early.</p><p>RayJoin outperformed all the state-of-the-art solutions on both CPUs and GPUs. By leveraging the RT Cores, RayJoin efficiently transforms the polygon overlay problem into two RT-friendly problems. The RT Cores overcome the long memory access latency and irregular memory access patterns while benefiting from the logarithmic complexity of tree-based indexes, resulting in extremely low query time. As summarized in Figure <ref type="figure">15</ref>, RayJoin achieved from 3.0x to 28.3x speed up over the best of the baselines.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">RELATED WORK</head><p>Spatial Indexing on Advanced Hardware. Geometric Performance Primitives (GPP) is a robust polygon overlay library on GPUs with a uniform grid <ref type="bibr">[6]</ref>. Liu et al. proposed a filtering technique to accelerate spatial join with fine-grained tiles <ref type="bibr">[33,</ref><ref type="bibr">34]</ref>. Rayhan et al. designed an R-tree that leverages SIMD instructions <ref type="bibr">[55]</ref>. SwiftSpatial is a spatial join accelerator on FPGAs <ref type="bibr">[23]</ref>. Note that we target the same problem but with different hardware. In short, RayJoin is an index-based solution uniquely exploiting ray-tracing hardware's capability to deliver real-time spatial join.</p><p>Rasterization-based Techniques. RasterJoin is a rasterizationbased method for spatial aggregation by leveraging the rendering pipeline <ref type="bibr">[79]</ref>. RasterIntervals is the state-of-the-art polygon overlay method with the approximation technique to reduce the refinement cost <ref type="bibr">[16]</ref>. PixelBox employs a rasterization-based method to compute the Jaccard Similarity <ref type="bibr">[70]</ref>. IDEAL is a vector-raster hybrid method for PIP and point-to-polygon distance calculation <ref type="bibr">[65]</ref>. Compared to rasterization-based methods, RayJoin does not need any preprocessing steps to build up rasterized representations.</p><p>Parallel Plane Sweep Algorithms. McKenney et al. proposed a parallel plane sweep algorithm on CPUs <ref type="bibr">[41]</ref> that dynamically segments the input to spatial operations and executes them in parallel. PSSL is the plane sweep-based solution for the LSI query on the GPU <ref type="bibr">[15]</ref>. The plane sweep algorithms are challenging to parallelize, as demonstrated by our evaluations, which clearly indicate that they exhibit significantly lower performance than other solutions.</p><p>Accelerating Non-graphics Workload with RT Cores. An early work on repurposing RT Cores, proposed by Salmon et al., involves accelerating Monte Carlo particle transport simulations <ref type="bibr">[57]</ref>. RTNN adopted spatial reordering and query partitioning optimizations for K nearest neighbor (KNN). TrueKNN supports arbitrary radius KNN-search on RT Cores <ref type="bibr">[46]</ref>. JUNO is the first work supporting high-dimensional KNN search on RT Cores <ref type="bibr">[35]</ref>. A method presented in <ref type="bibr">[38]</ref> supports non-Euclidean distance measures for KNN. A method proposed in <ref type="bibr">[30]</ref> accelerates the PIP test by transforming polygons into 3D representations, which is different from our method and has accuracy limitations. Accelerating Tet-Mesh Point Location with RT Cores is discussed in <ref type="bibr">[67]</ref>. Literature <ref type="bibr">[24]</ref> explores using RT Cores to serve as a B-Tree. RTScan leverages RT Cores to accelerate index scans <ref type="bibr">[36]</ref>. Accelerating Range Minimum Queries with RT Cores is discussed in <ref type="bibr">[42]</ref>. None of these works address the spatial join problem that our paper targets, which presents unique challenges such as problem formulation, precision preservation, and performance optimization. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>To understand dataset R, considering the provinces/states (the three simple polygons) share the common boundaries forms into a country (the whole polygon).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>AABB is a synonym of MBR in this paper.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>We use the word "primitive" interchangeably with "geometry." In the context of BVH, we prefer the word "primitive" because they are conventionally used together in computer graphics.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>The term "face" defines a planar region enclosed by the boundaries. The hole and region that is not enclosed are defined as the exterior face &#119891; 0 .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>For ease of representation, we demonstrate the process of downcasting from 32 bits to 16 bits instead of 64 bits to 32 bits.</p></note>
		</body>
		</text>
</TEI>
